Kimtaro / ve

A linguistic framework that's easy to use.
MIT License

Parsing issues #23

Open vietqhoang opened 9 years ago

vietqhoang commented 9 years ago

Case 1

Actual:

string = 'おつまみ'

words  = Ve.in(:ja).words(string).map(&:word)
 => ["お", "つまみ"] 
parts_of_speeches = Ve.in(:ja).words(string).map(&:part_of_speech)
 => [Ve::PartOfSpeech::Prefix, Ve::PartOfSpeech::Verb] 

Expected:

words  = Ve.in(:ja).words(string).map(&:word)
 => ["おつまみ"] 
parts_of_speeches = Ve.in(:ja).words(string).map(&:part_of_speech)
 => [Ve::PartOfSpeech::Noun] 

Case 2

Actual:

string = 'やぐら'

words  = Ve.in(:ja).words(string).map(&:word)
 => ["や", "ぐら"] 
parts_of_speeches = Ve.in(:ja).words(string).map(&:part_of_speech)
 => [Ve::PartOfSpeech::Postposition, Ve::PartOfSpeech::Noun] 

Expected:

words  = Ve.in(:ja).words(string).map(&:word)
 => ["やぐら"] 
parts_of_speeches = Ve.in(:ja).words(string).map(&:part_of_speech)
 => [Ve::PartOfSpeech::Noun] 

Case 3

Actual:

string = '煮っころがし'

words  = Ve.in(:ja).words(string).map(&:word)
 => ["煮っ", "ころ", "が", "し"] 
parts_of_speeches = Ve.in(:ja).words(string).map(&:part_of_speech)
 => [Ve::PartOfSpeech::Verb, Ve::PartOfSpeech::Noun, Ve::PartOfSpeech::Postposition, Ve::PartOfSpeech::Verb]

Expected:

words  = Ve.in(:ja).words(string).map(&:word)
 => ["煮っころがし"] 
parts_of_speeches = Ve.in(:ja).words(string).map(&:part_of_speech)
 => [Ve::PartOfSpeech::Noun] 

Kimtaro commented 9 years ago

Ah, the ambiguities of language! :)

All the breakdowns here, both the actual and the expected ones, are valid parsings of these strings.

Personally, I prefer the prefix お to be parsed as a separate word, but you could write some post-processing logic to combine a prefix お with the following word.
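A minimal sketch of that kind of post-processing, under a few assumptions: `Word` is a hypothetical stand-in Struct that models only the `word` and `part_of_speech` attributes used in this thread, `merge_o_prefix` is an illustrative helper and not part of Ve, and labelling the merged word as a noun is a choice made here rather than something the parser reports.

```ruby
# Hypothetical stand-in for the objects returned by Ve.in(:ja).words(...);
# only the two attributes used in this thread are modelled.
Word = Struct.new(:word, :part_of_speech)

# Merge every prefix "お" into the word that follows it.
# The merged word is labelled with noun_pos, which is an assumption: the
# parser's own tag for the second half may differ (Case 1 reports a verb).
def merge_o_prefix(words, prefix_pos:, noun_pos:)
  merged = []
  pending = nil # a prefix waiting for the next word

  words.each do |w|
    if pending
      merged << Word.new(pending.word + w.word, noun_pos)
      pending = nil
    elsif w.word == 'お' && w.part_of_speech == prefix_pos
      pending = w
    else
      merged << w
    end
  end

  merged << pending if pending # trailing prefix with nothing to attach to
  merged
end

# Symbols stand in for the part-of-speech classes in this self-contained demo.
words  = [Word.new('お', :prefix), Word.new('つまみ', :verb)]
result = merge_o_prefix(words, prefix_pos: :prefix, noun_pos: :noun)
p result.map(&:word)           # => ["おつまみ"]
p result.map(&:part_of_speech) # => [:noun]
```

With real Ve output, the same helper would presumably be called with `Ve::PartOfSpeech::Prefix` and `Ve::PartOfSpeech::Noun`, and the merged object would need whatever other attributes your code relies on.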

For やぐら and 煮っころがし, you could add them as words to a custom dictionary, as I explained in #22. Even then, there is no guarantee that MeCab will parse them as single words; it depends on the cost values.
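As a rough illustration of what such dictionary entries might look like, the sketch below builds rows in the CSV layout that MeCab's IPADIC user dictionaries use, as far as I know: surface form, left context ID, right context ID, cost, then nine feature columns. The helper `user_dic_row` is hypothetical, the readings are my own, and the IDs and cost are placeholders rather than real values; the real ones depend on your system dictionary. MeCab picks the lowest-cost path, so an entry whose cost is too high can still lose to the split parsing.

```ruby
# Build one user-dictionary row in the IPADIC-style CSV layout:
# surface, left context ID, right context ID, cost, then nine feature columns
# (POS, three POS subtypes, conjugation type, conjugation form,
# base form, reading, pronunciation).
def user_dic_row(surface, reading, left_id:, right_id:, cost:)
  [surface, left_id, right_id, cost,
   '名詞', '一般', '*', '*', '*', '*',
   surface, reading, reading].join(',')
end

# Placeholder IDs and cost: NOT real values, look them up for your dictionary.
rows = [
  user_dic_row('やぐら', 'ヤグラ', left_id: 0, right_id: 0, cost: 1000),
  user_dic_row('煮っころがし', 'ニッコロガシ', left_id: 0, right_id: 0, cost: 1000)
]
puts rows
```

The resulting CSV would still have to be compiled into a user dictionary with MeCab's own tooling and registered with MeCab before Ve sees any difference.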