Kimtaro / ve

A linguistic framework that's easy to use.
MIT License

Breaking down infant speech #22

Open vietqhoang opened 9 years ago

vietqhoang commented 9 years ago

Not sure if this is within the scope of ve but here it goes...

Using:


Case 1

Actual:

string = 'しょれでびしょびしょになったー'

words  = Ve.in(:ja).words(string).map(&:word)
 => ["し", "ょれでびしょびしょになった", "ー"] 

Expected:

 => ["しょれで", "びしょびしょ", "になったー"]

Case 2

Actual:

string = 'じゃさしいもん'

words  = Ve.in(:ja).words(string).map(&:word)
 => ["じゃ", "さしい", "もん"] 

Expected:

 => ["じゃさしい", "もん"]

Kimtaro commented 9 years ago

I think the main issue is that the dictionary mecab is using doesn't have many kana-only words, so it doesn't know what to do with long strings of kana.

But you can add words to a custom dictionary and have mecab use that in addition to the main dictionary. I do this in beta Jisho to support words from JMdict and Wikipedia.

So if you added しょれ and じゃさしい as words to the dictionary, it might be able to understand these sentences. I say might because I haven't tried this with kana-only words and sentences.

The mecab site has a page on adding words: http://mecab.googlecode.com/svn/trunk/mecab/doc/dic.html. Ve also lets you pass command line options, so you can tell mecab to start with the custom dictionary loaded.
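
As a rough sketch of what the entries for those two words might look like, assuming the IPADIC dictionary and its 13-column CSV layout. The helper name `user_dic_row`, the context IDs, the cost, and the choice of noun as the PoS are all placeholders rather than anything from this thread; real IDs come from the left-id.def and right-id.def files of the dictionary you build against.

```ruby
# Hypothetical helper, not part of Ve or mecab.
# Assumes the IPADIC column order: surface, left ID, right ID, cost,
# PoS, three PoS subcategories, inflection type, inflection form,
# base form, reading, pronunciation.
def user_dic_row(surface, left_id:, right_id:, cost:, pos: '名詞', pos_detail: '一般')
  # For kana-only words, derive the katakana reading from the hiragana surface.
  reading = surface.tr('ぁ-ん', 'ァ-ン')
  [surface, left_id, right_id, cost, pos, pos_detail, '*', '*', '*', '*',
   surface, reading, reading].join(',')
end

# Placeholder IDs and cost; the PoS must exist in the main dictionary.
rows = %w[しょれ じゃさしい].map do |word|
  user_dic_row(word, left_id: 1285, right_id: 1285, cost: 5000)
end

puts rows
```

Each printed line would go into the CSV file that the custom dictionary is compiled from.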

There are a few quirks to be aware of though. For example, you can't modify the dictionary of a running mecab, so you have to build the custom dictionary under a different filename each time. I use the suffixes A and B. You must also specify a PoS that exists in the main dictionary you are building from.
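
Here is a minimal sketch of the A/B approach, assuming the custom dictionary is compiled with mecab-dict-index and loaded with mecab's `-u` option. The helper names, the file names, and `/path/to/ipadic` are made up for illustration; how the resulting argument string reaches mecab through Ve depends on the provider configuration, which this sketch does not cover.

```ruby
# Hypothetical helpers; only the mecab-dict-index and mecab flags are real.

# Flip between two suffixes so the file a running mecab has loaded
# is never overwritten by the next build.
def next_suffix(current_suffix)
  current_suffix == 'A' ? 'B' : 'A'
end

# Command that compiles a CSV of entries into a binary user dictionary.
def build_command(system_dic_dir:, csv_path:, user_dic_path:)
  ['mecab-dict-index', '-d', system_dic_dir, '-u', user_dic_path,
   '-f', 'utf-8', '-t', 'utf-8', csv_path]
end

# Argument string that tells mecab to load the user dictionary at startup.
def mecab_arguments(user_dic_path)
  "-u #{user_dic_path}"
end

suffix   = next_suffix('A')          # the running mecab is using custom_A.dic
user_dic = "custom_#{suffix}.dic"
command  = build_command(system_dic_dir: '/path/to/ipadic',
                         csv_path: 'custom.csv',
                         user_dic_path: user_dic)

puts command.join(' ')
puts mecab_arguments(user_dic)
# To actually run the build: system(*command)
```

After the build finishes, a new mecab process started with the printed arguments would pick up the fresh file, and the old one can be retired.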

I guess ideally I should clean up the code I have around this and release it, but it'd take a while probably :/

Let me know if you have any questions about this!