MyRobotLab / InMoov


ProgramAB in Chinese? #129

Open hairygael opened 6 years ago

hairygael commented 6 years ago

Do you guys think it's possible to use ProgramAB with the Chinese language? There is a request for it. I started the first AIML file, but if ProgramAB is not compatible with UTF-8, there is no point in continuing. [https://github.com/MyRobotLab/inmoov/blob/develop/InMoov/chatbot/bots/ch/_inmoovChatbot.aiml]

moz4r commented 6 years ago

Is it OK with Simplified Chinese?

hairygael commented 6 years ago

Hello Anthony, for now I have used Simplified Chinese by default. Kevin also suggests starting with Simplified.

Gael Langevin Creator of InMoov InMoov Robot http://www.inmoov.fr @inmoov http://twitter.com/inmoov


kwatters commented 6 years ago

Hi Gael,
So, currently Chinese isn't supported in ProgramAB. The reason: ProgramAB does not know where one word stops and the next one starts, because written Chinese has no spaces between words to delimit them. This problem is known as "word segmentation" (a form of "tokenization"). There is some limited support in ProgramAB for Japanese currently, but I am not a native speaker so I can't speak to its accuracy. It looked pretty crude when I first looked at it, but who knows, maybe it does a good job.

To make things a bit more complicated, Chinese can be written at least 3 different ways:

- Traditional Chinese: the older, more complex character forms, used mainly in Taiwan, Hong Kong, and Macau.
- Simplified Chinese: characters with fewer strokes, standard in mainland China and Singapore.
- Pinyin: a phonetic transcription of Chinese words using Latin characters.

If we can get webkitspeech to return Pinyin from its recognition, that would probably work right now as it is.

Traditional and Simplified Chinese both have the word segmentation problem. Another issue is that the same words can be written in either Simplified or Traditional characters, which means we need to settle on one character set. I recommend we focus on Simplified Chinese, as (I think) it's more widely used, but I'm not a Chinese speaker so I really can't comment on it with any authority.

So, long story short: with no spaces in Chinese text, ProgramAB can't split the input into words. We need to add code to ProgramAB that identifies where words start and stop in Chinese (maybe other languages too!) so that the AIML will match properly.

Right now, AIML for Chinese will only work with an EXACT match of the whole input string, which isn't very useful.
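To illustrate the kind of code this would need, here is a minimal sketch of forward maximum matching, a classic dictionary-based segmentation technique. It is not ProgramAB code: the class name `ChineseSegmenter`, the method names, and the tiny dictionary are all hypothetical, and a real segmenter would need a large word list and better handling of ambiguous cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: forward maximum matching word segmentation. */
class ChineseSegmenter {
    // Toy dictionary (assumption); a real one would hold many thousands of words.
    private static final Set<String> DICTIONARY = new HashSet<>(Arrays.asList(
            "你好", "机器人", "什么", "名字"));

    // Length of the longest dictionary entry, in UTF-16 chars.
    private static final int MAX_WORD_LENGTH = 3;

    /** Splits text into words, always preferring the longest dictionary match. */
    static List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(pos + MAX_WORD_LENGTH, text.length());
            // Shrink the candidate until it is a known word or a single character.
            while (end > pos + 1 && !DICTIONARY.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    /** Joins the segmented words with spaces, as a space-delimited matcher expects. */
    static String toSpaceDelimited(String text) {
        return String.join(" ", segment(text));
    }

    public static void main(String[] args) {
        System.out.println(toSpaceDelimited("你叫什么名字")); // 你 叫 什么 名字
        System.out.println(toSpaceDelimited("你好机器人"));   // 你好 机器人
    }
}
```

With input preprocessed like this, an AIML pattern written with the same spacing (for example `你 叫 什么 名字`) could in principle match word by word, and wildcards could stand in for individual words instead of requiring the whole string.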

There are some libraries out there that can do word segmentation, since the same technology is used in search engines. There are some tokenizers in Lucene/Solr that might do the trick for us. Otherwise, there's another library called ICU4J that handles some of these things, and yet another one from Stanford.
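As a rough idea of how a library-based tokenizer could be called, the sketch below uses the JDK's own `java.text.BreakIterator`. This is a stand-in, not one of the libraries named above: the JDK class may not segment Chinese by dictionary, so its output for Chinese text may be poor. ICU4J's `com.ibm.icu.text.BreakIterator` offers the same `getWordInstance`, `setText`, `first`, and `next` methods and, as far as I know, uses a dictionary for Chinese, so swapping the import is the assumed upgrade path. The class name `BreakIteratorTokenizer` is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: word tokenization through the BreakIterator API. */
class BreakIteratorTokenizer {
    /** Returns the non-whitespace segments found between word boundaries. */
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String token = text.substring(start, end);
            // Whitespace runs are reported as segments too; skip them.
            // Punctuation segments are kept as their own tokens.
            if (!token.trim().isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello world", Locale.ENGLISH));
        // Quality of the Chinese result depends on the BreakIterator implementation.
        System.out.println(tokenize("你叫什么名字", Locale.SIMPLIFIED_CHINESE));
    }
}
```

The tokens could then be joined with spaces before the text is handed to the AIML matcher.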

I found some code on Stack Overflow that is pretty relevant to what we need to do to make it work:

https://stackoverflow.com/questions/12484019/how-to-tokenize-chinese-language-document

hairygael commented 6 years ago

Thanks Kevin for all this information. I have sent the thread link to the Chinese contributor who is interested in the project, so we can start deciding which of the three options to choose.
