RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
https://rasa.com/docs/rasa/
Apache License 2.0

Add Stanford CoreNLP as pipeline options to support Chinese and more languages #453

Closed crownpku closed 6 years ago

crownpku commented 7 years ago

Hi, all.

As we discussed on Gitter, we'd like to add Chinese language support to Rasa NLU. The problem is that, since Chinese does not have whitespace between words, tokenizing Chinese is not a trivial problem, not to mention entity and intent recognition.

Neither spaCy nor MITIE supports Chinese. Training a separate Chinese model for spaCy or MITIE is not easy, and it will probably not perform as well as existing Chinese NLP tools. (I did spend 30+ hours training a Chinese word vector model on MITIE, and I feel helpless about training another NER model on MITIE since I do not have the annotated data.)

I'm wondering if we can add Stanford CoreNLP as one of the pipeline modules. As far as I know, CoreNLP is very powerful and is actively maintained (more actively than MITIE, I guess...) for multiple languages, including Chinese.


Stanford CoreNLP can be integrated by using a Python wrapper, or by hosting a CoreNLP server in the backend and using its RESTful API. I guess that with its tokenization and NER, plus sklearn for intent classification, we can make a very good pipeline handling Chinese as well as English, German, and Spanish.

I'm looking into the Rasa NLU code and would be more than happy to contribute. However, I doubt I can handle the task alone, and I need your suggestions and help to make this happen. Thank you!

Best, crownpku

Tpt commented 7 years ago

As a user of CoreNLP with Python, I would recommend using the CoreNLP REST API. It's much easier to set up and manage.
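For a sense of what that looks like, here is a minimal sketch of querying a locally running CoreNLP server over REST. It assumes the server is started on port 9000 (the default); the helper names are illustrative, not part of any Rasa or CoreNLP Python API:

```python
# Query a locally running CoreNLP server over its REST API.
# Start the server first, e.g.:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
import json
import urllib.parse
import urllib.request

def build_corenlp_url(base="http://localhost:9000",
                      annotators="tokenize,ssplit,ner"):
    """CoreNLP takes its configuration as a URL-encoded JSON
    'properties' query parameter."""
    props = json.dumps({"annotators": annotators, "outputFormat": "json"})
    return base + "/?properties=" + urllib.parse.quote(props)

def annotate(text, **kwargs):
    """POST raw text to the server and return the parsed JSON annotation."""
    req = urllib.request.Request(build_corenlp_url(**kwargs),
                                 data=text.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The returned JSON contains per-sentence `tokens` and NER labels, which a tokenizer or entity-extractor component could map onto Rasa NLU's internal format.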

Another possible way to add Chinese support to Rasa NLU might be to train spaCy for it. It's easy to plug external tokenizers into spaCy (see [1] for Japanese), and training spaCy with Universal Dependencies datasets [2] is fairly easy.

[1] https://github.com/explosion/spaCy/blob/master/spacy/ja/__init__.py#L16 [2] https://github.com/UniversalDependencies/UD_Chinese
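A rough sketch of that plug-in-tokenizer idea, with heavy caveats: the spaCy API has changed across versions, so the import path and constructor below are assumptions, and `segment` is only a placeholder for a real segmenter such as `jieba.cut`:

```python
# Swap an external word segmenter in as spaCy's tokenizer, similar to how
# spaCy's Japanese support wraps an external tokenizer.
try:
    from spacy.tokens import Doc
    from spacy.lang.zh import Chinese  # import path varies by spaCy version
    HAVE_SPACY = True
except ImportError:
    HAVE_SPACY = False

def segment(text):
    # Placeholder segmenter: character-level split.
    # A real pipeline would call e.g. jieba.cut(text) here.
    return [ch for ch in text if not ch.isspace()]

if HAVE_SPACY:
    try:
        nlp = Chinese()
        # Replace the default tokenizer with one backed by our segmenter.
        nlp.tokenizer = lambda text: Doc(nlp.vocab, words=segment(text))
    except Exception:
        pass  # some spaCy versions need extra setup for Chinese
```

With the tokenizer swapped out, the rest of the spaCy pipeline (tagging, NER) can be trained on pre-segmented data such as the UD treebanks.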

matteoredaelli commented 7 years ago

Nice feature: there is also a CoreNLP extension for the Italian language at https://github.com/dhfbk/tint

crownpku commented 7 years ago

@Tpt Thanks! I will definitely try training Spacy for Chinese support.

TDehaene commented 7 years ago

Hi @crownpku: any luck training spaCy for Chinese support so far? Really curious about any outcomes.

crownpku commented 7 years ago

@TDehaene I used jieba (a Chinese tokenizer) + MITIE + sklearn to get pretty good Chinese NLU results. Still haven't tried spaCy for Chinese support...

buivietan commented 6 years ago

@crownpku I'm trying to add Japanese to Rasa NLU using spaCy + janome (a Japanese tokenizer), but the results are not so good (it still cannot extract entities). I would like to ask if you could show me how to add Chinese to Rasa NLU using jieba + MITIE + sklearn. Thanks in advance.

crownpku commented 6 years ago

@buivietan I have a rasa_nlu fork here. You can search for the jieba tokenizer and find the modifications. Most of the documentation is in Chinese... anyway, you can always ask me in the fork repository's issues.
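For anyone following along, the jieba side of that setup might look roughly like the sketch below. The class and method names are illustrative of a rasa_nlu tokenizer component, not the actual code in the fork, and the block falls back to a whitespace split when jieba isn't installed:

```python
# Illustrative jieba-backed tokenizer in the shape of a rasa_nlu component.
try:
    import jieba  # Chinese word segmenter; optional dependency
except ImportError:
    jieba = None

class JiebaTokenizer:
    def tokenize(self, text):
        """Return a list of word tokens for `text`."""
        if jieba is not None:
            # jieba.cut yields segments, including whitespace; drop blanks.
            return [t for t in jieba.cut(text) if t.strip()]
        # Fallback without jieba: whitespace split (only sensible for
        # languages that already delimit words with spaces).
        return text.split()
```

Downstream, MITIE featurization and an sklearn intent classifier can consume these tokens much as they would whitespace-split English tokens.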

buivietan commented 6 years ago

@crownpku Thanks so much for your quick reply.

tmbo commented 6 years ago

@twerkmeister I think you should join this discussion as well :)

So we can get a version of this integrated into the main repository.

crownpku commented 6 years ago

@tmbo I actually have a pull request here. The last remaining problem, I think, was about language handling across the different modules.

tmbo commented 6 years ago

There is a bit more to this. I had a discussion with @twerkmeister earlier and he pointed out a couple of issues we might face by just including it in the pipeline; maybe he can comment on that.

twerkmeister commented 6 years ago

Thank you to everyone who contributed to this discussion so far! This is really helpful, and it's good to hear the MITIE pipeline is working well, @crownpku. I have been doing some tests on Chinese using a spaCy-based pipeline with jieba and fastText word vectors. The intent classification works well, but NER and some other components have problems.

CoreNLP is a fine piece of software. However, it is released under the GPLv3 license, which would require us to publish rasa_nlu under GPLv3 too, if I am not mistaken (does the GPL also enforce the "same license" rule when it's an optional dependency?). Secondly, I wonder whether we should put effort into supporting another major pipeline component that overlaps heavily with spaCy and MITIE. @tmbo thoughts?

In the coming days I will do some more work on improving the compatibility of the spacy-related components with Chinese and add an NER benchmark to make sure the changes are useful. I'll keep you updated!

Running-He commented 6 years ago

@twerkmeister Nice work! Any updates since your previous post? Looking forward to seeing how you make this work.

wrathagom commented 6 years ago

If the goal of this thread is Chinese support, there are related conversations going on at #705 and #972.

yingrui commented 5 years ago

Is it possible for Rasa to define an HTTP API? For example, an HttpComponent that calls another HTTP server to implement the tokenizer and entity extractor. I guess this would not have the license issue, even if the HTTP API were similar to Stanford NLP's, and it would make Rasa more open for other developers to integrate their own NLP tools.
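To make the proposal concrete, a minimal sketch of such a component follows. The class name, endpoint, and payload shape are all assumptions for illustration, not part of any Rasa API:

```python
# Hypothetical HttpComponent-style tokenizer that delegates segmentation
# to an external HTTP service. Endpoint and payload shape are assumptions.
import json
import urllib.request

class HttpTokenizer:
    def __init__(self, endpoint="http://localhost:8000/tokenize"):
        self.endpoint = endpoint

    def build_request(self, text):
        """Build the POST request sent to the external tokenizer service."""
        payload = json.dumps({"text": text}).encode("utf-8")
        return urllib.request.Request(
            self.endpoint, data=payload,
            headers={"Content-Type": "application/json"})

    def tokenize(self, text):
        """POST the text; expect a JSON body like {"tokens": [...]} back."""
        with urllib.request.urlopen(self.build_request(text)) as resp:
            return json.loads(resp.read().decode("utf-8"))["tokens"]
```

Because Rasa would only depend on the wire protocol, the service behind the endpoint could wrap CoreNLP, jieba, or anything else without affecting Rasa's license.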

tmbo commented 5 years ago

Yes, I think that is possible, but the discussion should take place on #4232.