RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
https://rasa.com/docs/rasa/
Apache License 2.0

more language benchmarks #1121

Closed amn41 closed 6 years ago

amn41 commented 6 years ago

With PR https://github.com/RasaHQ/rasa_nlu/pull/1095 we have an embedding + crf pipeline which can do intent and entity recognition in any language.

If you are currently testing this and are willing to share your results, please do so! Benchmarks together with datasets are especially welcome.

We expect performance to vary:

So far, I've heard of people in the community using this pipeline for:

and I'm sure there are many more. Please let us know!

souvikg10 commented 6 years ago

I will take a look at the CRF NER for Bengali/Hindi. This is a very useful feature.

geekboood commented 6 years ago

I first applied the [WIP] feature "count_vectors_featurizer can use tokens provided by tokenizers" contributed by @howl-anderson. Here is my pipeline setting.


pipeline:
- name: "tokenizer_jieba"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"

My training dataset is demo-rasa_zh_medical.json from Rasa_NLU_Chi, and the performance is not very good. Here are my test cases.

{'intent_ranking': [{'confidence': 0.9747257828712463, 'name': 'movie.search'}, {'confidence': 0.013274772092700005, 'name': 'other'}, {'confidence': -0.09033790230751038, 'name': 'movie.recommend'}, {'confidence': -0.1267366111278534, 'name': 'weather'}, {'confidence': -0.31307536363601685, 'name': 'movie.filter'}], 'intent': {'confidence': 0.9747257828712463, 'name': 'movie.search'}, 'entities': [], 'text': '北京天气如何'}
>
北京的天气怎么样
['北京', '的', '天气', '怎么样']
{'intent_ranking': [{'confidence': 0.9731086492538452, 'name': 'weather'}, {'confidence': 0.014019887894392014, 'name': 'other'}, {'confidence': -0.07188774645328522, 'name': 'movie.search'}, {'confidence': -0.19051238894462585, 'name': 'movie.recommend'}, {'confidence': -0.19962245225906372, 'name': 'movie.filter'}], 'intent': {'confidence': 0.9731086492538452, 'name': 'weather'}, 'entities': [], 'text': '北京的天气怎么样'}
>
今天天气如何
['今天天气', '如何']
{'intent_ranking': [{'confidence': 0.9747257828712463, 'name': 'movie.search'}, {'confidence': 0.013274772092700005, 'name': 'other'}, {'confidence': -0.09033790230751038, 'name': 'movie.recommend'}, {'confidence': -0.1267366111278534, 'name': 'weather'}, {'confidence': -0.31307536363601685, 'name': 'movie.filter'}], 'intent': {'confidence': 0.9747257828712463, 'name': 'movie.search'}, 'entities': [], 'text': '今天天气如何'}
>
今天天津的天气如何啊
['今天', '天津', '的', '天气', '如何', '啊']
{'intent_ranking': [{'confidence': 0.9874239563941956, 'name': 'weather'}, {'confidence': -0.06193627044558525, 'name': 'other'}, {'confidence': -0.10760164260864258, 'name': 'movie.search'}, {'confidence': -0.14756201207637787, 'name': 'movie.recommend'}, {'confidence': -0.2833463251590729, 'name': 'movie.filter'}], 'intent': {'confidence': 0.9874239563941956, 'name': 'weather'}, 'entities': [], 'text': '今天天津的天气如何啊'}
>
北京的天气如何
['北京', '的', '天气', '如何']
{'intent_ranking': [{'confidence': 0.9747257828712463, 'name': 'movie.search'}, {'confidence': 0.013274772092700005, 'name': 'other'}, {'confidence': -0.09033790230751038, 'name': 'movie.recommend'}, {'confidence': -0.1267366111278534, 'name': 'weather'}, {'confidence': -0.31307536363601685, 'name': 'movie.filter'}], 'intent': {'confidence': 0.9747257828712463, 'name': 'movie.search'}, 'entities': [], 'text': '北京的天气如何'}
>

As you can see, when the input sentence is exactly the same as one in the dataset, the classifier does its job very well. However, when I change a word, the classifier gets it totally wrong. As far as I can tell, count_vectors_featurizer may only extract shallow features. Maybe we should represent words with word vectors, which are better at capturing deeper features.
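The failure mode geekboood describes can be sketched outside Rasa. This is not the Rasa internals, just a minimal illustration with scikit-learn's CountVectorizer on hypothetical pre-tokenized utterances (tokens joined by spaces, as a jieba-based tokenizer would produce): a word absent from the training vocabulary contributes nothing to the feature vector, so swapping 怎么样 for the unseen synonym 如何 silently drops a feature.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-tokenized training utterances (space-joined jieba tokens).
train = ["北京 天气 怎么样", "推荐 一部 电影"]

# token_pattern=r"\S+" treats each whitespace-separated token as one feature.
vectorizer = CountVectorizer(token_pattern=r"\S+")
vectorizer.fit(train)

seen = vectorizer.transform(["北京 天气 怎么样"]).toarray()[0]
unseen = vectorizer.transform(["北京 天气 如何"]).toarray()[0]

print(sum(seen))    # 3 -- every token was in the training vocabulary
print(sum(unseen))  # 2 -- "如何" is out-of-vocabulary and simply vanishes
```

With a pre-trained embedding featurizer instead, 如何 would still map to a vector near 怎么样, which is why the MITIE-based pipeline below holds up better.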

Ghostvv commented 6 years ago

@geekboood "when I change some word" - do you mean the classifier never saw this word during training?

geekboood commented 6 years ago

@Ghostvv Probably. You mean this classifier isn't designed to classify sentences containing words it never saw during training?

geekboood commented 6 years ago

@Ghostvv Could I pass more features to the embedding_intent_classifier, such as named entities and parse trees?

geekboood commented 6 years ago

@Ghostvv Also, the old pipeline performs better. Here is the setting.

language: "zh"

pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"
北京的天气怎么样
{'entities': [{'start': 0, 'entity': 'location', 'extractor': 'ner_mitie', 'end': 2, 'confidence': None, 'value': '北京'}], 'intent_ranking': [{'name': 'weather', 'confidence': 0.73685717906111781}, {'name': 'movie.recommend', 'confidence': 0.079560079601130859}, {'name': 'movie.search', 'confidence': 0.067201055818077352}, {'name': 'movie.filter', 'confidence': 0.058952587054810698}, {'name': 'other', 'confidence': 0.057429098464862917}], 'intent': {'name': 'weather', 'confidence': 0.73685717906111781}, 'text': '北京的天气怎么样'}
>
今天天气如何
{'entities': [{'start': 0, 'entity': 'date_relative', 'extractor': 'ner_mitie', 'end': 4, 'confidence': None, 'value': '今天天气'}], 'intent_ranking': [{'name': 'weather', 'confidence': 0.85739979078626116}, {'name': 'movie.recommend', 'confidence': 0.048164508418653974}, {'name': 'movie.filter', 'confidence': 0.040367227568456697}, {'name': 'other', 'confidence': 0.029673993993817607}, {'name': 'movie.search', 'confidence': 0.024394479232810646}], 'intent': {'name': 'weather', 'confidence': 0.85739979078626116}, 'text': '今天天气如何'}
>
今天天津的天气如何啊
{'entities': [{'start': 0, 'entity': 'date_relative', 'extractor': 'ner_mitie', 'end': 2, 'confidence': None, 'value': '今天'}, {'start': 2, 'entity': 'location', 'extractor': 'ner_mitie', 'end': 4, 'confidence': None, 'value': '天津'}], 'intent_ranking': [{'name': 'weather', 'confidence': 0.73687145718610414}, {'name': 'movie.recommend', 'confidence': 0.083727992347842381}, {'name': 'movie.search', 'confidence': 0.068220347303949214}, {'name': 'other', 'confidence': 0.058831686730290357}, {'name': 'movie.filter', 'confidence': 0.052348516431814048}], 'intent': {'name': 'weather', 'confidence': 0.73687145718610414}, 'text': '今天天津的天气如何啊'}
>
北京的天气如何
{'entities': [{'start': 0, 'entity': 'location', 'extractor': 'ner_mitie', 'end': 2, 'confidence': None, 'value': '北京'}], 'intent_ranking': [{'name': 'weather', 'confidence': 0.7399888132057717}, {'name': 'movie.filter', 'confidence': 0.094004024297399458}, {'name': 'movie.recommend', 'confidence': 0.063420885821614453}, {'name': 'movie.search', 'confidence': 0.061949089608442351}, {'name': 'other', 'confidence': 0.040637187066771885}], 'intent': {'name': 'weather', 'confidence': 0.7399888132057717}, 'text': '北京的天气如何'}
>
Ghostvv commented 6 years ago

@geekboood "You mean this classifier isn't designed to classify sentences containing words it never saw during training?" - if a sentence contains only some unseen words, that's fine as long as they are not critical. Otherwise, the model cannot handle something it never saw during training, because it has no preloaded word vectors.

Yes, you can pass more features, but you need to add your custom featurizer to the pipeline.
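Conceptually, a custom featurizer just appends extra signals to the feature vector the classifier sees. The sketch below is an illustration only, not the actual Rasa component API; `append_extra_features` and the chosen signals (entity count, token count) are hypothetical. In a real pipeline this logic would live inside a custom component class registered in the NLU pipeline config.

```python
import numpy as np

def append_extra_features(text_features, entities, tokens):
    """Append hypothetical extra signals (entity count, token count)
    to an existing text-feature vector."""
    extra = np.array([len(entities), len(tokens)], dtype=float)
    return np.concatenate([text_features, extra])

base = np.array([0.0, 1.0, 1.0])  # e.g. count-vector features
enriched = append_extra_features(
    base,
    entities=["北京"],
    tokens=["北京", "的", "天气", "如何"],
)
print(enriched)  # base features followed by [1.0, 4.0]
```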

The performance depends on your training data and how well the word vectors are related to your domain.

geekboood commented 6 years ago

@Ghostvv How could I add some word vectors?

Ghostvv commented 6 years ago

@geekboood you could add word vectors as features if you use "intent_featurizer_mitie" instead of "intent_featurizer_count_vectors". However, I'm not sure if it produces better results.

geekboood commented 6 years ago

@Ghostvv Thank you so much! I changed intent_featurizer_count_vectors to intent_featurizer_mitie and the results are much better!

howl-anderson commented 6 years ago

@geekboood embedding_intent_classifier has an OOV issue: it cannot use anything it didn't see during training. Models based on pre-trained vectors (e.g. MITIE, spaCy) don't suffer from OOV as badly, so they should perform better on a small training set.
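Why pre-trained vectors soften the OOV problem can be shown with a toy example (the vectors below are made up for illustration; real ones would come from MITIE or spaCy): an unseen synonym like 如何 still lands near 怎么样 in embedding space, so averaged sentence features for the two phrasings stay close.

```python
import numpy as np

# Made-up 3-d "pre-trained" word vectors; 如何 is deliberately close to 怎么样.
vectors = {
    "北京":   np.array([1.0, 0.0, 0.0]),
    "天气":   np.array([0.0, 1.0, 0.0]),
    "怎么样": np.array([0.0, 0.0, 1.0]),
    "如何":   np.array([0.0, 0.1, 0.9]),
}

def sentence_vector(tokens):
    # Mean of word vectors, a common simple sentence featurization.
    return np.mean([vectors[t] for t in tokens], axis=0)

a = sentence_vector(["北京", "天气", "怎么样"])  # seen phrasing
b = sentence_vector(["北京", "天气", "如何"])    # unseen synonym swapped in
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # close to 1.0, so the classifier sees nearly the same input
```

A count-vector featurizer, by contrast, would give 如何 an all-zero contribution, which matches the misclassifications shown earlier in the thread.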

geekboood commented 6 years ago

@howl-anderson It seems so. By the way, could we port an NN-based NER module, such as an ID-CNN-CRF model? I think it would work better than the mitie_entity_extractor module.

howl-anderson commented 6 years ago

@geekboood Good idea! There may be a potential issue, though: DNN-based NER extractors (e.g. ID-CNN-CRF, as you said) can outperform the MITIE-based extractor on a large corpus, but on the small corpora that are common in Rasa NLU applications, it may be (just a guess, I am not sure) that DNN models trained from scratch perform worse than pre-trained models such as MITIE or spaCy. Also, spaCy v2 already seems to use a neural network to train its NER extractor. spaCy v2 does not support Chinese yet, but I am working on it; see https://github.com/howl-anderson/Chinese_models_for_SpaCy

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue has been automatically closed due to inactivity. Please create a new issue if you need more help.