Closed by amn41 6 years ago
I will take a look at the NER CRF for Bengali/Hindi. This is a very useful feature.
I first applied the feature from "[WIP] New feature: component `count_vectors_featurizer` can use tokens provided by tokenizers", contributed by @howl-anderson. Here is my pipeline setting:
pipeline:
- name: "tokenizer_jieba"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
My training dataset is demo-rasa_zh_medical.json from Rasa_NLU_Chi, and the performance is not very good. Here are my test cases:
{'intent_ranking': [{'confidence': 0.9747257828712463, 'name': 'movie.search'}, {'confidence': 0.013274772092700005, 'name': 'other'}, {'confidence': -0.09033790230751038, 'name': 'movie.recommend'}, {'confidence': -0.1267366111278534, 'name': 'weather'}, {'confidence': -0.31307536363601685, 'name': 'movie.filter'}], 'intent': {'confidence': 0.9747257828712463, 'name': 'movie.search'}, 'entities': [], 'text': '北京天气如何'}
>
北京的天气怎么样
['北京', '的', '天气', '怎么样']
{'intent_ranking': [{'confidence': 0.9731086492538452, 'name': 'weather'}, {'confidence': 0.014019887894392014, 'name': 'other'}, {'confidence': -0.07188774645328522, 'name': 'movie.search'}, {'confidence': -0.19051238894462585, 'name': 'movie.recommend'}, {'confidence': -0.19962245225906372, 'name': 'movie.filter'}], 'intent': {'confidence': 0.9731086492538452, 'name': 'weather'}, 'entities': [], 'text': '北京的天气怎么样'}
>
今天天气如何
['今天天气', '如何']
{'intent_ranking': [{'confidence': 0.9747257828712463, 'name': 'movie.search'}, {'confidence': 0.013274772092700005, 'name': 'other'}, {'confidence': -0.09033790230751038, 'name': 'movie.recommend'}, {'confidence': -0.1267366111278534, 'name': 'weather'}, {'confidence': -0.31307536363601685, 'name': 'movie.filter'}], 'intent': {'confidence': 0.9747257828712463, 'name': 'movie.search'}, 'entities': [], 'text': '今天天气如何'}
>
今天天津的天气如何啊
['今天', '天津', '的', '天气', '如何', '啊']
{'intent_ranking': [{'confidence': 0.9874239563941956, 'name': 'weather'}, {'confidence': -0.06193627044558525, 'name': 'other'}, {'confidence': -0.10760164260864258, 'name': 'movie.search'}, {'confidence': -0.14756201207637787, 'name': 'movie.recommend'}, {'confidence': -0.2833463251590729, 'name': 'movie.filter'}], 'intent': {'confidence': 0.9874239563941956, 'name': 'weather'}, 'entities': [], 'text': '今天天津的天气如何啊'}
>
北京的天气如何
['北京', '的', '天气', '如何']
{'intent_ranking': [{'confidence': 0.9747257828712463, 'name': 'movie.search'}, {'confidence': 0.013274772092700005, 'name': 'other'}, {'confidence': -0.09033790230751038, 'name': 'movie.recommend'}, {'confidence': -0.1267366111278534, 'name': 'weather'}, {'confidence': -0.31307536363601685, 'name': 'movie.filter'}], 'intent': {'confidence': 0.9747257828712463, 'name': 'movie.search'}, 'entities': [], 'text': '北京的天气如何'}
>
As you can see, when the input sentence is exactly the same as one in the dataset, the classifier does its job very well. However, when I change a word, the classifier gets it totally wrong.
As far as I can tell, count_vectors_featurizer may only extract shallow features. Maybe we should use word vectors to represent the words, which are better able to capture deeper features.
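One concrete symptom is visible in the tokenizations above: jieba merged 今天天气 into a single token, so even though 天气 appears in the training data, the merged token is out-of-vocabulary for a bag-of-words featurizer. A toy sketch of the token overlap such a featurizer sees (purely illustrative, not the actual featurizer code):

```python
# Toy bag-of-words token overlap, mimicking what a count-vector featurizer sees.
# Tokenizations taken from the jieba output shown in this thread.
train = ["今天", "天津", "的", "天气", "如何", "啊"]  # exact-match training example
test_merged = ["今天天气", "如何"]   # jieba merged 今天天气 into one token
test_split = ["今天", "天气", "如何"]  # same words, hypothetically split apart

def overlap(train_tokens, test_tokens):
    """Fraction of test tokens that also appear in the training example."""
    return sum(t in train_tokens for t in test_tokens) / len(test_tokens)

print(overlap(train, test_merged))  # 0.5 -> merged token is unseen, only 如何 matches
print(overlap(train, test_split))   # 1.0 -> every token was seen during training
```

With half the tokens unseen, the count-vector representation gives the classifier little to work with, which matches the misclassifications above.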
@geekboood when I change a word - do you mean the classifier never saw this word during training?
@Ghostvv Probably. Do you mean this classifier isn't designed to classify sentences containing words it has never seen?
@Ghostvv Could I pass more features to the embedding_intent_classifier, such as named entities and a parse tree?
@Ghostvv Also, the old pipeline performs better. Here is the setting:
language: "zh"
pipeline:
- name: "nlp_mitie"
model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"
北京的天气怎么样
{'entities': [{'start': 0, 'entity': 'location', 'extractor': 'ner_mitie', 'end': 2, 'confidence': None, 'value': '北京'}], 'intent_ranking': [{'name': 'weather', 'confidence': 0.73685717906111781}, {'name': 'movie.recommend', 'confidence': 0.079560079601130859}, {'name': 'movie.search', 'confidence': 0.067201055818077352}, {'name': 'movie.filter', 'confidence': 0.058952587054810698}, {'name': 'other', 'confidence': 0.057429098464862917}], 'intent': {'name': 'weather', 'confidence': 0.73685717906111781}, 'text': '北京的天气怎么样'}
>
今天天气如何
{'entities': [{'start': 0, 'entity': 'date_relative', 'extractor': 'ner_mitie', 'end': 4, 'confidence': None, 'value': '今天天气'}], 'intent_ranking': [{'name': 'weather', 'confidence': 0.85739979078626116}, {'name': 'movie.recommend', 'confidence': 0.048164508418653974}, {'name': 'movie.filter', 'confidence': 0.040367227568456697}, {'name': 'other', 'confidence': 0.029673993993817607}, {'name': 'movie.search', 'confidence': 0.024394479232810646}], 'intent': {'name': 'weather', 'confidence': 0.85739979078626116}, 'text': '今天天气如何'}
>
今天天津的天气如何啊
{'entities': [{'start': 0, 'entity': 'date_relative', 'extractor': 'ner_mitie', 'end': 2, 'confidence': None, 'value': '今天'}, {'start': 2, 'entity': 'location', 'extractor': 'ner_mitie', 'end': 4, 'confidence': None, 'value': '天津'}], 'intent_ranking': [{'name': 'weather', 'confidence': 0.73687145718610414}, {'name': 'movie.recommend', 'confidence': 0.083727992347842381}, {'name': 'movie.search', 'confidence': 0.068220347303949214}, {'name': 'other', 'confidence': 0.058831686730290357}, {'name': 'movie.filter', 'confidence': 0.052348516431814048}], 'intent': {'name': 'weather', 'confidence': 0.73687145718610414}, 'text': '今天天津的天气如何啊'}
>
北京的天气如何
{'entities': [{'start': 0, 'entity': 'location', 'extractor': 'ner_mitie', 'end': 2, 'confidence': None, 'value': '北京'}], 'intent_ranking': [{'name': 'weather', 'confidence': 0.7399888132057717}, {'name': 'movie.filter', 'confidence': 0.094004024297399458}, {'name': 'movie.recommend', 'confidence': 0.063420885821614453}, {'name': 'movie.search', 'confidence': 0.061949089608442351}, {'name': 'other', 'confidence': 0.040637187066771885}], 'intent': {'name': 'weather', 'confidence': 0.7399888132057717}, 'text': '北京的天气如何'}
>
@geekboood You mean this classifier isn't designed to classify sentences containing words it has never seen? - if a sentence contains only a few such words, it is fine as long as they are not critical. Otherwise, it cannot learn something it never saw during training, because it doesn't have preloaded word vectors.
Yes, you can pass more features, but you need to add your custom featurizer to the pipeline.
The performance depends on your training data and how well the word vectors are related to your domain.
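A custom featurizer along those lines could append entity-based features to the message's feature vector. The sketch below is a simplified stand-in, not the real rasa_nlu Component API: the `Message` class and the way features are attached are hypothetical, and a real component would subclass the pipeline's Featurizer base class.

```python
# Sketch of a custom featurizer that appends entity counts to the existing
# feature vector. Message is a simplified stand-in for rasa_nlu's Message.

class Message:
    def __init__(self, text, entities=None, features=None):
        self.text = text
        self.entities = entities or []  # e.g. [{"entity": "location", ...}]
        self.features = features or []  # dense features produced so far

class EntityCountFeaturizer:
    """Appends one count per known entity type to message.features."""

    def __init__(self, entity_types):
        self.entity_types = entity_types

    def process(self, message):
        counts = [sum(e["entity"] == t for e in message.entities)
                  for t in self.entity_types]
        message.features = message.features + counts

# Example mirroring the ner_mitie output above for 今天天津的天气如何啊.
msg = Message("今天天津的天气如何啊",
              entities=[{"entity": "date_relative"}, {"entity": "location"}],
              features=[0.1, 0.3])
EntityCountFeaturizer(["location", "date_relative"]).process(msg)
print(msg.features)  # [0.1, 0.3, 1, 1]
```

The downstream classifier then sees the entity counts alongside the text features; whether that actually helps depends on the dataset, as noted above.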
@Ghostvv How could I add some word vectors?
@geekboood you could add word vectors as features if you use "intent_featurizer_mitie" instead of "intent_featurizer_count_vectors". However, I'm not sure if it produces better results.
@Ghostvv Thank you so much! I changed intent_featurizer_count_vectors to intent_featurizer_mitie and the result is so much better!
@geekboood embedding_intent_classifier has an OOV issue: it cannot use anything it hasn't seen during training. Models based on pre-trained vectors (e.g. MITIE, SpaCy) don't have such a serious OOV issue, so they should perform better on a small training set.
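The difference can be sketched with toy pre-trained vectors: even if 北京 never appears in the training data, its pre-trained vector sits close to 天津, so an averaged sentence vector still lands near the weather examples. The numbers below are invented purely for illustration:

```python
import math

# Toy "pre-trained" word vectors; the numbers are made up for illustration.
vectors = {
    "天津": [1.0, 0.0], "北京": [0.9, 0.1],   # two cities, close together
    "天气": [0.0, 1.0], "电影": [-1.0, 0.0],  # weather vs movie vocabulary
}

def sentence_vec(tokens):
    """Average the vectors of known tokens, as a MITIE-style featurizer might."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

train = sentence_vec(["天津", "天气"])  # sentence seen during training
test = sentence_vec(["北京", "天气"])   # 北京 unseen in training, but in the
                                        # pre-trained vocabulary
print(cosine(train, test))              # close to 1.0
```

A count-vector featurizer would give 北京 an all-zero column here, while the pre-trained vector keeps the test sentence near the training sentence, which matches why switching to intent_featurizer_mitie helped.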
@howl-anderson It seems like it. By the way, could we port an NN-based NER module, such as an ID-CNN-CRF model? I think it would work better than the mitie_entity_extractor module.
@geekboood Good idea! But there may be a potential issue: DNN-based NER extractors (e.g. ID-CNN-CRF, as you said) can outperform a MITIE-based extractor on a large corpus, but on the small corpora that are very common in Rasa NLU applications, it may happen (just a guess, I am not sure) that DNN models trained from scratch perform worse than pre-trained models such as MITIE or SpaCy. Also, SpaCy v2 already seems to use a NN to train its NER extractor. Currently SpaCy v2 does not support Chinese, but I am working on it; see https://github.com/howl-anderson/Chinese_models_for_SpaCy
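For reference, the decoding step shared by CRF-based taggers (including an ID-CNN-CRF) is plain Viterbi over BIO tags. A minimal sketch with made-up emission and transition scores:

```python
# Minimal Viterbi decoder over BIO tags, the decoding step used by CRF-style
# NER taggers. All scores below are made up for illustration.

def viterbi(emissions, transitions, tags):
    """emissions: list of per-token dicts tag->score;
    transitions: dict (prev_tag, cur_tag)->score. Returns best tag path."""
    best = {t: emissions[0][t] for t in tags}  # best score ending in each tag
    back = []                                  # backpointers per token
    for em in emissions[1:]:
        prev_best, scores = {}, {}
        for cur in tags:
            prev, s = max(
                ((p, best[p] + transitions[(p, cur)] + em[cur]) for p in tags),
                key=lambda x: x[1])
            prev_best[cur], scores[cur] = prev, s
        back.append(prev_best)
        best = scores
    tag = max(best, key=best.get)              # backtrack from the best end tag
    path = [tag]
    for prev_best in reversed(back):
        tag = prev_best[tag]
        path.append(tag)
    return list(reversed(path))

tags = ["O", "B-LOC", "I-LOC"]
# The transition scores forbid I-LOC directly after O, which is how a CRF
# enforces valid BIO sequences.
trans = {(p, c): (-10.0 if (p == "O" and c == "I-LOC") else 0.0)
         for p in tags for c in tags}
emissions = [  # tokens: 北京 / 的 / 天气
    {"O": 0.1, "B-LOC": 2.0, "I-LOC": 0.5},
    {"O": 1.5, "B-LOC": 0.0, "I-LOC": 0.2},
    {"O": 2.0, "B-LOC": 0.0, "I-LOC": 0.0},
]
print(viterbi(emissions, trans, tags))  # ['B-LOC', 'O', 'O']
```

In a full ID-CNN-CRF the emission scores come from the convolutional network and the transition scores are learned, but the decoding is the same.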
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please create a new issue if you need more help.
With PR https://github.com/RasaHQ/rasa_nlu/pull/1095 we have an embedding + crf pipeline which can do intent and entity recognition in any language.
If you are currently testing this and are willing to share your results, please do so! Benchmarks together with datasets are especially welcome.
We expect performance to vary:
So far, I've heard of people in the community using this pipeline for:
and I'm sure there are many more. Please let us know!