RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
https://rasa.com/docs/rasa/
Apache License 2.0

MITIE and Chinese support #972

Closed amn41 closed 6 years ago

amn41 commented 6 years ago

Attn: users who use Rasa NLU for Chinese. Could you please try your datasets (at least intent classification) with the new tensorflow_embedding pipeline? We would love to know how the performance is.

We are thinking of dropping support for MITIE because training times are long, and in our regular performance benchmarks it doesn't show any advantages in terms of performance.

However, to my knowledge most users who use Rasa to do Chinese NLU use MITIE, so I would love to understand how well alternatives do there.

wrathagom commented 6 years ago

lol, always one for brevity. Though I am assuming the No description provided should read something like:

Since removing MITIE we've discovered that MITIE was the closest/easiest path for our users to get Chinese NLU working. Now that we've removed it we may have to add it back for Chinese support or work to get spacy understanding Chinese.

Not trying to put words in your mouth or anything ;) Linking a couple issues here just for cross reference.

#975

#705

howl-anderson commented 6 years ago

As a spaCy contributor, I am currently working on adding Chinese language support to spaCy. I have already talked with the official spaCy developers about this; they are also working hard on it, and I will cooperate with them to complete the project. I don't know the release date of spaCy with Chinese support, but it should be released with good performance in the near future. I will keep the Rasa community updated as more details emerge.

tmbo commented 6 years ago

That sounds very promising!

winner484 commented 6 years ago

@howl-anderson 👍
Roughly how long will it take? Would one month be enough?

howl-anderson commented 6 years ago

Q: How long will it take before a spaCy model with Chinese language support is released? (2018-04-11) A: It's hard to tell when the model will be released, because it must first be tested and show good or at least acceptable performance. spaCy also needs several changes to support Chinese, Japanese and Vietnamese, which will take time too.

winner484 commented 6 years ago

@howl-anderson thank you!

winner484 commented 6 years ago

spaCy supports Chinese now: https://github.com/howl-anderson/Chinese_models_for_SpaCy but entities cannot be detected. Need help!

Environment: Ubuntu, Python 3.5, rasa_nlu 0.12.3, spaCy 2.

Steps:
1. Install zh_core_web_sm.
2. python3 -m spacy link zh_core_web_sm zh
3. Train.

Issue: intent classification is OK, but no entities are returned. Need help!

wrathagom commented 6 years ago

@winner484 is it ner_crf not returning any entities or ner_spacy?

winner484 commented 6 years ago

@wrathagom No entity has ever been returned. I have tried many texts, but never got any entities back.

winner484 commented 6 years ago

The metadata.json in the model directory is:

{
  "training_data": "training_data.json",
  "pipeline": [
    {
      "case_sensitive": false,
      "model": "zh",
      "class": "rasa_nlu.utils.spacy_utils.SpacyNLP",
      "name": "nlp_spacy"
    },
    {
      "class": "rasa_nlu.tokenizers.spacy_tokenizer.SpacyTokenizer",
      "name": "tokenizer_spacy"
    },
    {
      "class": "rasa_nlu.featurizers.spacy_featurizer.SpacyFeaturizer",
      "name": "intent_featurizer_spacy"
    },
    {
      "regex_file": "regex_featurizer.json",
      "class": "rasa_nlu.featurizers.regex_featurizer.RegexFeaturizer",
      "name": "intent_entity_featurizer_regex"
    },
    {
      "class": "rasa_nlu.extractors.crf_entity_extractor.CRFEntityExtractor",
      "max_iterations": 50,
      "features": [
        ["low", "title", "upper", "pos", "pos2"],
        ["bias", "low", "word3", "word2", "upper", "title", "digit", "pos", "pos2", "pattern"],
        ["low", "title", "upper", "pos", "pos2"]
      ],
      "L1_c": 1,
      "name": "ner_crf",
      "L2_c": 0.001,
      "BILOU_flag": true,
      "classifier_file": "crf_model.pkl"
    },
    {
      "class": "rasa_nlu.extractors.entity_synonyms.EntitySynonymMapper",
      "name": "ner_synonyms",
      "synonyms_file": "entity_synonyms.json"
    },
    {
      "class": "rasa_nlu.classifiers.sklearn_intent_classifier.SklearnIntentClassifier",
      "name": "intent_classifier_sklearn",
      "classifier_file": "intent_classifier_sklearn.pkl",
      "max_cross_validation_folds": 5,
      "C": [1, 2, 5, 10, 20, 100],
      "kernels": ["linear"]
    }
  ],
  "trained_at": "20180503-103724",
  "language": "zh",
  "rasa_nlu_version": "0.12.3"
}

winner484 commented 6 years ago

$ curl -X POST localhost:5000/parse -d '{"q":"我想吃火锅"}' | python -m json.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   645    0   622  100    23  17210    636 --:--:-- --:--:-- --:--:-- 17771
{
  "entities": [],
  "intent": {
    "confidence": 0.45199854018449354,
    "name": "restaurant_search"
  },
  "intent_ranking": [
    { "confidence": 0.45199854018449354, "name": "restaurant_search" },
    { "confidence": 0.3750782818220956, "name": "medical" },
    { "confidence": 0.11279676245958703, "name": "affirm" },
    { "confidence": 0.04185093011383089, "name": "goodbye" },
    { "confidence": 0.018275485419993073, "name": "greet" }
  ],
  "model": "model_20180503-103724",
  "project": "default",
  "text": "\u6211\u60f3\u5403\u706b\u9505"
}
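The "text" field in the response is the submitted query written with JSON \u escapes, so the server did receive the Chinese text intact. A small standard-library check (a sketch for verification only, not part of Rasa) decodes it back:

```python
import json

# Escaped form exactly as it appears in the "text" field of the response.
escaped = '"\\u6211\\u60f3\\u5403\\u706b\\u9505"'

# json.loads turns the \u escapes back into the original characters.
decoded = json.loads(escaped)

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# which is why the response shows \uXXXX sequences instead of Chinese text.
reencoded = json.dumps(decoded)

print(decoded)
print(reencoded == escaped)
```

The decoded string is five characters long and 15 bytes in UTF-8, which is consistent with the 23 bytes that curl reports as uploaded for the whole request body.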

winner484 commented 6 years ago

And the part of the training data for the entity "火锅" (hot pot) is here:

{
  "text": "我想吃火锅啊",
  "intent": "restaurant_search",
  "entities": [
    {
      "start": 2,
      "end": 5,
      "value": "火锅",
      "entity": "food"
    }
  ]
},

wrathagom commented 6 years ago

@winner484 speaking without being able to read the language 😅 Are you providing more entity examples than just that one? Entities can take a lot of data to train. Also, if 火锅 is really the entity, then it is mislabeled: I believe the training data should have a range from 3 to 5 instead of 2 to 5.

{
  "text": "我想吃火锅啊",
  "intent": "restaurant_search",
  "entities": [
    {
      "start": 3,
      "end": 5,
      "value": "火锅",
      "entity": "food"
    }
  ]
},
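
The offsets can be checked mechanically rather than by eye. The sketch below assumes that "start" and "end" are character offsets with an exclusive end, so that text[start:end] should equal "value"; the helper names entity_span and mislabeled_entities are made up for illustration and are not part of Rasa NLU.

```python
def entity_span(text, value):
    """Return (start, end) character offsets of value in text, end exclusive."""
    start = text.find(value)
    if start == -1:
        raise ValueError("value does not occur in text")
    return start, start + len(value)


def mislabeled_entities(example):
    """Return the entities whose offsets do not slice out their value."""
    text = example["text"]
    return [
        entity
        for entity in example["entities"]
        if text[entity["start"]:entity["end"]] != entity["value"]
    ]


# The training example as originally posted, with start = 2.
example = {
    "text": "我想吃火锅啊",
    "intent": "restaurant_search",
    "entities": [{"start": 2, "end": 5, "value": "火锅", "entity": "food"}],
}

bad = mislabeled_entities(example)           # offsets 2..5 cover three characters
fixed = entity_span(example["text"], "火锅")  # offsets recomputed from the value

print(len(bad), fixed)
```

Running a check like this over the whole training file before training would flag every example whose span and value disagree.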
howl-anderson commented 6 years ago

@wrathagom @winner484 Just for the record: although https://github.com/howl-anderson/Chinese_models_for_SpaCy is currently the only spaCy model that supports Chinese, it is not the official Chinese language model for spaCy and, most importantly, it is still a work in progress. Named Entity Recognition (NER) is currently (2018-05-03) not supported; I am still working on it.

winner484 commented 6 years ago

@howl-anderson thank you for your great work! May I learn from you? Maybe I could help you finish the job.

winner484 commented 6 years ago

@wrathagom thank you

buivietan commented 6 years ago

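@amn41 I am using Rasa to do Japanese NLU with MITIE and the result is quite good. My config_mitie_ja.yml is:

language: "ja"

pipeline:
- name: "nlp_mitie"
  model: "mitie/total_word_feature_extractor_ja.dat"
- name: "tokenizer_japanese"  # I used tinysegmenter as the Japanese tokenizer
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"  # I modified the intent classifier: instead of GridSearchCV I used a linear model with logistic regression.

My result after training the model:

{'entities': [{'extractor': 'ner_mitie',
               'start': 0,
               'confidence': None,
               'value': '千葉',
               'end': 2,
               'entity': 'ロケーション'}],
 'intent': {'confidence': 0.9422146832263528, 'name': 'レストランを検索する'},
 'intent_ranking': [{'confidence': 0.9422146832263528, 'name': 'レストランを検索する'},
                    {'confidence': 0.038330105668737326, 'name': '肯定する'},
                    {'confidence': 0.011094799507902988, 'name': 'さようなら'},
                    {'confidence': 0.008360411597006933, 'name': '挨拶する'}],
 'text': '千葉にレストランを探したい。'}

(The query means "I want to look for a restaurant in Chiba". The intent names are "search for a restaurant", "affirm", "goodbye" and "greet", and the entity label is "location".)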

tmbo commented 6 years ago

Very cool :+1: I think it might make sense to provide default configurations for different languages to make it even easier to get started with a certain language. Thoughts?

geekboood commented 6 years ago

@amn41 I don't yet understand how the supervised word vectors work before the corpus is fed into the TensorFlow model. Could I just segment a Chinese sentence with a tokenizer such as Jieba, join the result with spaces, and put it into count_vectors_featurizer (maybe I should tweak some parameters there)? The result would then go straight into the tensorflow_embedding component. Should this procedure work?

howl-anderson commented 6 years ago

@geekboood As far as I know, it should work. I am also working on a PR so that count_vectors_featurizer can use the tokens provided by tokenizers such as Jieba. It will be released soon. Update: it is released in #1115.
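
One parameter that likely needs tweaking in the procedure above is the token pattern. The sketch below uses only the standard library and assumes the featurizer counts tokens with scikit-learn's default CountVectorizer pattern, which keeps only tokens of two or more word characters. The token list is a hypothetical segmentation of the kind a tokenizer such as Jieba might return, and bag_of_words is an illustrative helper, not a Rasa function.

```python
import re
from collections import Counter

# Hypothetical pre-segmented sentence, joined with spaces as proposed above.
tokens = ["我", "想", "吃", "火锅"]
joined = " ".join(tokens)

# Default token pattern of scikit-learn's CountVectorizer: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern that also keeps single-character tokens.
RELAXED_PATTERN = r"(?u)\b\w+\b"


def bag_of_words(text, pattern):
    """Count the tokens that the given regular expression extracts from text."""
    return Counter(re.findall(pattern, text))


default_counts = bag_of_words(joined, DEFAULT_PATTERN)
relaxed_counts = bag_of_words(joined, RELAXED_PATTERN)

print(default_counts)
print(relaxed_counts)
```

With the default pattern, single-character words are silently dropped, which matters for Chinese because many words are one character long. Relaxing the pattern keeps all four tokens.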

ilham-bintang commented 5 years ago

@amn41 I am using Rasa to do Japanese NLU with MITIE and the result is quite good. My config_mitie_ja.yml is:

language: "ja"

pipeline:
- name: "nlp_mitie"
  model: "mitie/total_word_feature_extractor_ja.dat"
- name: "tokenizer_japanese"  # I used tinysegmenter as Japanese tokenizer
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"  # I modified the intent classifier. Instead of GridSearchCV I used linear model with Logistic regression as intent classifier.

My result after training model:

{'entities': [{'extractor': 'ner_mitie',
               'start': 0,
               'confidence': None,
               'value': '千葉',
               'end': 2,
               'entity': 'ロケーション'}],
 'intent': {'confidence': 0.9422146832263528, 'name': 'レストランを検索する'},
 'intent_ranking': [{'confidence': 0.9422146832263528, 'name': 'レストランを検索する'},
                    {'confidence': 0.038330105668737326, 'name': '肯定する'},
                    {'confidence': 0.011094799507902988, 'name': 'さようなら'},
                    {'confidence': 0.008360411597006933, 'name': '挨拶する'}],
 'text': '千葉にレストランを探したい。'}

Hi. Where can I get mitie/total_word_feature_extractor_ja.dat?

aparnak123 commented 4 years ago

@wrathagom @winner484 Just for the record: although https://github.com/howl-anderson/Chinese_models_for_SpaCy is currently the only spaCy model that supports Chinese, it is not the official Chinese language model for spaCy and, most importantly, it is still a work in progress. Named Entity Recognition (NER) is currently (2018-05-03) not supported; I am still working on it.

Is there any progress?

howl-anderson commented 4 years ago

@aparnak123 Hi, NER is supported now. See news https://github.com/howl-anderson/Chinese_models_for_SpaCy/blob/master/README.en-US.md#ner-new .