Closed: amn41 closed this issue 6 years ago
lol, always one for brevity. Though I am assuming that "No description provided" should read something like:
Since removing MITIE we've discovered that MITIE was the closest/easiest path for our users to get Chinese NLU working. Now that we've removed it we may have to add it back for Chinese support or work to get spacy understanding Chinese.
Not trying to put words in your mouth or anything ;) Linking a couple issues here just for cross reference.
As a spaCy contributor, I am currently working on adding Chinese language support to spaCy. I have already been in contact with the spaCy core developers about this; they are also working very hard on this topic, and I will cooperate with them to complete this project. I don't know the release date of spaCy with Chinese language support, but it will be released with good performance in the near future. If there are more details, I will keep the Rasa community updated.
That sounds very promising!
@howl-anderson
Roughly how long will it take? Could it be done in one month?
Q: How long will it take before the release of the spaCy model with Chinese language support? (2018-04-11)
A: It's hard to tell when the model will be released, because the model must be tested to show good/acceptable performance. spaCy also needs to make several changes to support Chinese, Japanese and Vietnamese, which will take time too.
@howl-anderson thank you!
ubuntu -- python3.5 -- "rasa_nlu_version": "0.12.3" -- spaCy 2 -- steps: (1) install zh_core_web_sm, (2) python3 -m spacy link zh_core_web_sm zh, (3) train. Issue: intent is OK, but I don't get any entities. NEED HELP!
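For reference, a config matching this setup would look roughly like the following. This is only a sketch based on the pipeline listed in the trained model's metadata.json; it assumes Rasa NLU 0.12 component names and that zh_core_web_sm has been linked as `zh` (step 2 above):

```yaml
language: "zh"
pipeline:
- name: "nlp_spacy"
  model: "zh"
- name: "tokenizer_spacy"
- name: "intent_featurizer_spacy"
- name: "intent_entity_featurizer_regex"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_classifier_sklearn"
```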
@winner484 is it ner_crf
not returning any entities or ner_spacy
?
@wrathagom never had an entity return. i have tried many text, but never had any entities return
the metadata.json in the model is:

```json
{
  "training_data": "training_data.json",
  "pipeline": [
    { "name": "nlp_spacy", "class": "rasa_nlu.utils.spacy_utils.SpacyNLP", "model": "zh", "case_sensitive": false },
    { "name": "tokenizer_spacy", "class": "rasa_nlu.tokenizers.spacy_tokenizer.SpacyTokenizer" },
    { "name": "intent_featurizer_spacy", "class": "rasa_nlu.featurizers.spacy_featurizer.SpacyFeaturizer" },
    { "name": "intent_entity_featurizer_regex", "class": "rasa_nlu.featurizers.regex_featurizer.RegexFeaturizer", "regex_file": "regex_featurizer.json" },
    { "name": "ner_crf", "class": "rasa_nlu.extractors.crf_entity_extractor.CRFEntityExtractor",
      "max_iterations": 50,
      "features": [
        ["low", "title", "upper", "pos", "pos2"],
        ["bias", "low", "word3", "word2", "upper", "title", "digit", "pos", "pos2", "pattern"],
        ["low", "title", "upper", "pos", "pos2"]
      ],
      "L1_c": 1, "L2_c": 0.001, "BILOU_flag": true, "classifier_file": "crf_model.pkl" },
    { "name": "ner_synonyms", "class": "rasa_nlu.extractors.entity_synonyms.EntitySynonymMapper", "synonyms_file": "entity_synonyms.json" },
    { "name": "intent_classifier_sklearn", "class": "rasa_nlu.classifiers.sklearn_intent_classifier.SklearnIntentClassifier",
      "classifier_file": "intent_classifier_sklearn.pkl", "max_cross_validation_folds": 5,
      "C": [1, 2, 5, 10, 20, 100], "kernels": ["linear"] }
  ],
  "trained_at": "20180503-103724",
  "language": "zh",
  "rasa_nlu_version": "0.12.3"
}
```
```
$ curl -X POST localhost:5000/parse -d '{"q":"我想吃火锅"}' | python -m json.tool
{
    "entities": [],
    "intent": { "confidence": 0.45199854018449354, "name": "restaurant_search" },
    "intent_ranking": [
        { "confidence": 0.45199854018449354, "name": "restaurant_search" },
        { "confidence": 0.3750782818220956, "name": "medical" },
        { "confidence": 0.11279676245958703, "name": "affirm" },
        { "confidence": 0.04185093011383089, "name": "goodbye" },
        { "confidence": 0.018275485419993073, "name": "greet" }
    ],
    "model": "model_20180503-103724",
    "project": "default",
    "text": "\u6211\u60f3\u5403\u706b\u9505"
}
```
and the part of the training data about the entity "火锅" is here:

```json
{
  "text": "我想吃火锅啊",
  "intent": "restaurant_search",
  "entities": [
    { "start": 2, "end": 5, "value": "火锅", "entity": "food" }
  ]
},
```
@winner484 (speaking without being able to read the language) are you providing more entity examples than just that one? Entities can take a lot of data to train. Also, if 火锅 is really the entity, then it is mislabeled: I believe the training data should have a range from 3 to 5 instead of 2 to 5.
```json
{
  "text": "我想吃火锅啊",
  "intent": "restaurant_search",
  "entities": [
    { "start": 3, "end": 5, "value": "火锅", "entity": "food" }
  ]
},
```
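These offsets follow Python slicing conventions (start inclusive, end exclusive), so the correction is easy to verify with the phrase from the parse request above:

```python
# Rasa NLU entity offsets are Python slice indices into the text:
# start is inclusive, end is exclusive.
text = "我想吃火锅"  # "I want to eat hot pot"

# The original annotation (start=2) also captures the verb 吃 ("eat"):
assert text[2:5] == "吃火锅"

# The corrected annotation (start=3) picks out exactly the entity:
assert text[3:5] == "火锅"  # "hot pot"
print(text[3:5])
```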
@wrathagom @winner484 Just for the record: although https://github.com/howl-anderson/Chinese_models_for_SpaCy is currently the only spaCy model that supports Chinese, it is not an official Chinese language model for spaCy, and most importantly it is still a work in progress. Named Entity Recognition (NER) is currently (2018-05-03) not supported; I am still working on it.
@howl-anderson thank you for your great work! May I learn from you? Maybe I could help you finish the job.
@wrathagom thank you
@amn41 I am using Rasa to do Japanese NLU with MITIE and the results are quite good. My config_mitie_ja.yml is:

language: "ja"
pipeline:
- name: "nlp_mitie"
  model: "mitie/total_word_feature_extractor_ja.dat"
- name: "tokenizer_japanese" # I used tinysegmenter as the Japanese tokenizer
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn" # I modified the intent classifier: instead of GridSearchCV I used a linear model with logistic regression.

My result after training the model: {'entities': [{'extractor': 'ner_mitie', 'start': 0, 'confidence': None, 'value': '千葉', 'end': 2, 'entity': 'ロケーション'}], 'intent': {'confidence': 0.9422146832263528, 'name': 'レストランを検索する'}, 'intent_ranking': [{'confidence': 0.9422146832263528, 'name': 'レストランを検索する'}, {'confidence': 0.038330105668737326, 'name': '肯定する'}, {'confidence': 0.011094799507902988, 'name': 'さようなら'}, {'confidence': 0.008360411597006933, 'name': '挨拶する'}], 'text': '千葉にレストランを探している'}
very cool :+1: I think it might make sense to provide default configurations for different languages to make it even easier to get started with a certain language. Thoughts?
@amn41 I don't yet understand how the supervised word vectors work before the corpus is fed into the tensorflow model. Could I just segment a Chinese sentence using some tokenizer such as Jieba, and then join the result with spaces? Then I would put that into the count_vectors_featurizer (maybe tweaking some parameters), and the result would go straight into the tensorflow_embedding part. Should the above procedure work?
@geekboood As far as I know, it should work. Also, I am working on a PR to make sure count_vectors_featurizer can also use the tokens provided by tokenizers such as Jieba. It will be released soon. (Update: it is released at #1115.)
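The segment-then-join idea from the question can be sketched as follows. This is only an illustration: `segment` below is a toy longest-match stand-in for a real segmenter like Jieba, and `Counter` stands in for what a whitespace-based count featurizer such as count_vectors_featurizer effectively computes.

```python
from collections import Counter

# Toy word list standing in for Jieba's dictionary (hypothetical; a real
# segmenter is statistical and handles unknown words far more robustly).
VOCAB = ["我", "想", "吃", "火锅"]

def segment(text):
    """Greedy longest-match segmentation over the toy word list."""
    tokens, i = [], 0
    while i < len(text):
        for word in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(word, i):
                tokens.append(word)
                i += len(word)
                break
        else:
            # Unknown character: emit it as a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

# Step 1: segment the sentence, then re-join with spaces so that a
# whitespace-splitting featurizer sees word-level tokens.
spaced = " ".join(segment("我想吃火锅"))
print(spaced)  # 我 想 吃 火锅

# Step 2: count token occurrences, as a count-vector featurizer would.
counts = Counter(spaced.split())
print(counts["火锅"])  # 1
```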
@amn41 I am using Rasa to do Japanese NLU with MITIE and the results are quite good. My config_mitie_ja.yml is:

language: "ja"
pipeline:
- name: "nlp_mitie"
  model: "mitie/total_word_feature_extractor_ja.dat"
- name: "tokenizer_japanese" # I used tinysegmenter as the Japanese tokenizer
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn" # I modified the intent classifier: instead of GridSearchCV I used a linear model with logistic regression.

My result after training the model: {'entities': [{'extractor': 'ner_mitie', 'start': 0, 'confidence': None, 'value': '千葉', 'end': 2, 'entity': 'ロケーション'}], 'intent': {'confidence': 0.9422146832263528, 'name': 'レストランを検索する'}, 'intent_ranking': [{'confidence': 0.9422146832263528, 'name': 'レストランを検索する'}, {'confidence': 0.038330105668737326, 'name': '肯定する'}, {'confidence': 0.011094799507902988, 'name': 'さようなら'}, {'confidence': 0.008360411597006933, 'name': '挨拶する'}], 'text': '千葉にレストランを探している'}
Hi. Where can you get mitie/total_word_feature_extractor_ja.dat?
@wrathagom @winner484 Just for the record: although https://github.com/howl-anderson/Chinese_models_for_SpaCy is currently the only spaCy model that supports Chinese, it is not an official Chinese language model for spaCy, and most importantly it is still a work in progress. Named Entity Recognition (NER) is currently (2018-05-03) not supported; I am still working on it.
Is there any progress?
@aparnak123 Hi, NER is supported now. See the news at https://github.com/howl-anderson/Chinese_models_for_SpaCy/blob/master/README.en-US.md#ner-new
Attn: users who use Rasa NLU for Chinese. Could you please try your datasets (at least intent classification) with the new tensorflow_embedding pipeline? We would love to know how the performance is. We are thinking of dropping support for MITIE because training times are long, and in our regular performance benchmarks it doesn't show any advantages in terms of performance.
However, to my knowledge most users who use Rasa to do Chinese NLU use MITIE, so I would love to understand how well alternatives do there.
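For anyone who wants to try this with Chinese, a minimal config sketch would be the following. It assumes a Rasa NLU version that includes the Jieba tokenizer mentioned in #1115; component names may vary between versions.

```yaml
language: "zh"
pipeline:
- name: "tokenizer_jieba"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
```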