deepiksdev / rasa_nlu

turn natural language into structured data
https://rasa.ai
Apache License 2.0

Train Chinese language #1

Closed deepiksdev closed 7 years ago

deepiksdev commented 7 years ago

See https://rasa-nlu.readthedocs.io/en/stable/languages.html

ALLENWAN1 commented 7 years ago

@deepiksdev I installed rasa_nlu successfully on my MacBook, but when I ran the server according to the tutorial I got an error: ERROR:root:Failed to load model 'model_20170613-113720'. Error: 1, which is the same as https://github.com/RasaHQ/rasa_nlu/issues/284. Then I updated the rasa_nlu repo and reinstalled rasa_nlu and all its dependencies from requirements.txt and dev-requirements.txt, and it works now. So I suggest that we update the repo.

ALLENWAN1 commented 7 years ago

@deepiksdev Successfully tested the English version. I sent a request to the server with curl -XPOST localhost:5000/parse -d '{"q":"I am looking for Chinese food"}' | python -mjson.tool and I got

{
    "entities": [
        {
            "end": 24,
            "entity": "NORP",
            "extractor": "ner_spacy",
            "start": 17,
            "value": "Chinese"
        }
    ],
    "intent": {
        "confidence": 0.7805998076148585,
        "name": "restaurant_search"
    },
    "intent_ranking": [
        {
            "confidence": 0.7805998076148585,
            "name": "restaurant_search"
        },
        {
            "confidence": 0.0941985709844514,
            "name": "greet"
        },
        {
            "confidence": 0.08374048180790579,
            "name": "goodbye"
        },
        {
            "confidence": 0.04146113959278425,
            "name": "affirm"
        }
    ],
    "text": "I am looking for Chinese food"
}

ALLENWAN1 commented 7 years ago

@deepiksdev I finished segmenting the Chinese Wikipedia dataset:

  1. Downloaded the dataset from https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
  2. Extracted the dump to a plain-text file.
  3. Used opencc to convert traditional Chinese to simplified Chinese.
  4. Converted the text to UTF-8.
  5. Used the jieba library to segment the text.
  6. Wrote a Python script that processes the text and generates a word-frequency file (sketched below).
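
A minimal sketch of steps 3 to 6, assuming the opencc and jieba Python packages; file names are hypothetical, and the conversion could equally be done with the opencc command-line tool:

import jieba
from collections import Counter
from opencc import OpenCC

cc = OpenCC('t2s')  # traditional -> simplified
freq = Counter()

with open('zhwiki-latest.txt', encoding='utf-8') as src, \
        open('zhwiki-segmented.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        simplified = cc.convert(line.strip())
        words = [w for w in jieba.cut(simplified) if w.strip()]
        freq.update(words)
        dst.write(' '.join(words) + '\n')

# word-frequency file: word<TAB>count, most frequent first
with open('zhwiki-freq.txt', 'w', encoding='utf-8') as out:
    for word, count in freq.most_common():
        out.write('%s\t%d\n' % (word, count))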
ALLENWAN1 commented 7 years ago

@deepiksdev I created a folder on S3 and uploaded 4 files to it. You can find them at https://s3.amazonaws.com/deepiks-training/datasets/rasa_chinese

deepiksdev commented 7 years ago

@ALLENWAN1 Great, thanks. Please commit the python file.

ALLENWAN1 commented 7 years ago

@deepiksdev I just finished training word vectors on the Chinese Wikipedia data, and I ran a quick test. The results are very good and interesting. When I search for the words most similar to "吃饭" I got

[screenshot 2017-06-15 16:53:50: most similar words to "吃饭"]

When I search for the words most similar to "中文" I got

[screenshot 2017-06-15 16:57:23: most similar words to "中文"]

Based on my Chinese background, I think the results are very good.
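
A minimal sketch of this step, assuming gensim's word2vec implementation (parameter names follow the gensim 2.x/3.x API of that era; file names are hypothetical):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# one space-separated, pre-segmented line per article
sentences = LineSentence('zhwiki-segmented.txt')
model = Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)
model.save('zhwiki-word2vec.model')

# nearest neighbours, e.g. for "吃饭" (to eat) and "中文" (Chinese)
print(model.wv.most_similar('吃饭', topn=10))
print(model.wv.most_similar('中文', topn=10))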

deepiksdev commented 7 years ago

@ALLENWAN1 Congrats, this sounds great! Can you please upload the model to S3?

ALLENWAN1 commented 7 years ago

I uploaded the model to S3. You can find it at https://s3.amazonaws.com/deepiks-training/models/rasa_chinese

deepiksdev commented 7 years ago

Brilliant, many thanks.

ALLENWAN1 commented 7 years ago

@deepiksdev I am going to work on defining the Chinese language rules needed to generate the Rasa model.

deepiksdev commented 7 years ago

@ALLENWAN1

I am going to work on defining the Chinese language rules needed to generate the Rasa model.

I am not sure I understand what you mean. Can you point me to some documentation?

ALLENWAN1 commented 7 years ago

@deepiksdev https://spacy.io/docs/usage/adding-languages Maybe I didn't explain it well.

deepiksdev commented 7 years ago

OK, thanks. I was not aware Rasa was using Spacy.

ALLENWAN1 commented 7 years ago

@deepiksdev Rasa provides two backend options; see the training part of http://rasa-nlu.readthedocs.io/en/stable/tutorial.html.
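
For context, the backend is selected through the pipeline setting in the config file, with spacy_sklearn as the Spacy-backed option and mitie as the other. A rough sketch of the training flow from the tutorial of that era (the 0.9-style Python API; file paths are hypothetical):

from rasa_nlu.converters import load_data
from rasa_nlu.config import RasaNLUConfig
from rasa_nlu.model import Trainer

# config_spacy.json selects the backend, e.g. {"pipeline": "spacy_sklearn"};
# {"pipeline": "mitie"} would select the MITIE backend instead
training_data = load_data('data/examples/rasa/demo-rasa.json')
trainer = Trainer(RasaNLUConfig('config_spacy.json'))
trainer.train(training_data)
model_directory = trainer.persist('./models/')  # where the trained model is stored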

ALLENWAN1 commented 7 years ago

@deepiksdev While studying the rules for Chinese, I found some differences between Chinese and English. For example, take "I want to swim" and "Swimming is interesting": we know "swim" is a verb and "swimming" is a noun.

But in Chinese we say "我要去(I want to)游泳(swim)" and "游泳(swimming)很有意思(is interesting)". So "游泳" can be a verb and also a noun, which may cause a conflict.
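
To see the ambiguity concretely, jieba's POS tagger can be run on both sentences (a small sketch; jieba uses an ICTCLAS-style tagset where v marks verbs and n/vn mark noun-like words):

import jieba.posseg as pseg

for sentence in ['我要去游泳', '游泳很有意思']:
    # pseg.cut yields (word, flag) pairs
    print(sentence, [(w.word, w.flag) for w in pseg.cut(sentence)])
# "游泳" may come back with the same tag in both sentences, which is
# exactly the verb/noun conflict described above.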

deepiksdev commented 7 years ago

@ALLENWAN1 Thanks for this explanation. Can you explain why this is a problem for the current issue?

ALLENWAN1 commented 7 years ago

@deepiksdev Because I am afraid it will cause mistakes in NER. I am not sure; we will see once we have finished the language model.

deepiksdev commented 7 years ago

@ALLENWAN1 OK, we will see.

deepiksdev commented 7 years ago

@ALLENWAN1 Have you looked at how Spacy implements Chinese: https://github.com/explosion/spaCy/tree/master/spacy/zh ?

ALLENWAN1 commented 7 years ago

@deepiksdev I am working through the adding-languages steps.

The first step, the language subclass: I have finished it.

The second step, stop words, tag map and tokenizer exceptions: I am working on it. It is a little complicated because it differs from language to language, and Chinese is quite different from the others.

The third part, the data: I have almost finished it.

The last step needs the results of the first three parts.

ALLENWAN1 commented 7 years ago

@deepiksdev

@ALLENWAN1 Have you looked at how Spacy implements Chinese: https://github.com/explosion/spaCy/tree/master/spacy/zh ?

I had found it before, and I am doing the same thing they did. But nobody finished it, so I need to continue. Roughly, it works like the sketch below.
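
For reference, the linked spacy/zh module (on the Spacy 1.x API) subclasses Language and overrides make_doc so that jieba does the segmentation instead of the default whitespace-based tokenizer; roughly:

import jieba
from spacy.language import Language
from spacy.tokens import Doc

class Chinese(Language):
    lang = 'zh'

    def make_doc(self, text):
        words = list(jieba.cut(text, cut_all=False))
        # spaces=False everywhere: written Chinese has no spaces between tokens
        return Doc(self.vocab, words=words, spaces=[False] * len(words))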

deepiksdev commented 7 years ago

@ALLENWAN1

I believe the Spacy tokenizer will probably not work, because it is based on spaces: see https://spacy.io/docs/usage/customizing-tokenizer . This may prevent us from using Spacy.

deepiksdev commented 7 years ago

@ALLENWAN1

I believe the Spacy tokenizer will probably not work, because it is based on spaces: see https://spacy.io/docs/usage/customizing-tokenizer . This may prevent us from using Spacy.

This is wrong. You need to install Jieba: https://pypi.python.org/pypi/jieba/ . Have you done so?

ALLENWAN1 commented 7 years ago

@deepiksdev This problem has been solved. By using jieba, I got the result.

deepiksdev commented 7 years ago

@ALLENWAN1

This problem has been solved. By using jieba, I got the result.

Sorry, I don't understand. What problem? What result?

ALLENWAN1 commented 7 years ago

@deepiksdev Sorry, I forgot to copy the link.

I believe the Spacy tokenizer will probably not work, because it is based on spaces: see https://spacy.io/docs/usage/customizing-tokenizer . This may prevent us from using Spacy.

I mean that I can do the tokenization now; it will not be a problem. The result is that I used jieba to segment the Wikipedia data, and the output looks good.

deepiksdev commented 7 years ago

@ALLENWAN1 Are you still working on this, or is it on hold?

ALLENWAN1 commented 7 years ago

@deepiksdev it's on hold.

deepiksdev commented 7 years ago

OK, thanks. Closing it for the time being.