RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
https://rasa.com/docs/rasa/
Apache License 2.0
18.63k stars 4.6k forks

Bahasa #1112

Closed Primtek closed 6 years ago

Primtek commented 6 years ago
**Rasa NLU version**:

**Operating system** (windows, osx, ...):

**Content of model configuration file**:
```yml
```

**Issue**: Has anybody already implemented rasa_nlu in Bahasa? I did not find any article or reference saying that Bahasa is supported. Any information would be appreciated. Thanks.

souvikg10 commented 6 years ago

There are a few ways you can support a custom language.

  1. Use the TensorFlow pipeline and train word vectors from scratch. This post from Alan might be useful: https://medium.com/rasa-blog/supervised-word-vectors-from-scratch-in-rasa-nlu-6daf794efcd8

     I tried the same with a non-Latin language written in Latin characters. The results were good for me, but it was just an experiment: https://medium.com/@souvikghosh_14630/building-a-rasa-chatbot-in-bengali-using-supervised-word-vectors-from-scratch-740ebede51ee

  2. You can also use FastText from Facebook, which has support for over 170 languages (#869). Follow that thread to see how you can add FastText vectors to a spaCy language model and then use it with Rasa's spaCy pipeline.

  3. If you have raw language data, you can also train your own vectors with spaCy and then incorporate them in the same way as above: https://spacy.io/usage/training#section-basics
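
For the first option, the only language-specific input is the training data itself. Below is a minimal sketch of assembling a small Bahasa training file in Rasa NLU's JSON format. The example sentences, the intent names (`greet`, `book_ticket`), the `city` entity, and the helper functions are all hypothetical and only for illustration.

```python
import json

# Hypothetical Indonesian examples:
# (text, intent, [(entity_value, entity_name), ...])
EXAMPLES = [
    ("selamat pagi", "greet", []),
    ("halo", "greet", []),
    ("saya mau pesan tiket ke Jakarta", "book_ticket", [("Jakarta", "city")]),
    ("pesan tiket ke Surabaya", "book_ticket", [("Surabaya", "city")]),
]


def to_common_example(text, intent, entities):
    """Build one `common_examples` entry, computing character offsets."""
    annotated = []
    for value, name in entities:
        start = text.index(value)  # raises ValueError if the value is not in the text
        annotated.append(
            {"start": start, "end": start + len(value), "value": value, "entity": name}
        )
    return {"text": text, "intent": intent, "entities": annotated}


def build_training_data(examples):
    """Wrap the examples in the top-level `rasa_nlu_data` object."""
    return {
        "rasa_nlu_data": {
            "common_examples": [to_common_example(*ex) for ex in examples]
        }
    }


if __name__ == "__main__":
    # ensure_ascii=False keeps any non-ASCII characters readable in the output
    print(json.dumps(build_training_data(EXAMPLES), ensure_ascii=False, indent=2))
```

The printed JSON can be saved to a file and used as training data. A real dataset would need many more examples per intent than this toy set.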

saqib-ahmed commented 6 years ago

Just to add to the above comment: there's also this PR which, once merged, will provide language-agnostic NER too (intent classification is already language-agnostic with the TensorFlow embedding). You can then use the following pipeline for intent classification and entity recognition in any language:

pipeline:
- name: "tokenizer_whitespace"
- name: "ner_crf"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"

You can also try it before it gets merged; the code is in the lean-CRF branch. I tried it with a small Arabic dataset and it worked perfectly (identifying the correct intents and entities). It shouldn't be much of a problem to use it with Bahasa.
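
To see why this pipeline does not depend on a pretrained language model, here is a rough sketch in plain Python of what whitespace tokenization followed by count-vector featurization amounts to. This is a conceptual illustration, not Rasa's implementation. The function names, the lowercasing step, and the Indonesian sample sentences are assumptions.

```python
from collections import Counter


def whitespace_tokenize(text):
    """Split on whitespace; lowercasing is an assumption of this sketch."""
    return text.lower().split()


def build_vocabulary(sentences):
    """Assign each distinct token an index in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in whitespace_tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab


def count_vector(text, vocab):
    """Bag-of-words counts over the vocabulary; unknown tokens are ignored."""
    vector = [0] * len(vocab)
    for token, count in Counter(whitespace_tokenize(text)).items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


if __name__ == "__main__":
    training_sentences = ["selamat pagi", "pesan tiket ke Jakarta"]
    vocab = build_vocabulary(training_sentences)
    print(vocab)
    print(count_vector("pesan tiket tiket", vocab))
```

Nothing here relies on language-specific resources, which is why the approach carries over to Bahasa as long as words are separated by whitespace.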

Primtek commented 6 years ago

Thanks, everyone! Will try.