Closed Aleyasen closed 5 years ago
Sounds like a great idea, but one of the brains (@tmbo @amn41) would have to comment. Either way, it would likely be welcomed as a community contribution.
yes! as of the latest major release (with spaCy 2.0 support) you can now use fastText vectors with Rasa. We will add some documentation on how to do that
That's awesome @amn41. Thanks for your reply. It'll be great if you can point to the documentation here.
Download the vectors for your language, then run the script below (from the spaCy website, slightly modified to save the model to disk):
#!/usr/bin/env python
# coding: utf8
"""Load vectors for a language trained using fastText
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals
import plac
import numpy
import spacy
from spacy.language import Language


@plac.annotations(
    vectors_loc=("Path to .vec file", "positional", None, str),
    lang=("Optional language ID. If not set, blank Language() will be used.",
          "positional", None, str),
    output_dir=("Output dir", "positional", None, str))
def main(vectors_loc, lang=None, output_dir=None):
    if lang is None:
        nlp = Language()
    else:
        # create an empty language class – this is required if you're planning
        # to save the model to disk and load it back later (models always need
        # a "lang" setting). Use 'xx' for the blank multi-language class.
        nlp = spacy.blank(lang)
    with open(vectors_loc, 'rb') as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()
        nlp.vocab.reset_vectors(width=int(nr_dim))
        for line in file_:
            line = line.rstrip().decode('utf8')
            pieces = line.rsplit(' ', int(nr_dim))
            word = pieces[0]
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab
    nlp.to_disk(output_dir)


if __name__ == '__main__':
    plac.call(main)
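For reference, the `.vec` format the script parses is plain text: a header line with the row count and vector dimensionality, then one word per line followed by its components. A minimal stdlib-only sketch of that parsing step, using a tiny made-up two-dimensional file:

```python
import io

# A made-up two-word, two-dimensional ".vec" file, for illustration only
fake_vec = io.StringIO(
    "2 2\n"
    "नमस्कार 0.12 -0.05\n"
    "धन्यवाद 0.33 0.07\n"
)

# Header line: number of rows, then vector dimensionality
nr_row, nr_dim = (int(n) for n in fake_vec.readline().split())

vectors = {}
for line in fake_vec:
    # rsplit from the right, at most nr_dim times, so everything left of the
    # last nr_dim fields is treated as the word
    pieces = line.rstrip().rsplit(" ", nr_dim)
    vectors[pieces[0]] = [float(v) for v in pieces[1:]]

assert len(vectors) == nr_row
```

The `rsplit(' ', nr_dim)` in the script above works the same way: it peels the vector components off the right so the remaining prefix is the word.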
Then you can build a package following instructions there: https://spacy.io/usage/training#section-saving-loading
@znat I ran this code and it created a folder containing meta.json, the tokenizer, and a vocab subfolder. What do I need to do after that?
@luisdemarchi Then you can build a package following instructions there: https://spacy.io/usage/training#section-saving-loading
@znat OK. I discovered another "engine" that can replace spaCy, called UDPipe. It supports more than 100 languages, and according to a benchmark discussed there it wins in all of them. Is it possible for you to support it as well?
The thing is, udpipe doesn't provide word vectors. So it can only be used for tokenization and POS tagging.
Hi @znat, I loaded the fastText vectors and packaged the model, but it did not provide any tagger or parser, and I think Rasa needs a tagger, as I get an error when trying to evaluate it with Rasa.
In meta.json I added ["tagger", "parser", "ner"] to the pipeline.
What am I missing?
Oh, ner_crf might insist on POS tags; we should check that out, because it should work without them.
Since 0.12, the vectors alone are not enough. I took an existing spaCy model and replaced the vectors.
So you added the vectors from fastText to an existing spaCy model? I suppose I can add the tagger from an existing spaCy model, then?
Just take an existing model with a POS tagger (the standard one), replace the vectors with the ones you built above, and repackage the model. That's what I did.
Cool. I initialized an empty tagger with some examples and added it to the pipeline. @amn41, did you take a look at why we need a tagger for the CRF?
we're currently running some experiments on CRFs without POS tagging, looking promising so will report back soon :)
It seems there are some hacks involved. Is there a proper guide for using fastText together with ner_crf? Why and how do I have to train the spaCy POS tagger, and which data do I use for that? Or am I misunderstanding something? I just want to use fastText, not train a POS tagger. Using the existing spaCy model seems right, but where and how do I replace the word vectors?
@ctrado18 you shouldn't have problems using ner_crf with any fastText vectors. Just add ner_crf to your pipeline.
From the spaCy forum I got this answer on using fastText vectors: https://spacy.io/usage/vectors-similarity#converting
But that seems different from what's described here... Why do you have to use the code from the second post above? I would like to have this in the docs. It is confusing, and I'm not familiar with spaCy...
Hi all, I am building a chatbot in the Hindi (hi) language using the Rasa stack with the spaCy pipeline. I used fastText to build my Hindi model and linked it with spaCy, and the linking was successful. Now, when I try to use the Hindi model, I get the following error:
C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\training_data\training_data.py:192: UserWarning: Intent 'greet' has only 1 training examples! Minimum is 2, training may fail.
  self.MIN_EXAMPLES_PER_INTENT))
C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\h5py\__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "nlu_model.py", line 18, in <module>
    train_nlu('./data/data.json', 'config_spacy.json', './models/nlu')
  File "nlu_model.py", line 8, in train_nlu
    trainer = Trainer(config.load(configs))
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\model.py", line 155, in __init__
    self.pipeline = self._build_pipeline(cfg, component_builder)
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\model.py", line 166, in _build_pipeline
    component_name, cfg)
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\components.py", line 441, in create_component
    cfg)
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\registry.py", line 142, in create_component_by_name
    return component_clz.create(config)
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\utils\spacy_utils.py", line 73, in create
    nlp = spacy.load(spacy_model_name, parser=False)
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\util.py", line 119, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'hi'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
I also don't know how to proceed with Hindi now: what configuration will I need, and should intent names be in Hindi or in English? The same question applies to the entities. Can someone help me out? My model was successfully linked with spaCy.
This is strange: it looks like you have installed and linked your Hindi model in the correct location. I still suspect, though, that this is an installation issue.
By the way, fastText vectors are super useful but you can also consider using the tensorflow pipeline https://rasa.com/docs/nlu/languages/ which doesn't require a language model
@amn41 Thanks for the quick help. I can see that it says you can use any language with the tensorflow pipeline: `Rasa NLU can be used to understand any language, but some backends are restricted to specific languages. The tensorflow_embedding pipeline can be used for any language, because it trains custom word embeddings for your domain.`
But I can't find any help on how exactly to use it for a specific language like Hindi. There is no tutorial that uses any language other than English. Can you help me get started with Hindi using the tensorflow pipeline? That will work for me.
Sure - you don't need to do anything special actually, just use a configuration like:
language: "hi"
pipeline: "tensorflow_embedding"
(the language parameter doesn't actually do anything here). You can check out this cool post from @souvikg10 for an example
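For reference, `tensorflow_embedding` is a shortcut for a full pipeline. From memory of the 0.13 docs it expands to roughly the following (treat this as a sketch to adapt, not an exact copy of the docs):

```yaml
language: "hi"

pipeline:
- name: "tokenizer_whitespace"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
```

Because the tokenizer is whitespace-based and the embeddings are trained on your own data, nothing in it is English-specific, which is why the `language` value is effectively ignored.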
Hi @amn41, I was following the post and trying to build something out of it, but I am stuck again on the configuration. In English, if the intent is "greet", then utter_greet will be the action, and we can pass text for this action like "Hello... how can I help you?". But in Hindi, how do I name the utter actions? One more thing: our requirement is to not write sentences with English letters; we want to use Devanagari only, like "नमस्कार", not "namaskar". Check my configuration below. My config file:
my data.json file for Hindi
{
  "rasa_nlu_data": {
    "common_examples": [
      { "text": "प्रणाम", "intent": "नमस्कार", "entities": [] },
      { "text": "नमस्कार जी", "intent": "नमस्कार", "entities": [] },
      { "text": "आप को मेरा नमस्कार!", "intent": "नमस्कार", "entities": [] },
      { "text": "आपको मेरा प्रणाम", "intent": "नमस्कार", "entities": [] }
    ]
  }
}
Now I have a doubt about my domain file: how do I define intents and the other things? Check out the file below.
My doubt here is how to name my action. Can it be "utter_नमस्कार"? As far as I understood, the naming convention is that if the intent is "greet", the action will be "utter_greet".
Sorry, I think I missed something. I am not clear on how to name intents and intent actions. If it can't be "utter_नमस्कार", what should I use instead? Or shall I go with names in English and content in Hindi? For example, the data.json file would look like:
and my domain file would look like:
Can I follow this approach?
Sorry, I am troubling you a lot with such silly doubts, but I don't have an example to follow and I don't want to take a wrong approach. I really appreciate your help. Thanks a lot for everything.
Sorry if my answer wasn't clear. The names of your intents, actions, and utterances may be written in any language; there are no restrictions there. As long as your YAML is well-formatted, this will work. So you can have "utter_greet", "utter_नमस्कार", or just "नमस्कार". But yes, if you want these to be recognised as UtterActions, then they should start with utter_.
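A minimal domain sketch along those lines (the names and the utterance text here are purely illustrative; the Hindi reply means roughly "Hello! How can I help you?"):

```yaml
intents:
  - नमस्कार

templates:
  utter_नमस्कार:
    - text: "नमस्कार! मैं आपकी मदद कैसे कर सकता हूँ?"

actions:
  - utter_नमस्कार
```

The only hard requirement is the `utter_` prefix; everything after it can be Devanagari.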
Thanks a lot @amn41 You really helped me a lot. Thanks a ton.....
Hi. Thanks for the help on loading a fastText model. I now have the model ready at a path on disk. I'm wondering what to specify in the 'language' setting of the config file when training the model.
@Harish238 You just need to specify the path of your language model
Thanks for the help, @souvikg10! I successfully trained a Hindi NLU model by specifying the path of the model in the config file. However, the example sentence I provided to test the model is not being recognised, even though the same sentence is present in the training data.
Here is a snapshot of my config file -
{
  "pipeline": "spacy_sklearn",
  "language": "/home/user/project/word2vec_models/hindi/hi_model-0.0.0/hi_model/hi_model-0.0.0",
  "path": "./models/nlu",
  "data": "./training_data/rasa_training_data.json"
}
Here are some of the example sentences from the training data -
{"text": "फ़ेसबुक खोलो", "intent": "intent_open_app", "entities": [{"start": 0, "end": 8, "value": "फ़ेसबुक", "entity": "intent_open_app_entity_appname"}]},
{"text": "ओपन फ़ेसबुक", "intent": "intent_open_app", "entities": [{"start": 4, "end": 12, "value": "फ़ेसबुक", "entity": "intent_open_app_entity_appname"}]}
I have also checked the vocab file for the words in my training examples. They are present.
First, you should consider upgrading to the latest Rasa NLU version, 0.13.
Second, can you load the spaCy model using spacy.load and print a vector for a corresponding word? This tests whether the .vec file actually works after the transformation.
Third, try cross-validation on Rasa NLU to debug the model's performance. I believe there could be an issue with the vector file from fastText and your transformation to spaCy. I built one in Dutch and it worked fine (not the best outcome, but it was working).
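One more thing worth double-checking in the training data itself: entity `start`/`end` values must be Unicode code-point offsets such that `text[start:end]` equals the annotated `value`, and Devanagari combining marks (like the nukta in फ़) are easy to miscount. A quick stdlib-only check, using one of the example sentences from this thread:

```python
# Sample annotation from the training data above ("open Facebook")
text = "फ़ेसबुक खोलो"
value = "फ़ेसबुक"  # the annotated entity value

# Recompute the offsets instead of counting characters by hand
start = text.find(value)
end = start + len(value)

# The span recovered from the offsets must match the annotated value exactly
assert text[start:end] == value
```

Running this kind of check over every example catches off-by-one spans, which silently degrade ner_crf training.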
Hi @souvikg10, the spacy model is outputting the vectors for example words without error. So there is no issue with the vector file.
Initially, when I generated training data in Hindi, I set ensure_ascii=False when writing the JSON so that I could read the script in a text editor. If this parameter is not set, the text appears as Unicode escapes in the editor. However, I also trained the model without setting this parameter and still could not get any results.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please create a new issue if you need more help.
To support a new language, I know from the documentation that we can use spacy-sklearn or MITIE. Since the language support of these two packages is also limited, I am wondering: can we use fastText pre-trained word representation models and integrate them with rasa_nlu?
FastText models: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md