Closed Aleyasen closed 5 years ago
Sounds like a great idea, but one of the brains (@tmbo @amn41) would have to comment. Either way, it would likely be welcomed as a community contribution.
yes! as of the latest major release (with spaCy 2.0 support) you can now use fastText vectors with Rasa. We will add some documentation on how to do that
That's awesome @amn41. Thanks for your reply. It'll be great if you can point to the documentation here.
Download the vectors for your language, then run the script below (from the spaCy website, slightly modified to save the model to disk):
#!/usr/bin/env python
# coding: utf8
"""Load vectors for a language trained using fastText
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals
import plac
import numpy
import spacy
from spacy.language import Language


@plac.annotations(
    vectors_loc=("Path to .vec file", "positional", None, str),
    lang=("Optional language ID. If not set, blank Language() will be used.",
          "positional", None, str),
    output_dir=("Output dir", "positional", None, str))
def main(vectors_loc, lang=None, output_dir=None):
    if lang is None:
        nlp = Language()
    else:
        # create an empty language class – this is required if you're planning
        # to save the model to disk and load it back later (models always need
        # a "lang" setting). Use 'xx' for the blank multi-language class.
        nlp = spacy.blank(lang)
    with open(vectors_loc, 'rb') as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()
        nlp.vocab.reset_vectors(width=int(nr_dim))
        for line in file_:
            line = line.rstrip().decode('utf8')
            pieces = line.rsplit(' ', int(nr_dim))
            word = pieces[0]
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab
    nlp.to_disk(output_dir)


if __name__ == '__main__':
    plac.call(main)
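For reference, the `.vec` format the script parses is plain text: a header line with the row count and vector dimensionality, then one word per line followed by its components. A minimal stdlib-only sketch of that parsing step, using a tiny made-up two-dimensional file:

```python
import io

# A made-up two-word, two-dimensional ".vec" file, for illustration only
fake_vec = io.StringIO(
    "2 2\n"
    "नमस्कार 0.12 -0.05\n"
    "धन्यवाद 0.33 0.07\n"
)

# Header line: number of rows, then vector dimensionality
nr_row, nr_dim = (int(n) for n in fake_vec.readline().split())

vectors = {}
for line in fake_vec:
    # rsplit from the right, at most nr_dim times, so everything left of the
    # last nr_dim fields is treated as the word
    pieces = line.rstrip().rsplit(" ", nr_dim)
    vectors[pieces[0]] = [float(v) for v in pieces[1:]]

assert len(vectors) == nr_row
```

The `rsplit(' ', nr_dim)` in the script above works the same way: it peels the vector components off the right so the remaining prefix is the word.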
Then you can build a package following instructions there: https://spacy.io/usage/training#section-saving-loading
@znat I ran this code and it created a folder containing meta.json, the tokenizer, and a vocab subfolder. What do I need to do after that?
@luisdemarchi Then you can build a package following instructions there: https://spacy.io/usage/training#section-saving-loading
@znat OK. I discovered another "engine" that can replace spaCy, called UDPipe. It supports more than 100 languages, and according to a benchmark discussed there it wins in all of them. Is it possible for you to support it as well?
The thing is, udpipe doesn't provide word vectors. So it can only be used for tokenization and POS tagging.
Hi @znat, I loaded the fastText vectors and packaged the model, but it did not provide any tagger or parser, and I think Rasa needs a tagger, as I get an error when trying to evaluate it with Rasa.
In meta.json I added ["tagger", "parser", "ner"] to the pipeline.
What am I missing?
Oh, ner_crf might insist on POS tags; we should check that out, because it should work without them.
Since 0.12, the vectors alone are not enough. I took an existing spaCy model and replaced the vectors.
So you added the vectors from fastText to an existing spaCy model? I suppose I can add the tagger from an existing spaCy model, then?
Just take an existing model with a POS tagger (the standard one), replace the vectors with the ones you built above, and repackage the model. That's what I did.
Cool. I initialized an empty tagger with some examples and added it to the pipeline. @amn41, did you take a look at why we need a tagger for the CRF?
we're currently running some experiments on CRFs without POS tagging, looking promising so will report back soon :)
It seems there are some hacks involved. Is there a proper guide for using fastText together with ner_crf? Why and how do I have to train the spaCy POS tagger, and which data do I use for that? Or am I misunderstanding something? I just want to use fastText, not train a POS tagger. Using the existing spaCy model seems right, but where and how do I replace the word vectors?
@ctrado18 you shouldn't have problems using ner_crf with any fastText vectors. Just add ner_crf to your pipeline.
From the spaCy forum I got this answer on using fastText vectors: https://spacy.io/usage/vectors-similarity#converting
But that seems different from what's described here... Why do you have to use the code from the second post above? I would like to have this in the docs. It is confusing, and I'm not familiar with spaCy...
Hi all, I am building a chatbot in the Hindi (hi) language using the Rasa stack with the spaCy pipeline. I used fastText to build my Hindi model and linked it with spaCy, and the linking was successful. Now, when I try to use the Hindi model, I get the following error:
C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\training_data\training_data.py:192: UserWarning: Intent 'greet' has only 1 training examples! Minimum is 2, training may fail.
  self.MIN_EXAMPLES_PER_INTENT))
C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\h5py\__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "nlu_model.py", line 18, in <module>
    train_nlu('./data/data.json', 'config_spacy.json', './models/nlu')
  File "nlu_model.py", line 8, in train_nlu
    trainer = Trainer(config.load(configs))
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\model.py", line 155, in __init__
    self.pipeline = self._build_pipeline(cfg, component_builder)
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\model.py", line 166, in _build_pipeline
    component_name, cfg)
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\components.py", line 441, in create_component
    cfg)
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\registry.py", line 142, in create_component_by_name
    return component_clz.create(config)
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\utils\spacy_utils.py", line 73, in create
    nlp = spacy.load(spacy_model_name, parser=False)
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "C:\Users\e01575\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\util.py", line 119, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'hi'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
I also don't know how to proceed with Hindi now: what configuration will I need, and should intent names be in Hindi or in English? The same question applies to the entities. Can someone help me out? My model was successfully linked with spaCy.
This is strange: it looks like you have installed and linked your Hindi model in the correct location. I still suspect, though, that this is an installation issue.
By the way, fastText vectors are super useful but you can also consider using the tensorflow pipeline https://rasa.com/docs/nlu/languages/ which doesn't require a language model
@amn41 Thanks for the quick help. I can see that it says you can use any language with the tensorflow pipeline: `Rasa NLU can be used to understand any language, but some backends are restricted to specific languages. The tensorflow_embedding pipeline can be used for any language, because it trains custom word embeddings for your domain.`
But I can't find any help on how exactly to use it for a specific language like Hindi. There is no tutorial that uses any language other than English. Can you help me get started with Hindi using the tensorflow pipeline? That will work for me.
Sure - you don't need to do anything special actually, just use a configuration like:
language: "hi"
pipeline: "tensorflow_embedding"
(the language parameter doesn't actually do anything here). You can check out this cool post from @souvikg10 for an example
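For reference, `tensorflow_embedding` is a shortcut for a full pipeline. From memory of the 0.13 docs it expands to roughly the following (treat this as a sketch to adapt, not an exact copy of the docs):

```yaml
language: "hi"

pipeline:
- name: "tokenizer_whitespace"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
```

Because the tokenizer is whitespace-based and the embeddings are trained on your own data, nothing in it is English-specific, which is why the `language` value is effectively ignored.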
Hi @amn41, I was following the post and trying to build something out of it, but I am stuck again on the configuration. In English, if the intent is "greet", then utter_greet will be the action, and we can pass text for this action like "Hello... how can I help you?". But in Hindi, how do I name the utter actions? One more thing: our requirement is to not write sentences with English letters; we want to use Devanagari only, like "नमस्कार", not "namaskar". Check my configuration below. My config file:
my data.json file for Hindi
{
  "rasa_nlu_data": {
    "common_examples": [
      { "text": "प्रणाम", "intent": "नमस्कार", "entities": [] },
      { "text": "नमस्कार जी", "intent": "नमस्कार", "entities": [] },
      { "text": "आप को मेरा नमस्कार!", "intent": "नमस्कार", "entities": [] },
      { "text": "आपको मेरा प्रणाम", "intent": "नमस्कार", "entities": [] }
    ]
  }
}
Now I have a doubt about my domain file: how do I define intents and the other things? Check out the file below.
My doubt here is how to name my action. Can it be "utter_नमस्कार"? As far as I understood, the naming convention is that if the intent is "greet", the action will be "utter_greet".
Sorry, I think I missed something. I am not clear on how to name intents and intent actions. If it can't be "utter_नमस्कार", what should I use instead? Or shall I go with names in English and content in Hindi? For example, the data.json file would look like:
and my domain file would look like:
Can I follow this approach?
Sorry, I am troubling you a lot with such silly doubts, but I don't have an example to follow and I don't want to take a wrong approach. I really appreciate your help. Thanks a lot for everything.
Sorry if my answer wasn't clear. The names of your intents, actions, and utterances may be written in any language; there are no restrictions there. As long as your YAML is well-formatted, this will work. So you can have "utter_greet", "utter_नमस्कार", or just "नमस्कार". But yes, if you want these to be recognised as UtterActions, then they should start with utter_.
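A minimal domain sketch along those lines (the names and the utterance text here are purely illustrative; the Hindi reply means roughly "Hello! How can I help you?"):

```yaml
intents:
  - नमस्कार

templates:
  utter_नमस्कार:
    - text: "नमस्कार! मैं आपकी मदद कैसे कर सकता हूँ?"

actions:
  - utter_नमस्कार
```

The only hard requirement is the `utter_` prefix; everything after it can be Devanagari.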
Thanks a lot @amn41 You really helped me a lot. Thanks a ton.....
Hi. Thanks for the help on loading a fastText model. I now have the model ready at a path on disk. I'm wondering what to specify in the 'language' setting of the config file when training the model.
@Harish238 You just need to specify the path of your language model
Thanks for the help, @souvikg10! I successfully trained a Hindi NLU model by specifying the path of the model in the config file. However, the example sentence I provided to test the model is not being recognised, even though the same sentence is present in the training data.
Here is a snapshot of my config file -
{
  "pipeline": "spacy_sklearn",
  "language": "/home/user/project/word2vec_models/hindi/hi_model-0.0.0/hi_model/hi_model-0.0.0",
  "path": "./models/nlu",
  "data": "./training_data/rasa_training_data.json"
}
Here are some of the example sentences from the training data -
{"text": "फ़ेसबुक खोलो", "intent": "intent_open_app", "entities": [{"start": 0, "end": 8, "value": "फ़ेसबुक", "entity": "intent_open_app_entity_appname"}]},
{"text": "ओपन फ़ेसबुक", "intent": "intent_open_app", "entities": [{"start": 4, "end": 12, "value": "फ़ेसबुक", "entity": "intent_open_app_entity_appname"}]}
I have also checked the vocab file for the words in my training examples. They are present.
First, you should consider upgrading to the latest Rasa NLU version, 0.13.
Second, can you load the spaCy model using spacy.load and print a vector for a corresponding word? This tests whether the .vec file actually works after the transformation.
Third, try cross-validation on Rasa NLU to debug the model's performance. I believe there could be an issue with the vector file from fastText and your transformation to spaCy. I built one in Dutch and it worked fine (not the best outcome, but it was working).
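One more thing worth double-checking in the training data itself: entity `start`/`end` values must be Unicode code-point offsets such that `text[start:end]` equals the annotated `value`, and Devanagari combining marks (like the nukta in फ़) are easy to miscount. A quick stdlib-only check, using one of the example sentences from this thread:

```python
# Sample annotation from the training data above ("open Facebook")
text = "फ़ेसबुक खोलो"
value = "फ़ेसबुक"  # the annotated entity value

# Recompute the offsets instead of counting characters by hand
start = text.find(value)
end = start + len(value)

# The span recovered from the offsets must match the annotated value exactly
assert text[start:end] == value
```

Running this kind of check over every example catches off-by-one spans, which silently degrade ner_crf training.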
Hi @souvikg10, the spacy model is outputting the vectors for example words without error. So there is no issue with the vector file.
Initially, when I generated training data in Hindi, I set ensure_ascii=False when writing the JSON so that I could read the script in a text editor. If this parameter is not set, the text appears as Unicode escapes in the editor. However, I also trained the model without setting this parameter and still could not get any results.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please create a new issue if you need more help.
To support a new language, I know from the documentation that we can use spacy-sklearn or MITIE. Since the language support of these two packages is also limited, I am wondering: can we use fastText pre-trained word representation models and integrate them with rasa_nlu?
FastText models: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md