It is worth pointing out that spaCy is also working on this. If spaCy supports this then that would be the easier route of supporting it here, no? You can wrap the spaCy model as a python package and just import that in Rasa. We'd only need to support their protocol and we already do.
I fear that making our own EmbeddingFeaturizer would also require us to support many different formats. I could be wrong, but I recall there being different standards for how word embeddings are saved across the different academic teams who pretrain them (GloVe vs. fasttext etc.).
In the past at least, the only way to use fastText or other embeddings was to package them into a custom spaCy model, which is quite a bit of friction. If the spaCy folks make that much easier, then we don't need to do anything imo. Otherwise I think it might be worth adding this component so it's just much easier to use those 150+ languages.
I haven't tried it, but it seems pretty easy. We might still write a helper that wraps around this but it reads as if spaCy has this covered.
@tttthomasssss I might have time later today to explore this if it helps, I could try and see how easy it is to get fasttext working in whatlies
via spaCy?
@koaning that would be great yes!
But there are not really that many different formats that pre-trained vectors typically get distributed in. For example, the GloVe vectors from the Stanford website and the fasttext vectors come in the same (easy to parse) format. The original word2vec ones were binary files from C, but they can also fairly easily be parsed. So I don't think we'd be supporting lots of different formats.
I have a thin wrapper with loading utilities for both kinds here: https://github.com/tttthomasssss/wolkenatlas/blob/master/wolkenatlas/util/data_processing.py
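To illustrate that point: the GloVe .txt and fastText .vec downloads are plain whitespace-separated text, so a minimal loader is only a handful of lines. This is just a sketch (not the wolkenatlas implementation); the fastText files start with a "<count> <dim>" header line while the GloVe files do not:

```python
import numpy as np


def load_word_vectors(path, encoding="utf-8"):
    """Load GloVe-style .txt or fastText .vec embeddings into a dict of arrays."""
    vectors = {}
    with open(path, encoding=encoding) as f:
        first = f.readline().rstrip().split(" ")
        # fastText .vec files begin with a "<vocab_size> <dim>" header;
        # GloVe files jump straight into "<word> <v1> ... <vd>" rows
        if len(first) > 2:
            vectors[first[0]] = np.asarray(first[1:], dtype=np.float32)
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors
```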
As always there is a tradeoff between additional code we'd need to maintain vs. relying on a 3rd party to provide that functionality. In this case I think it's worth rolling our own for more flexibility. I mainly have use cases in mind like:
I've done some digging and here are my findings.
The embeddings are big but spaCy can read them in from the command line no problem it seems. I just downloaded the Dutch ones and got a spaCy model on disk via:
> python -m spacy init-model nl /tmp/dutch_vectors_wiki_lg --vectors-loc cc.nl.300.vec.gz
✔ Successfully created model
2000000it [02:55, 11413.26it/s]
✔ Loaded vectors from cc.nl.300.vec.gz
✔ Successfully compiled vocab
2000255 entries, 2000000 vectors
You can get the same effect in python by running:
import spacy
spacy.cli.init_model(lang=language, output_dir=output_dir, vectors_loc=vectors_loc)
Now this merely creates a model on disk. You could load it via spacy.load("/tmp/dutch_vectors_wiki_lg"), but odds are that you want to package it instead for production. I'm working on the guide for that here. Note that spaCy is dropping python -m spacy link and the path forward is to use python -m spacy package.
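For completeness: as far as I can tell, both steps can also be driven from Python. The paths below are placeholders and the exact spacy.cli signatures may differ between spaCy versions; the generated package folder still needs the usual sdist + pip install treatment afterwards.

```python
import spacy

# build the spaCy model directory from the fastText .vec.gz file (as above)
spacy.cli.init_model(lang="nl", output_dir="/tmp/fasttext_nl",
                     vectors_loc="cc.nl.300.vec.gz")

# wrap the model directory into an installable package skeleton;
# the output directory must already exist
spacy.cli.package("/tmp/fasttext_nl", "/tmp/packages")

# afterwards, from the shell:
#   cd /tmp/packages/<package-name> && python setup.py sdist
#   pip install dist/<package-name>.tar.gz
```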
On the side of whatlies I've now got a PR to support this here. The way that I've implemented it is via a classmethod on my Language class.
import os

import spacy


# excerpt: this classmethod lives on the Language class
@classmethod
def from_fasttext(cls, language, output_dir, vectors_loc=None, force=False):
    """
    Will load the fastText vectors. It will try to load from disk, but if there is no local
    spaCy model then we will first convert the vec.gz file into a spaCy model. This
    is saved on disk and then loaded as a spaCy model.

    Important:
        The fastText vectors are not shipped with this library.
        You can download the models [here](https://fasttext.cc/docs/en/crawl-vectors.html#models).
        Note that these files are big and loading them can take a long time.

    Arguments:
        language: language code so that spaCy can grab the correct tokenizer (example: "en" for English)
        output_dir: directory to save the spaCy model to
        vectors_loc: file containing the fastText vectors
        force: with this flag raised we will always recreate the model from the vec.gz file

    ```python
    > lang = SpacyLanguage.from_fasttext("nl", "/tmp/fasttext_nl", vectors_loc="~/Downloads/cc.nl.300.vec.gz")
    ```
    """
    if not os.path.exists(output_dir) or force:
        # (re)build the spaCy model from the raw fastText vectors
        spacy.cli.init_model(lang=language, output_dir=output_dir, vectors_loc=vectors_loc)
    return SpacyLanguage(spacy.load(output_dir))
Odds are you could do something similar for Rasa but I don't know the featurizer components well enough. I also don't know if the native fasttext libraries are faster and how they allow you to package everything together for production.
All in all, it *seems* quite possible with spaCy now but I'll gladly hear if I'm missing something.
Ah. I just found a caveat. It seems that fasttext, the python package, has a feature to reduce the dimensions of their vectors internally. This is something that spaCy does not support. For whatlies this is a good reason to support it as a separate language; if that is also the case here then it might make sense to write a custom component for it.
Screenshot from their docs:
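The screenshot is not reproduced here, but the feature in question is the dimension-reduction utility in the fasttext python package, which (per their docs) shrinks a loaded model in place, roughly like this:

```python
import fasttext
import fasttext.util

# load the full 300-dimensional binary model ...
ft = fasttext.load_model("cc.nl.300.bin")
print(ft.get_dimension())   # 300

# ... and reduce it in place, e.g. to 100 dimensions
fasttext.util.reduce_model(ft, 100)
print(ft.get_dimension())   # 100
```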
I am also just experimenting with the spacy-fasttext bridge. It works and isn't a lot of effort to get working (though I can only speak for the purpose of experimentation, not production).
A few of the caveats I have encountered are:
- Some languages require additional tokenisation libraries (e.g. PyThaiNLP), which need to be downloaded manually before they can be used from a rasa pipeline. This would be somewhat redundant within rasa as we already have an interface for tokenisers and the EmbeddingsFeaturizer should have nothing to do with it, i.e. there might be the case of a user doing low-level processing with library X, but being required to download library Y because spacy says it needs it (even though we only want embeddings and not e.g. PoS tagging as well).
- gensim supports embedding formats much more broadly (a small gensim illustration follows below).

I think that using spacy is a viable workaround for now, but it depends on whether we want to treat word embeddings as first-class citizens in the same way as we treat ConveRT or huggingface transformers. If we want them to be first-class citizens, we should probably roll our own (and I think the amount of maintainable code we'd be introducing can be kept small enough that it would be outweighed by the potential benefits). If we don't want that, then the workaround with spacy needs to be at least well documented.
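On the gensim point above, a quick illustration of the formats gensim reads out of the box; the paths are placeholders, and raw GloVe files need a one-off header conversion first:

```python
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# fastText .vec (text) and original word2vec .bin (binary) load directly
ft_vectors = KeyedVectors.load_word2vec_format("cc.nl.300.vec", binary=False)
w2v_vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# raw GloVe files lack the "<count> <dim>" header, so convert them once first
glove2word2vec("glove.6B.300d.txt", "glove.6B.300d.w2v.txt")
glove_vectors = KeyedVectors.load_word2vec_format("glove.6B.300d.w2v.txt", binary=False)
```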
I've wrapped up implementing fasttext for whatlies and I have a few impressions.
The fasttext project on github feels a bit wonky in terms of support. They thankfully just released a new version (an hour ago; it wasn't there six hours ago when I needed it). This was the first new release in a year. Using fasttext directly does seem to be a lot faster with regards to loading the vectors into memory.

Hello, I have moved from an older version of Rasa to version 1.9.5 and have the following issue: I wrote a custom featurizer component based on word2vec and sent2vec. Each sentence has a feature vector of length 576 (for example [0.12, …, 0.2]). In the previous version I used self._combine_with_existing_text_features to tell the embedding classifier to classify sentences by their word2vec features, but the newer version of Rasa doesn't have this method anymore; it has two methods: _combine_with_existing_sparse_features and _combine_with_existing_dense_features. When I used the _combine_with_existing_sparse_features method to concatenate the sparse (or dense) features together, I got the following error:
features = self._combine_with_existing_sparse_features(message, optio_features, 'text_sparse_features')
File "/home/optio/rasaprojects/rasaenv/lib/python3.6/site-packages/rasa/nlu/featurizers/featurizer.py", line 89, in _combine_with_existing_sparse_features
if message.get(feature_name).shape[0] != additional_features.shape[0]:
AttributeError: 'list' object has no attribute 'shape'
It seems that this method does not like my 576-dimensional feature vector.
Can you help me make the DIET classifier use my word2vec features for classification? Or can you give some examples of how to write a custom featurizer for the DIET classifier?
@3NFBAGDU Thanks for the issue description. Please ask your question next time on the forum as it does not belong to this issue.
As you already noticed, we divided our featurizers into sparse and dense featurizers (see docs). Your word2vec featurizer falls into the category of dense featurizers, so you should use the method _combine_with_existing_dense_features. We also introduced a __CLS__ token: all tokenizers add an additional CLS token to the end of the list of tokens when tokenizing text and responses. Make sure to consider that in your featurizer.
If you have any other issues or questions, please go to the forum and open a new thread over there. Thanks.
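To make that advice concrete, here is a rough, untested sketch of what such a dense featurizer could look like against the Rasa 1.9/1.10-era internals referenced in this thread. The class name FastTextFeaturizer, the "text_dense_features" key, and the mean-pooled __CLS__ vector are illustrative choices, not documented Rasa API; a gensim KeyedVectors model could be swapped in for the fasttext model just as easily.

```python
# Rough sketch only: assumes Rasa 1.9/1.10-style featurizer internals and the
# `fasttext` python package. FastTextFeaturizer and "text_dense_features" are
# illustrative names, not part of Rasa's documented public API.
from typing import Any, Optional

import numpy as np
import fasttext

from rasa.nlu.featurizers.featurizer import DenseFeaturizer
from rasa.nlu.training_data import Message, TrainingData


class FastTextFeaturizer(DenseFeaturizer):
    """Adds one dense vector per token, including the trailing __CLS__ token."""

    defaults = {"model_path": None}

    def __init__(self, component_config: Optional[dict] = None) -> None:
        super().__init__(component_config)
        # load the pre-trained binary model once when the component is created
        self.model = fasttext.load_model(self.component_config["model_path"])

    def _features_for(self, message: Message) -> np.ndarray:
        tokens = message.get("tokens", [])
        # one row per real token; the last token is the __CLS__ placeholder,
        # which we fill with the mean of the word vectors (a common choice)
        word_vecs = [self.model.get_word_vector(t.text) for t in tokens[:-1]]
        if word_vecs:
            cls_vec = np.mean(word_vecs, axis=0)
        else:
            cls_vec = np.zeros(self.model.get_dimension())
        return np.array(word_vecs + [cls_vec])

    def _set_features(self, message: Message) -> None:
        features = self._combine_with_existing_dense_features(
            message, self._features_for(message), "text_dense_features"
        )
        message.set("text_dense_features", features)

    def train(self, training_data: TrainingData, config: Any = None, **kwargs: Any) -> None:
        for example in training_data.training_examples:
            self._set_features(example)

    def process(self, message: Message, **kwargs: Any) -> None:
        self._set_features(message)
```

Persisting the model path and registering the component in the pipeline config are left out here; the main point is that, as far as I understand, one vector per token (plus the CLS vector) is the shape DIET expects from a dense featurizer.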
Hi,
Dear @tabergma, I represent the company where @3NFBAGDU works; we depend heavily on Rasa and have been using it for 2 years now. I understand that maybe the form of the question or its details seemed a little bit different, but I believe this is the same issue we are facing. So far we have built various custom components, one of which injects custom-trained (with gensim) word2vec & sent2vec models into the pipeline using _combine_with_existing_text_features.
With the new Rasa, we tried to upgrade this custom component but had no luck. Of course, we have already looked at the docs on sparse and dense featurizers, but could not make it work. We also asked this question on the forum more than a month ago (see it here) but didn't get any relevant answer (btw, the forum is not always the best place for "dev" type questions, from our observation).
That said, I believe this is the right place for the discussion, and our question (again, apologies if its form was not specific) is exactly about the same thing as this issue: how to add support for word2vec/fastText models (either pre-trained or custom trained) in the new Rasa pipeline. Correct me if I'm wrong.
Thanks
@tttthomasssss can we close this issue now? With the advent of Rasa NLU examples I think we're going to solve this problem there.
Description of Problem:
At the moment there is no support for plugging arbitrary embeddings into our pipeline. For example, GloVe is supported, but only because those embeddings are accessed via spaCy; there is no native support for using pre-trained word2vec, fastText, etc. embeddings, nor a way for users to plug in their own (e.g. when training them on a private corpus with gensim). Given that pre-trained fastText embeddings in particular are available in 150+ languages, this would provide quite a boost for better multilingual support.
Current Situation:
Currently, we have support for word-level embeddings via spaCy, but no native support for dense pre-trained word embeddings in rasa.
Overview of the Solution:
An implementation could follow the current featurizer architecture by adding an EmbeddingsFeaturizer to the existing set of modules.
Definition of Done: