RasaHQ / rasa-nlu-examples

This repository contains examples of custom components for educational purposes.
https://RasaHQ.github.io/rasa-nlu-examples/
Apache License 2.0

HashingVectorizer #148

Closed koaning closed 2 years ago

koaning commented 3 years ago

In an attempt to deal with the explosion of spelling error tokens, we may want to explore "the Hashing trick" some more. Inspired by spaCy, we may have a lot to gain from adding multiple HashingVectorizers in the pipeline.

We can build on top of sklearn's HashingVectorizer.
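
For reference, a minimal sketch of that sklearn feature (the `n_features` value here is illustrative): it hashes tokens into a fixed number of buckets, so misspelled tokens still land in some bucket instead of blowing up a vocabulary.

```python
# Minimal sketch of sklearn's HashingVectorizer. It needs no fitted
# vocabulary: every token is hashed into one of n_features buckets,
# so unseen spelling variants still receive a feature slot.
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**14)
X = vectorizer.transform(["helo wrold", "hello world"])
print(X.shape)  # (2, 16384); sparse output, and no vocabulary is stored
```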

mleimeister commented 2 years ago

Hi @koaning, I could try to start working on this component. Just to clarify, should the desired featurizer basically wrap sklearn's HashingVectorizer and use a single hash function to produce static, non-trainable embeddings? Or would the aim be to use a combination of hash functions and a trainable embedding table, as in Bloom embeddings / spaCy's HashEmbed?

koaning commented 2 years ago

The goal would indeed be to wrap around the scikit-learn HashingVectorizer.

The thing is ... given such a featurizer, DIET will take care of the "trainable embeddings table". It's explained in more detail on our forum here. The sparse input will be turned into a dense representation as a side effect of training DIET.

Given a sparse input, it passes through a dense "embedding" layer, and only the non-zero indices of the sparse array are picked up.

Hence, even with a HashingVectorizer, we can omit a separate vector table because DIET takes care of it.
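
To illustrate the idea (this is a sketch in numpy/scipy, not DIET's actual code): multiplying a sparse row vector with a dense weight matrix is the same as summing the weight rows at the non-zero indices, which is why a hashed sparse featurizer effectively gets a trainable embedding for free.

```python
# Sketch, not DIET's actual implementation: a dense layer applied to a
# sparse input is equivalent to summing the embedding rows at the
# non-zero indices of that input.
import numpy as np
import scipy.sparse as sp

n_buckets, emb_dim = 16, 4
W = np.random.randn(n_buckets, emb_dim)  # trainable weights of the dense layer

# sparse feature vector with non-zero entries at hash buckets 3 and 9
x = sp.csr_matrix(([1.0, 2.0], ([0, 0], [3, 9])), shape=(1, n_buckets))

dense = x @ W                      # dense "embedding" of the sparse input
manual = 1.0 * W[3] + 2.0 * W[9]   # only the non-zero indices contribute
assert np.allclose(dense, manual)
```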

koaning commented 2 years ago

The only awkward thing with the scikit-learn implementation is that you cannot tell it to hash a word three times; it can only hash once. So if you really want to use the Bloom trick from spaCy, you're going to have to add multiple HashingVectorizers, each with a different bucket size. This isn't a huge issue for us, since our config system allows for this.
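
For illustration, a sketch of stacking several vectorizers with sklearn's FeatureUnion (the bucket sizes are illustrative; using different, ideally co-prime, sizes means the collision patterns differ even though sklearn applies a single hash function):

```python
# Sketch of approximating the Bloom trick with sklearn: several
# HashingVectorizers with different bucket counts, concatenated.
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import HashingVectorizer

union = FeatureUnion([
    ("hash_small", HashingVectorizer(n_features=1024)),
    ("hash_medium", HashingVectorizer(n_features=4099)),
    ("hash_large", HashingVectorizer(n_features=16411)),
])
X = union.fit_transform(["helo wrold"])  # concatenated sparse features
```

A word that collides with another word in one vectorizer is unlikely to collide in all of them, which is the effect the Bloom trick is after.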

koaning commented 2 years ago

One thing, @mleimeister: right now the repository only supports Rasa 2.x. Feel free to build your component against the 3.x release candidate.

Also feel free to check with Adam and Daksh ... it may also be nice to think about a small benchmark using this component.

mleimeister commented 2 years ago

Hi @koaning, a first draft implementation is ready here and runs with Rasa 2.8.12: https://github.com/RasaHQ/rasa-nlu-examples/pull/153

While working on it, some questions came up that would be great to discuss. They mainly stem from looking at the implementation of Rasa's CountVectorsFeaturizer and which parts of that might be relevant here as well. Some points I wasn't sure about (each is quoted in the reply below):

Thanks for any input. Let me know if I should provide more info or it would be better to discuss in person :)

koaning commented 2 years ago

Rasa 2.8.12

It's fine to target 2.8.12 for now, but we will need to port it to Rasa 3.0 soon, so it's good to keep in mind that this will change.

Sklearn's HashingVectorizer does its own tokenization

I could be wrong, but I think we just look at the string of each token to add features for that token. We then do a separate pass over the full text to get count vectors for the entire utterance as well. I'd certainly want the analyzer parameter to be supported, but I'm not sure the token_pattern is configurable in our CountVectorsFeaturizer.
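
To make that concrete, a sketch of the two passes (parameter values are illustrative, and this is not the actual component code):

```python
# Sketch: the same HashingVectorizer applied twice, once per token and
# once on the whole utterance.
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2048, analyzer="char_wb", ngram_range=(2, 4))

tokens = ["helo", "wrold"]  # tokens as produced by a Rasa tokenizer
token_features = vectorizer.transform(tokens)  # one sparse row per token
sentence_features = vectorizer.transform([" ".join(tokens)])  # one row for the utterance
```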

The SpacyTokenizer adds lemmas of words as separate tokens. Would we also want a use_lemmas parameter for this component that decides whether the lemma or the full token is vectorized?

That's a good question. For an initial implementation, I'd say it's fine to skip it, though; the use_lemmas feature isn't used that often.

The featurizers in rasa-nlu-examples currently vectorise the DENSE_FEATURIZABLE_ATTRIBUTES.

I think this is fine for the initial implementation. The main "theory" that we want to confirm is whether or not this works for intent classification. Once that is confirmed, we might move on to more features.