explosion / sense2vec

🦆 Contextually-keyed word vectors
https://explosion.ai/blog/sense2vec-reloaded
MIT License

Can we do transfer learning on a closed-domain dataset? #114

Closed. deepankar27 closed this issue 3 years ago.

deepankar27 commented 4 years ago

Can we use this existing model for transfer learning on a closed-domain dataset?

Nina-Xu-Guru commented 3 years ago

I was wondering the same thing. Would the authors consider publishing the trained model parameters, in addition to the word vectors, so we could do transfer learning on small data sets?

polm commented 3 years ago

What model parameters are you referring to? The word vectors are the only output of the model.

As to whether you can use this for transfer learning: that's possible in the same way it is with any word vectors, though unless your data is similar to Reddit in some way, you might be better off training from scratch.

Nina-Xu-Guru commented 3 years ago

Thank you @polm. I was referring to the neural net architecture and learned weights that essentially take a token and spit out the word vector. I was under the impression that we need those to do transfer learning. If I'm mistaken, I'd appreciate it if you could point me to the documentation on how to do transfer learning for s2v!

I'm considering transfer learning because we don't have 1 billion words. Our sample size is much smaller.

polm commented 3 years ago

The way sense2vec works is that it preprocesses text to handle both tokenization and sense labelling of tokens. You can see that in the preprocessing script. Here's example output of the process:

```
Rats|NOUN ,|PUNCT mould|NOUN and|CCONJ broken_furniture|NOUN :|PUNCT
the|DET scandal|NOUN of|ADP the|DET UK|GPE 's|PART refugee_housing|NOUN
```
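
For a rough idea of what that script does, here's an illustrative spaCy sketch, not the actual code; the pipeline name is a placeholder and the exact output will vary with the model:

```python
# Minimal sketch of sense2vec-style preprocessing with spaCy.
# Illustrative only; see the repo's preprocessing script for the real version.
import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline with a parser and NER works
nlp.add_pipe("merge_entities")      # collapse named entities into one token
nlp.add_pipe("merge_noun_chunks")   # collapse noun phrases into one token

def make_key(token):
    # Join multi-word tokens with underscores and append the "sense":
    # the entity label if there is one, otherwise the coarse POS tag.
    text = token.text.replace(" ", "_")
    sense = token.ent_type_ or token.pos_
    return f"{text}|{sense}"

doc = nlp("Rats, mould and broken furniture: the scandal of the UK's refugee housing")
print(" ".join(make_key(t) for t in doc))
```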

This output is then fed to a typical word vector training algorithm, like GloVe or fastText. There is no sense2vec-specific model.
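
The repo's own scripts cover this step using GloVe and fastText; just as a rough stand-in, here's what it could look like with gensim's fastText implementation (the file name and hyperparameters below are placeholders, not the project's defaults):

```python
# Rough sketch: train fastText vectors on the preprocessed corpus with gensim.
# "preprocessed.txt" is a placeholder for the output of the step above,
# one whitespace-separated sentence of token|SENSE keys per line.
from gensim.models import FastText

with open("preprocessed.txt", encoding="utf8") as f:
    sentences = [line.split() for line in f]

model = FastText(sentences, vector_size=300, window=5, min_count=10, workers=4)
model.wv.save("s2v.kv")  # keyed vectors: one entry per token|SENSE key
```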

> I'm considering transfer learning because we don't have 1 billion words. Our sample size is much smaller.

You do not need one billion words to train a useful model. I would recommend at least trying to train vanilla word vectors and sense2vec on your data to see how well they work.
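
Once you've trained vectors and converted them to the sense2vec format, a quick sanity check is the query API from the README (the path and query key below are placeholders):

```python
from sense2vec import Sense2Vec

# Load vectors exported to the sense2vec format; the path is a placeholder.
s2v = Sense2Vec().from_disk("/path/to/your_vectors")

query = "refugee_housing|NOUN"  # any key that occurs in your training data
if query in s2v:
    print(s2v.most_similar(query, n=3))
```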

If you want to do transfer learning on the pretrained vectors anyway, there's no special process for that; it works like it would with any other word vectors. The main difficulty is that, because of the preprocessing, they won't be compatible with normal tokenizers.
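
To make that concrete: keys always carry the `|SENSE` suffix, and multi-word spans are merged with underscores, so tokens from an ordinary tokenizer never match. A small sketch (the path is a placeholder, and whether a given key exists depends on the corpus):

```python
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")  # placeholder path

# Bare tokens from a normal tokenizer are never keys: every key has the
# "|SENSE" suffix, and multi-word spans were merged during preprocessing.
print("housing" in s2v)                # False
print("refugee_housing|NOUN" in s2v)   # True only if the key was in the corpus
```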