facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

wordNgrams in unsupervised mode (cbow and skipgram) #499

Closed mino98 closed 6 years ago

mino98 commented 6 years ago

Hi, quick question.

I see that word n-grams (i.e., -wordNgrams) are only used in supervised mode, and not in cbow or skipgram.

Is there a reason for this?

The documentation is not clear on this point, but the code calls addWordNgrams() only in the Dictionary::getLine() overload used by supervised training, and not in the equivalent Dictionary::getLine() overload used by the unsupervised methods.
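To illustrate what I mean, here is a minimal sketch using the Python bindings (file names and hyperparameter values are just placeholders):

```python
import fasttext

# Supervised mode: wordNgrams is honored and adds word n-gram features.
clf = fasttext.train_supervised(input="train.txt", wordNgrams=2)

# Unsupervised mode (cbow/skipgram): only character n-grams (minn/maxn) are used;
# a wordNgrams setting appears to have no effect here.
emb = fasttext.train_unsupervised(input="corpus.txt", model="skipgram", minn=3, maxn=6)
```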

Thanks.

EdouardGrave commented 6 years ago

Hi @mino98,

Yes, you are correct: word n-grams are only used in supervised mode.

The reason is that when training unsupervised models on large amounts of data, the number of word n-grams is extremely large and most of them are not informative (e.g. all the bigrams of the form the *). As a result, using word n-grams does not significantly improve the quality of the learned models.

One way to address this issue is to only consider "informative" n-grams, such as New York City. This can be done by keeping only n-grams made of words with high mutual information. This technique is described in Section 4 of the paper Distributed Representations of Words and Phrases and their Compositionality, and we used it to train our latest English word representations (see Section 2.3 of the paper Advances in Pre-Training Distributed Word Representations for more information). In particular, we replaced high mutual information phrases such as New York City with New_York_City, with probability 0.5 per occurrence in the training data. We are thinking about making this part of the fastText tool.
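As a rough sketch of that scoring idea (simplified, not the exact pipeline used for the released vectors; delta, the threshold, and the toy corpus below are only illustrative):

```python
import random
from collections import Counter

def phrase_corpus(sentences, delta=1.0, threshold=1e-4, keep_prob=0.5):
    """Merge high-scoring bigrams into single tokens like New_York.

    score(a, b) = (count(a b) - delta) / (count(a) * count(b)), as in Section 4 of
    Mikolov et al. (2013); bigrams above `threshold` are merged with probability keep_prob.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))

    def score(a, b):
        return (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent) - 1:
            a, b = sent[i], sent[i + 1]
            if score(a, b) > threshold and random.random() < keep_prob:
                out.append(a + "_" + b)  # e.g. "New" "York" -> "New_York"
                i += 2
            else:
                out.append(a)
                i += 1
        if i == len(sent) - 1:
            out.append(sent[-1])
        merged.append(out)
    return merged

# Toy example: each sentence is a list of tokens.
print(phrase_corpus([["I", "love", "New", "York"], ["New", "York", "is", "big"]]))
```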

Please re-open this issue if you have additional questions!

Best, Edouard

eduamf commented 4 years ago

That's good thinking!

stelmath commented 4 years ago

Hello @EdouardGrave. I am a bit confused about the usage of embeddings for phrases/collocates. My question is this: the paper https://arxiv.org/abs/1712.09405 mentions:

We plan to release the model containing all the phrases in the near future

So, do the latest English models on the fasttext.cc website contain embeddings for phrases such as New_York or United_States, or not? Thank you.
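One way to check this empirically, assuming the Python bindings and one of the downloaded .bin models (the file name below is a placeholder), is to look for the phrase tokens in the model's vocabulary:

```python
import fasttext

# Placeholder path: substitute the actual .bin file downloaded from fasttext.cc.
model = fasttext.load_model("cc.en.300.bin")

vocab = set(model.get_words())
for token in ["New_York", "United_States"]:
    # Note: get_word_vector() returns a vector for any string via character n-grams,
    # so membership in the vocabulary is the meaningful test here.
    print(token, "in vocabulary:", token in vocab)
```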