Hi @mino98,
Yes, you are correct: word n-grams are only used in supervised mode.
The reason is that when training unsupervised models on large amounts of data, the number of n-grams is extremely large, and most of them are not informative (e.g. all the bigrams of the form `the *`). Thus, using word n-grams does not significantly improve the quality of the learned models.
One way to address this issue is to only consider "informative" n-grams, such as `New York City`. This can be done by only keeping n-grams made of words with high mutual information. This technique is described in Section 4 of the paper Distributed Representations of Words and Phrases and their Compositionality, and we used it to train our latest English word representations (see Section 2.3 of the paper Advances in Pre-Training Distributed Word Representations for more information). In particular, we replaced high mutual information phrases such as `New York City` by `New_York_City` with a probability of 0.5 in the training data. We are thinking about making this part of the fastText tool.
Please re-open this issue if you have additional questions!
Best, Edouard
It's good thinking!
Hello @EdouardGrave. I am a bit confused about the usage of embeddings for phrases/collocates. The question is this: in this paper https://arxiv.org/abs/1712.09405 it is mentioned:

> We plan to release the model containing all the phrases in the near future

So, do the latest English models on the fasttext.cc website contain embeddings for phrases like, for example, `New_York` or `United_States`, or not? Thank you.
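One way to check this for a downloaded model is to look at whether the merged tokens appear in its vocabulary, for example with the official fastText Python bindings. This is only a sketch: the file name `cc.en.300.bin` is just the English Common Crawl model from fasttext.cc used as an example, and the two tokens tested are mine.

```python
import fasttext

# Check whether phrase tokens made it into the released model's vocabulary.
# Adjust the path to whatever .bin file you downloaded from fasttext.cc.
model = fasttext.load_model("cc.en.300.bin")

vocab = set(model.get_words())
for token in ["New_York", "United_States"]:
    if token in vocab:
        print(f"{token}: present in the model vocabulary")
    else:
        # fastText still returns a vector for any string, but for an
        # out-of-vocabulary token it is built purely from character n-grams,
        # not from a learned phrase embedding.
        print(f"{token}: not in vocabulary (vector would come from subword n-grams)")
```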
Hi, quick question.
I see that word n-grams (i.e., `-wordNgrams`) are only used in supervised mode, and not in cbow or skipgram. Is there a reason for this?
The documentation is not clear on this point, but the code calls `addWordNgrams()` only here, inside the `Dictionary::getLine()` used by supervised training, and not here, in the equivalent `Dictionary::getLine()` used by unsupervised methods. Thanks.
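To make the question concrete, here is a rough Python sketch of what the supervised path does with word n-grams: each n-gram of consecutive tokens is hashed into a fixed number of extra feature buckets that sit after the real vocabulary. The function name, arguments and printed example are mine, not fastText's API; the rolling-hash constant mirrors the one in fastText's dictionary code, but the rest is simplified.

```python
def add_word_ngrams(word_ids, n, nwords, bucket):
    """Sketch of the hashing trick behind -wordNgrams: every n-gram of
    consecutive word ids is hashed into one of `bucket` feature slots
    appended after the `nwords` vocabulary entries. Treat this as an
    illustration rather than the exact implementation."""
    features = list(word_ids)
    for i in range(len(word_ids)):
        h = word_ids[i]
        for j in range(i + 1, min(i + n, len(word_ids))):
            h = h * 116049371 + word_ids[j]
            features.append(nwords + h % bucket)
    return features

# Example: the supervised classifier sees unigram ids plus hashed bigram ids.
print(add_word_ngrams([3, 17, 42], n=2, nwords=100000, bucket=2000000))
```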