facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License
25.75k stars 4.7k forks

-Nan(Ind) Sentence Vector #343

Open creat89 opened 6 years ago

creat89 commented 6 years ago

Hello,

I'm using fastText and, for one document, I get a vector full of -nan(ind) when I use the print-sentence-vectors option. However, if I ask for the vector of each word with print-word-vectors, every word has a numerical vector. What could the problem be? Any idea where I should look in order to give you a better description of the problem?

The fastText model was trained by me (using the unsupervised method) with 300 dimensions. The document has 3478 words.

creat89 commented 6 years ago

I have found the reason. There is one word whose vector is all zeros. Once I delete that word, the sentence gets a numeric vector instead of -nan(ind). Why does this zero word vector affect the calculation of the sentence vector? How can I change this behavior?

cpuhrsch commented 6 years ago

Hello @creat89,

Thank you for your post. In order to reproduce the issue on my end, could you please post the full set of commands used to train the model and trigger the error you describe?

Thank you, Christian

creat89 commented 6 years ago

Hello @cpuhrsch ,

This is one set of hyperparameters that causes the error:

-lr 0.062098028681721665 -dim 200 -wordNgrams 1 -minCount 3 -epoch 10 -minn 6 -maxn 6

The error happens with both cbow and skipgram. (The parameters may not look ideal, but I'm using Bayesian optimization to find the best combination.)

cpuhrsch commented 6 years ago

Hello @creat89,

I've not been able to reproduce your issue, but I think having a word with an associated zero vector might be the problem here. If I recall correctly there was an issue around that a long time ago. Could you try this again with a recent version of fastText and let me know if this resolves your issue?

Thanks, Christian
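
For readers wondering why a single all-zero word vector can turn the whole sentence vector into NaN, here is a minimal sketch. It assumes the sentence vector is the average of L2-normalized word vectors; the function name `sentence_vector` and the `skip_zero_norm` flag are hypothetical and are not fastText's API. Dividing by a zero norm gives an infinite scale factor in IEEE 754 arithmetic, and 0 times infinity is NaN, which then propagates through the sum. Python raises an exception on float division by zero, so the sketch substitutes `math.inf` to mimic what C or C++ code would compute.

```python
import math


def sentence_vector(word_vectors, skip_zero_norm):
    """Average of L2-normalized word vectors (hypothetical helper, not fastText's API)."""
    dim = len(word_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec in word_vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            if skip_zero_norm:
                continue  # guarded variant: ignore words whose vector is all zeros
            # Python raises ZeroDivisionError for 1.0 / 0.0, so use inf
            # explicitly to mimic IEEE 754 float division in C/C++.
            scale = math.inf
        else:
            scale = 1.0 / norm
        for i, x in enumerate(vec):
            total[i] += x * scale  # 0.0 * inf evaluates to NaN
        count += 1
    if count == 0:
        return total  # no usable words: return the zero vector
    return [t / count for t in total]


# Toy 2-dimensional "word vectors"; the second one is all zeros.
words = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]]
unguarded = sentence_vector(words, skip_zero_norm=False)
guarded = sentence_vector(words, skip_zero_norm=True)
print(unguarded)  # every component is NaN
print(guarded)    # finite values, the zero vector was skipped
```

If upgrading fastText is not an option, a workaround in the same spirit is to filter out words whose vector is all zeros before computing the sentence vector, which matches the observation above that deleting the offending word restores a numeric result.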