facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

Why does fastText's algorithm get worse as the number of notes increases? #702

Open ssoheilmn opened 5 years ago

ssoheilmn commented 5 years ago

I'm using this command to train a model:

$ ./fasttext skipgram -input train.txt -output model

Once the model is trained, I use a Python script to find the top 50 most relevant words for a given word. When I train the model using the default settings, I get reasonable results with only 10,000 documents of training data. As I increase the number of documents to 100,000, 1,000,000, and 50,000,000, the results get worse and worse (the suggested words become irrelevant).
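(For illustration, an equivalent lookup with the official fasttext Python bindings would look like the sketch below; "patient" is just a placeholder query word, and my own script computes the similarities manually.)

import fasttext

model = fasttext.load_model('model.bin')  # the .bin file produced by the training command above
neighbors = model.get_nearest_neighbors('patient', k=50)  # list of (similarity, word) pairs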

I tried running the algorithm with a few changes to the hyper-parameters, but the results still become irrelevant as the corpus size grows.

I know the paper reports that results get slightly worse as the corpus size increases (Figure 1), but I'm wondering: 1) why does this happen in general; 2) why is it happening at such a large scale for me; and 3) is there any remedy?

EdouardGrave commented 5 years ago

Hi @ssoheilmn,

Thank you for your question!

What kind of data are you using to train your model? One potential explanation is that the supervised task you are using to learn the model is not a good proxy for learning a good similarity function between word vectors. Have you tried using unsupervised models such as cbow or skipgram to learn the word vectors?

Best, Edouard

ssoheilmn commented 5 years ago

Hi @EdouardGrave,

Thanks for the reply.

I am using fastText for an unsupervised task: reading clinical notes and creating word embeddings. I then find words similar to any given word using proximity in the word-embedding space. I have done the exact same thing with Word2Vec before, and that algorithm just got better and better as the number of documents increased.

Am I supposed to pass an argument other than "supervised" in order to train word embeddings with the command below?

$ ./fasttext supervised -input train.txt -output model

EdouardGrave commented 5 years ago

Hi @ssoheilmn,

In order to learn (unsupervised) word embeddings, you should use the skipgram (or cbow) subcommand, instead of supervised. For example, you can use:

$ ./fasttext skipgram -input train.txt -output model

There are various hyper-parameters that you can try changing to see their impact on performance.
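For example (the flag values below are only illustrative; running ./fasttext skipgram without arguments prints the full list of options and their defaults):

$ ./fasttext skipgram -input train.txt -output model -dim 300 -ws 10 -epoch 10 -minCount 10 -minn 3 -maxn 6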

Best, Edouard.

ssoheilmn commented 5 years ago

@EdouardGrave,

You are right. I just checked my code and realized I was indeed using the skipgram argument, exactly as you mentioned. I mistakenly reported the wrong argument when writing the question, and I have now fixed my initial post.

I have also played with a couple of parameters, such as -minn, -maxn, -dim, and -minCount, but none of them helped with this specific problem.

EdouardGrave commented 5 years ago

You can try setting -maxn to 0; you should then get results similar to word2vec.

ssoheilmn commented 5 years ago

Could you please explain why? The default value for maxn is 0, and with that setting I had the same issue of results getting worse as the corpus size increases. Wouldn't maxn = 0 mean no maximum, so character n-grams would still be applied, whereas Word2Vec does not consider character n-grams at all? I suppose I could set minn to a large number to make sure character n-grams are disabled, but doesn't that defeat the purpose of using fastText instead of Word2Vec?

EdouardGrave commented 5 years ago

No, the default for maxn is 6 for skipgram and cbow. Setting maxn to 0 means that no subwords will be used by fastText. The goal of this experiment is to make sure that you can reproduce word2vec results with fastText.
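Concretely, that would be:

$ ./fasttext skipgram -input train.txt -output model -maxn 0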

ssoheilmn commented 5 years ago

@EdouardGrave Thanks for working with me on this.

I'm a bit confused about the default values. Could you please tell me how you check the default values of the arguments? I was looking at the documentation on fastText's main GitHub page here, which says the default value is 0.

Also, I'm still interested in running fastText with subword training, so that I can generate more accurate results and get similar words for any given word, including out-of-vocabulary words. So the question remains: why does the fastText model get worse as the number of documents grows?

EdouardGrave commented 5 years ago

@ssoheilmn -- the default values are different for the supervised and unsupervised modes of fastText. The ones that are listed in the README correspond to the supervised mode. You can get the list of default values for a given command (e.g. skipgram) by running that command without any arguments:

$ ./fasttext skipgram

I understand that you are interested in using subwords :) I am just suggesting running the code without subwords as a way to debug things. Also, have you tried shuffling your data (see the example below)? I hope this is helpful. The model should not get worse when the training data size grows (I have never observed this, beyond small changes that are likely not statistically significant).
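If GNU coreutils is available, one simple way to shuffle the training file is:

$ shuf train.txt -o train.shuffled.txt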

ssoheilmn commented 5 years ago

Thanks @EdouardGrave. I'll run my code with maxn=0 and see how that goes.

ssoheilmn commented 5 years ago

@EdouardGrave It turned out that the training part was done right, but the search script I had implemented was not working correctly.

I was initially using the following code to calculate similarity:

cossims = np.matmul(allVectors, wordVector, out=None)

I borrowed this code from the find_nearest_neighbor() function, and I'm not sure whether the code itself is wrong or I was misusing it.

Using the cosine distance function from scipy.spatial solved the issue:

cosDist = distance.cosine(wordVec, word)
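For reference, a plain matmul ranks neighbors by raw dot product, which only matches cosine similarity if all the vectors are L2-normalized first. A minimal normalized version, reusing the variable names from the snippet above (allVectors is an (n_words, dim) matrix, wordVector is the (dim,) query vector):

import numpy as np

norms = np.linalg.norm(allVectors, axis=1, keepdims=True)
unitVectors = allVectors / np.maximum(norms, 1e-10)  # guard against zero-norm rows
query = wordVector / max(np.linalg.norm(wordVector), 1e-10)
cossims = unitVectors @ query  # true cosine similarities
top50 = np.argsort(-cossims)[:50]  # indices of the 50 nearest words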