ssoheilmn opened 5 years ago
Hi @ssoheilmn,
Thank you for your question!
What kind of data are you using to train your model? One potential explanation is that the supervised task you are using to learn the model is not a good proxy to learn a good similarity function between word vectors. Have you tried using unsupervised models such as cbow or skipgram to learn the word vectors?
Best, Edouard
Hi @EdouardGrave,
Thanks for the reply.
I am using fastText in an unsupervised task of reading clinical notes and creating word embeddings. Then I can find similar words to any given word using proximity in the word embedding space. I have done the exact same thing using Word2Vec before, and that algorithm just got better and better as the number of documents increased.
Am I supposed to pass an argument other than "supervised" in order to train word embeddings with the command below?
$ ./fasttext supervised -input train.txt -output model
Hi @ssoheilmn,
In order to learn (unsupervised) word embeddings, you should use the skipgram (or cbow) subcommand, instead of supervised. For example, you can use:
$ ./fasttext skipgram -input train.txt -output model
There are various hyper-parameters that you can try changing, to see the impact on performance:
- -minn and -maxn. I suggest trying something like -minn 4 -maxn 6;
- -dim. The default is 100, and I suggest trying values in the range 100-300;
- -neg. The default is 5, and I suggest trying values in the range 5-20;
- -epoch. The default is 5, and I suggest trying values in the range 5-50.
Best, Edouard.
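For instance, the suggested ranges above can be turned into a small grid of training commands to sweep. This is a hedged sketch, not from the thread: the output naming scheme and the choice of sample points within each range are my own.

```python
from itertools import product

# Sample points taken from the suggested ranges above.
dims = [100, 200, 300]
negs = [5, 10, 20]
epochs = [5, 25, 50]

# Build one fastText skipgram invocation per hyper-parameter combination.
commands = [
    f"./fasttext skipgram -input train.txt -output model_d{d}_n{n}_e{e}"
    f" -minn 4 -maxn 6 -dim {d} -neg {n} -epoch {e}"
    for d, n, e in product(dims, negs, epochs)
]
# 27 commands in total, e.g. the first one is:
# ./fasttext skipgram -input train.txt -output model_d100_n5_e5 -minn 4 -maxn 6 -dim 100 -neg 5 -epoch 5
```

Each resulting model can then be evaluated on the same nearest-neighbor queries to see which setting holds up as the corpus grows.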
@EdouardGrave,
You are right. I just checked my code and realized I was using the skipgram argument, exactly as you mentioned. I made a mistake reporting the wrong argument when I was writing the question. I just fixed my initial post.
I have also played with a couple of parameters such as -minn, -maxn, -dim, and -minCount, but none helped with this specific problem.
You can try setting maxn to 0; you should get results similar to word2vec.
Could you please explain why? The default value for maxn is 0, and with that value I had the same issue of results getting worse as the corpus size increases. Wouldn't maxn = 0 mean no maximum, so that character n-grams would still be applied, while Word2Vec does not consider character n-grams at all? I suppose I could set minn to a large number to make sure character n-grams are disabled, but doesn't that defeat the purpose of using fastText instead of Word2Vec?
No, the maxn default is 6 for skipgram and cbow. Setting maxn to 0 means that no subwords will be used by fastText. The goal of this experiment is to make sure that you can reproduce word2vec results with fastText.
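To make the maxn = 0 behavior concrete, here is a hedged sketch of how character n-gram extraction works; the '<' and '>' boundary markers follow the fastText paper, but the function itself is mine, not fastText's actual implementation.

```python
def char_ngrams(word, minn, maxn):
    # fastText pads the word with boundary markers before extracting
    # character n-grams of every length from minn to maxn.
    padded = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

# char_ngrams("where", 3, 3) -> ['<wh', 'whe', 'her', 'ere', 're>']
# char_ngrams("where", 3, 0) -> []   (maxn below minn: no subwords at all)
```

When maxn is 0, the range of n-gram lengths is empty, so every word is represented by its word vector alone, which is exactly the word2vec setting.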
@EdouardGrave Thanks for working with me on this.
I'm a bit confused about the default values. Could you please tell me how you check the default values for the arguments? I was looking at the documentation on fastText's main GitHub page here, which mentions that the default value is 0.
Also, I'm still interested in running fastText with subword training, to be able to generate more accurate results and to find similar words for any given word, including out-of-vocabulary words. So the question remains: why does the fastText model get worse as the number of documents grows?
@ssoheilmn -- the default values are different for the supervised and unsupervised modes of fastText. The ones that are listed in the README correspond to the supervised mode. You can get the list of default values for a given command (e.g. skipgram) by running that command without any arguments:
$ ./fasttext skipgram
I understand that you are interested in using subwords :) I am just suggesting to run the code without subwords as a way to debug things. Also, have you tried shuffling your data? I hope this is helpful. The model should not get worse when the training data size grows (I have never observed this, beyond small changes that are likely not statistically significant).
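The shuffling suggestion can be sketched as follows; this is a hedged sketch, where the file names in the usage note are placeholders, not from the thread.

```python
import random

def shuffle_training_lines(lines, seed=13):
    """Return a shuffled copy of the training lines.

    Shuffling removes any ordering of documents (e.g. by date or
    source) that could bias what the model sees late in training.
    A fixed seed keeps the shuffle reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled

# Usage (paths are placeholders):
# with open("train.txt") as f:
#     lines = f.readlines()
# with open("train_shuffled.txt", "w") as f:
#     f.writelines(shuffle_training_lines(lines))
```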
Thanks @EdouardGrave. I'll run my code with maxn=0 and see how that goes.
@EdouardGrave It appeared that the training part was done right, but the search script I implemented was not working correctly.
I was initially using the following code to calculate similarity:
cossims = np.matmul(allVectors, wordVector, out = None)
I borrowed this code from the find_nearest_neighbor() function, and I'm not sure if the code is wrong or I'm misusing it.
Using the distance function from scipy.spatial solved the issue:
cosDist = distance.cosine(wordVec, word)
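For reference, the difference between the two snippets likely comes down to normalization: np.matmul computes raw dot products, which only match cosine similarity when all the vectors are L2-normalized, while scipy's distance.cosine normalizes internally. A minimal numpy sketch of the normalized version (the variable and function names are mine, not from the scripts above):

```python
import numpy as np

def cosine_similarities(all_vectors, word_vector):
    # Divide out the norms so the dot product becomes cosine similarity.
    row_norms = np.linalg.norm(all_vectors, axis=1)
    query = word_vector / np.linalg.norm(word_vector)
    return (all_vectors @ query) / row_norms

vectors = np.array([[3.0, 0.0],
                    [0.0, 2.0],
                    [1.0, 1.0]])
query = np.array([1.0, 0.0])

sims = cosine_similarities(vectors, query)
# Raw dot products would rank [3, 0] far above [1, 1] purely by
# magnitude; cosine similarity scores direction instead: 1.0, 0.0, ~0.707
```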
I'm using this command to train a model:
$ ./fasttext skipgram -input train.txt -output model
Once the model is trained, I use a python script to find the top 50 relevant words for a given word. When I train the model using default settings, I receive reasonable results using only 10,000 documents for training. As I increase the number of documents to 100,000, 1,000,000, and 50,000,000, the results get worse and worse (the suggested words become irrelevant). I tried running the algorithm with the following changes, but the results still become irrelevant as the corpus size grows:
I know in the paper they reported that results do get slightly worse when the corpus size increases (Figure 1), but I'm wondering: 1) why is this happening in general; 2) why is it happening at such a large scale for me; and 3) is there any remedy for this?