Closed StevenLOL closed 7 years ago
... 5 hours later, the speed had dropped to 7.4k words/sec/thread, and the ETA became 66 hours.
Can you check whether the phenomenon is the same with fastText on your corpus, for example when using some fake labels for sentence classification?
Hi,
I didn't use labels for a classification task; I only want to try sent2vec, and my input is one sentence (space-separated tokens) per line.
I tried training a skip-gram model with fastText on the same dataset, and there is no such phenomenon with either the official fastText or this fastText.
fasttext skipgram -input $w2vdata -output ./w2v/fasttext_300_SG_w11_del -lr 0.025 -dim 300 -minCount 20 -neg 5 -loss ns -bucket 2000000 -epoch 5
# (a version I forked in 2016, 9bfa32d ("Add link to Google group", 2016-08-10))
Read 1421M words
Progress: 0.7% words/sec/thread: 22484 lr: 0.024834 loss: 1.016304 eta: 7h16m
# the speed didn't decrease as it did during sent2vec training
Read 1421M words
Number of words: 555571
Number of labels: 0
Progress: 0.7% words/sec/thread: 24944 lr: 0.024835 loss: 1.204676 eta: 6h33m
hey @StevenLOL , I'm currently running some tests to see if the recent commit broke something, but I'm quite confident it didn't. In the meantime, check your data: you might have some very long sentences at some point that slow down the algorithm midway. I would recommend filtering your sentences with a max_sentence_length threshold and seeing if that solves the issue.
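The suggested filtering can be sketched as a small preprocessing step; the threshold value and the helper name here are illustrative, not from sent2vec itself:

```python
def filter_long_lines(lines, max_tokens=200):
    """Drop lines with more than max_tokens whitespace-separated tokens."""
    return [line for line in lines if len(line.split()) <= max_tokens]

# Usage sketch (hypothetical file names), one sentence per line:
# with open("my_sentences.txt") as src, open("filtered.txt", "w") as dst:
#     dst.writelines(filter_long_lines(src))
```

A threshold in the low hundreds of tokens keeps genuine sentences while discarding whole documents accidentally stored on one line.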
Oh, you're right, there are wiki pages saved as a single line in my database. After splitting them into multiple lines, everything is fine now.
By the way, would you please comment on how sentence length affects the word/sentence vectors?
Glad the problem is fixed :) !
By the way, would you please comment on how sentence length affects the word/sentence vectors?
Sent2vec vectors are created by simply averaging constituent n-gram vectors. The longer the chunk of text is, the more constituents you average. Averaging properly requires a good weighting scheme; those weights are learned during training, encoded in the norm of the word vectors. The learned weights will be different for long and short texts, so a sent2vec model trained on short pieces of text would not generalize well to long documents. Due to the additive nature of the embedding, sent2vec is better suited to sentences. More tests should be done to see if sent2vec could also be used on longer pieces of text; in one small experiment I did on the IMDB dataset, I found similar performance between sent2vec and doc2vec. But sent2vec is above all a sentence embedding model.
Concerning the word embeddings, we did not study the impact of the text length on the word embeddings.
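For intuition, the averaging described above can be sketched as follows. This is a toy illustration with hand-made vectors: the real model learns the vectors (with the weighting encoded in their norms) and subsamples word n-grams during training:

```python
import numpy as np

def sent2vec_style_embedding(tokens, vecs):
    """Average the vectors of all constituent unigrams and bigrams.

    `vecs` maps each n-gram (bigrams joined with '_') to a vector;
    unknown n-grams are simply skipped.
    """
    ngrams = list(tokens) + ["_".join(p) for p in zip(tokens, tokens[1:])]
    known = [vecs[g] for g in ngrams if g in vecs]
    return np.mean(known, axis=0)
```

Because the sentence vector is a plain mean, every extra constituent dilutes the rest, which is why the learned weighting matters more as texts get longer.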
Nice points, thanks.
Hi, I am trying to train a model on 1421M words, and found that the words/sec/thread keep decreasing, e.g. from 100k at 0.0% down to 8k at 11%, and now the ETA is more than 60 hours.
Here is my script for training:
./fasttext sent2vec -input my_sentences.txt -output my_model -minCount 8 -dim 300 -epoch 9 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 6 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000