epfml / sent2vec

General purpose unsupervised sentence representations

About the running speed #3

Closed StevenLOL closed 7 years ago

StevenLOL commented 7 years ago

Hi, I'm trying to train a model on 1421M words and found that the words/sec/thread keeps decreasing, e.g. from 100k at 0.0% down to 8k at 11%, and now the ETA is more than 60 hours.

Here is my script for training:

./fasttext sent2vec -input my_sentences.txt -output my_model -minCount 8 -dim 300 -epoch 9 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 6 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000

StevenLOL commented 7 years ago

... 5 hours later, the speed has dropped to 7.4k words/sec/thread, and the ETA has become 66 hours.

martinjaggi commented 7 years ago

Can you check whether the phenomenon is the same with fastText on your corpus, for example when using some fake labels for sentence classification?

StevenLOL commented 7 years ago

Hi,

I didn't use labels for a classification task; I only want to try sent2vec, and my input is one sentence (space-separated tokens) per line.

I tried training a skip-gram model with fastText on the same data set, and there is no such phenomenon with either the official fastText or this fastText.

Script for skipgram word vector training:

fasttext skipgram -input $w2vdata -output ./w2v/fasttext_300_SG_w11_del -lr 0.025 -dim 300 -minCount 20 -neg 5 -loss ns -bucket 2000000 -epoch 5

results:

official fasttext

# (a version I forked in 2016, 9bfa32d ("Add link to Google group", 2016-08-10))

Read 1421M words
Progress: 0.7%  words/sec/thread: 22484  lr: 0.024834  loss: 1.016304  eta: 7h16m 

this fasttext

# the speed didn't drop the way it did during sent2vec training

Read 1421M words
Number of words:  555571
Number of labels: 0
Progress: 0.7%  words/sec/thread: 24944  lr: 0.024835  loss: 1.204676  eta: 6h33m

mpagli commented 7 years ago

Hey @StevenLOL, I'm currently running some tests to see if the recent commit broke something, but I'm quite confident it didn't. In the meantime, check your data: you might have some very long sentences at some point that slow down the algorithm midway. I would recommend filtering your sentences with a max_sentence_length threshold and seeing if that solves the issue.
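
That filtering is just preprocessing outside sent2vec itself; a minimal Python sketch of what I mean is below. The file names and the 200-token threshold are placeholders you would adapt to your corpus.

MAX_SENTENCE_LENGTH = 200  # assumed threshold, tune for your corpus

with open("my_sentences.txt") as src, open("my_sentences_filtered.txt", "w") as dst:
    for line in src:
        tokens = line.split()
        # emit at most MAX_SENTENCE_LENGTH tokens per output line, so very long
        # "sentences" (e.g. a whole wiki page on one line) get split up rather than kept intact
        for start in range(0, len(tokens), MAX_SENTENCE_LENGTH):
            dst.write(" ".join(tokens[start:start + MAX_SENTENCE_LENGTH]) + "\n")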

StevenLOL commented 7 years ago

Oh, you're right, there are wiki pages saved as a single line in my database. After splitting them into multiple lines, everything is fine now.

By the way, could you please comment on how sentence length affects the word / sentence vectors?

mpagli commented 7 years ago

Glad the problem is fixed :) !

By the way, could you please comment on how sentence length affects the word / sentence vectors?

Sent2vec vectors are created by simply averaging the constituent n-gram vectors. The longer the chunk of text, the more constituents you average. Averaging properly requires a good weighting scheme; those weights are learned during training and encoded in the norms of the word vectors. The learned weights will be different for long and short texts, so a sent2vec model trained on short pieces of text would not generalize well to long documents. Due to the additive nature of the embedding, sent2vec is best suited to sentences. More tests should be done to see if sent2vec could also be used on longer pieces of text; in one small experiment I did on the IMDB dataset I found similar performance between sent2vec and doc2vec. But sent2vec is above all a sentence embedding model.

Concerning the word embeddings, we did not study the impact of the text length on the word embeddings.
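
To make the averaging described above concrete, here is a rough illustrative sketch in Python, not the library's actual code or API. The embeddings lookup, the dimension, and the bigram handling (mirroring -wordNgrams 2) are all assumptions for illustration.

import numpy as np

dim = 300
embeddings = {}  # hypothetical lookup: unigram or "w1 w2" bigram -> np.ndarray of shape (dim,)

def sentence_vector(tokens):
    # gather the constituents: unigrams plus adjacent-word bigrams (as with -wordNgrams 2)
    ngrams = list(tokens) + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    vectors = [embeddings[g] for g in ngrams if g in embeddings]
    if not vectors:
        return np.zeros(dim)
    # a plain average; the learned "weighting" is carried in the norms of the vectors themselves
    return np.mean(vectors, axis=0)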

StevenLOL commented 7 years ago

Nice points, thanks.