epfml / sent2vec

General purpose unsupervised sentence representations

Is there a limit on the size of the training set? #122

Closed shyyhs closed 2 years ago

shyyhs commented 2 years ago

When I train the model, the code reads only 30M words no matter how large the corpus is. The log looks like this:

Read 30M words
Number of words: 136226
Number of labels: 0
Progress: 14.5% words/sec/thread: 21063 lr: 0.171061 loss: 2.602286 eta: 0h9m

Here is the training command I use:

$FASTTEXT \
sent2vec -input $INPUT_FILE -output $OUTPUT_FILE \
-wordNgrams 2 \
-dim 768 \
-minCountLabel 20 \
-minCount 8 \
-dropoutK 4 \
-loss ns \
-neg 10 \
-lr 0.2 \
-epoch 9 \
-t 0.000005 \
-neg 10 \
-thread 20 \
-numCheckPoints 1 \
-bucket 4000000 \
-bucketChar 2000000 \

Did I miss something?

shyyhs commented 2 years ago

I used several corpora that all contain 30M words.
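
For anyone hitting the same message, one way to check whether the "Read 30M words" figure simply reflects the corpus is to count the tokens independently. The following is a minimal sketch, not part of sent2vec: `count_tokens` is a hypothetical helper, and it assumes the corpus is a plain-text file with one sentence per line and whitespace-separated tokens. The trainer may count tokens slightly differently (for example, by adding an end-of-sentence token per line), so expect the totals to be close rather than identical.

```python
def count_tokens(path, encoding="utf-8"):
    """Return (lines, tokens) for a whitespace-tokenized text file.

    Hypothetical helper for sanity-checking corpus size; it is not
    part of sent2vec or fastText.
    """
    lines = 0
    tokens = 0
    # Stream the file line by line so large corpora do not need to fit in memory.
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            lines += 1
            # str.split() with no arguments splits on any run of whitespace
            # and returns an empty list for blank lines.
            tokens += len(line.split())
    return lines, tokens
```

Calling `count_tokens("corpus.txt")` (the file name is a placeholder) returns a pair of line and token counts. If the token count is close to 30 million, the log is reporting the real corpus size rather than a cap.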