epfml / sent2vec

General purpose unsupervised sentence representations

Segmentation fault when running cbow-c+w-ngrams on Linux CentOS 7 #75

Closed: C-Dongbo closed this issue 5 years ago

C-Dongbo commented 5 years ago

./fasttext cbow-c+w-ngrams -input ../data/new.txt -output ../model/new -minCount 100 -dim 300 -ws 10 -epoch 20 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 2000000 -bucketChar 1000000 -lrUpdateRate 10000 -minn 3 -maxn 6

Read 6205M words
Number of words: 291863
Number of labels: 0
Progress: 0.1% words/sec/thread: 85004 lr: 0.199755 loss: 21.799864 eta: 20h15m
./training_cbowngrams.sh: line 1: 10367 Segmentation fault

I don't know how to debug this segmentation fault. Is there a good solution?
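For reference, one way to get more detail on a crash like this is to capture a backtrace. A minimal sketch, assuming gdb is installed and that the Makefile's CXXFLAGS can be overridden to add debug symbols (the exact flags below are an assumption; -thread 1 and the omitted options are just to keep the trace easy to read):

make clean && make CXXFLAGS="-g -O0 -pthread -std=c++0x"   # rebuild with debug symbols (assumed flags)
gdb --args ./fasttext cbow-c+w-ngrams -input ../data/new.txt -output ../model/new -thread 1   # other flags omitted for brevity
(gdb) run
(gdb) bt    # print the stack trace once the segfault hits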

mpagli commented 5 years ago

Hi. Can you reproduce this with a smaller corpus? Does it crash at the beginning or more toward the middle?

C-Dongbo commented 5 years ago

Hi. I experimented with several smaller corpora (10,000, 100,000, and 1,000,000 lines). The 10,000- and 100,000-line corpora didn't trigger the error, but the 1,000,000-line corpus crashed in the middle (21012 Segmentation fault).
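Since the 1,000,000-line corpus is the smallest failing case, one way to narrow it down further is to bisect it. A rough sketch with hypothetical file names (adjust paths and the remaining training flags to match the original command):

head -n 500000 corpus_1m.txt > half_a.txt
tail -n 500000 corpus_1m.txt > half_b.txt
./fasttext cbow-c+w-ngrams -input half_a.txt -output /tmp/test_a -minCount 100 -dim 300 -thread 1
# repeat on whichever half crashes, until the offending lines are isolated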

martinjaggi commented 5 years ago

Sorry, we can't reproduce this unless you provide a working example of the failing case. Maybe you have some extremely long lines in that larger corpus? You could also try fastText itself, to see whether it fails on the same corpus or not.
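To check the long-line hypothesis, a quick sketch (assuming the corpus path from the original command):

awk 'length > max { max = length; line = NR } END { print max, "chars at line", line }' ../data/new.txt

And for the fastText comparison, roughly the following should work (plain cbow, since upstream fastText has no cbow-c+w-ngrams mode; adjust the corpus path to where you run it from):

git clone https://github.com/facebookresearch/fastText.git
cd fastText && make
./fasttext cbow -input ../data/new.txt -output /tmp/ft_test -minCount 100 -dim 300 -thread 20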