Did you check whether the problem also occurs with smaller standard input, and whether your custom input data contains some extremely long lines?
I don't think the problem is you not having enough RAM: once the matrices are allocated, you don't have a huge memory overhead. The input format is indeed one sentence per line, but I don't think the problem is coming from there either. Did you check that the loss is decreasing properly?
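For illustration, a toy file in that format could be produced like this (a sketch; the sentences and the file name are made up, not from the original corpus):

```sh
# Each line is one tokenized, lowercased sentence; toy content for illustration only.
printf '%s\n' \
  'the cat sat on the mat' \
  'dogs are great companions' \
  > toy_corpus.txt
```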
Hey @flipsidetalk, any news? Did you manage to solve your issue?
Hi @martinjaggi @mpagli, could we reopen this issue please? I have the same problem. A bit of context: I am working with the Thai and Vietnamese alphabets in addition to the Latin one (but that should be fine since you use UTF-8). I wanted to find the phrases that cause the segfault, but very surprisingly the segfault disappears when the dataset contains fewer than 600 phrases (i.e. if I get a segfault with a dataset of 800 examples and run sent2vec separately on the first 500 examples and on the last 500 examples, I don't get it anymore!).
It also seems that the segfault appears randomly when the number of phrases is less than 800. For my dataset it never appears below 600 phrases, appears about 50% of the time when the dataset is between 600 and 800 phrases, and always appears above 800 phrases. Of course I don't think there is anything special about these numbers, but it gives you an order of magnitude.
Unfortunately I cannot share the data, which means it will be hard for you to help me, but do you have any suggestions for what I could try in order to find the root cause and share it with you?
PS: it's worth mentioning that I do not have any issues with fasttext on the same dataset.
I just found out that using -dropoutK 0 removes the segfault. I still do not completely understand why: if the problem were only the length of the phrases, I should also have had the error when further splitting the dataset.
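For reference, the workaround looks roughly like this (a sketch; the input and output names are placeholders, and the remaining hyperparameters are whatever you normally use):

```sh
# Passing -dropoutK 0 removed the segfault in my case.
./fasttext sent2vec -input all_sentences.txt -output my_model -dropoutK 0
```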
Can you try to filter out all the sentences shorter than 2 or 3 tokens and check if the bug is still there?
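Something along these lines should do it (a sketch; the file names are placeholders):

```sh
# Keep only lines with at least 3 whitespace-separated tokens.
awk 'NF >= 3' all_sentences.txt > filtered_sentences.txt
```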
I have no sentences with fewer than 3 words. Filtering out sentences of 3 words didn't help either. And I tried with wordNgrams = 1 and 2.
One thing that might be happening is that you have some very long sentences in your corpus. Fasttext uses this check when reading a line:
if (ntokens > MAX_LINE_SIZE && args_->model != model_name::sup) break;
where MAX_LINE_SIZE is by default 1024. This check is deactivated for supervised fasttext and sent2vec. It's hard to debug without having the corpus, but I would try to see if you have some very long sentences hidden in your data.
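A quick way to look for such sentences could be something like this (a sketch; the file name is a placeholder, and 1024 mirrors the default MAX_LINE_SIZE, which counts tokens):

```sh
# Print the token counts of the 5 longest lines; anything above 1024 tokens
# exceeds the default MAX_LINE_SIZE.
awk '{ print NF }' all_sentences.txt | sort -n | tail -5

# Or list the line numbers and token counts of all offending lines directly.
awk 'NF > 1024 { print NR, NF }' all_sentences.txt
```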
Thanks @mpagli, I didn't realise there was this boundary on line length. This was the problem! In case you were wondering, I stop having the issue with MAX_LINE_SIZE=450 (I still have it at 500). This explains why I only had this issue with sent2vec and not with fasttext's skipgram.
I have this problem. It's extremely annoying. It seems to happen completely at random.
Before, I once managed to get through the whole training on the same data, but with a lower dimensionality and epoch count. Now, for my final model, I'd like to ramp up both. So far I've tried 4 times, and each time I get a segfault at a different point in the training.
I have plenty of RAM left, and after reading this thread I filtered out all lines with more than 450 characters. It's still happening.
An interesting note though: the same thing also happened to me when I trained FastText word vectors on the same dataset. After several tries, it finally worked. So it might be a bug in their code. Either way, it's extremely frustrating and expensive.
Ok it seems I managed to fix it. It was probably related to some lines having too many words... I added the following preprocessing steps to my pipeline and one of them must have done the trick:
- awk 'NF<100' (drop sentences with 100 or more words)
- awk 'NF>2' (drop sentences with 2 or fewer words)
- iconv -f UTF-8 -t ASCII//TRANSLIT (convert unicode characters to ASCII)
- sed '/^.\{450\}./d' (drop lines with more than 450 characters)
- awk 'length($0)>15' (drop lines with 15 or fewer characters)
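Chained together, that filtering step could look roughly like this (a sketch; corpus.txt and corpus.clean.txt are placeholder file names):

```sh
# Keep lines with 3-99 words, transliterate to ASCII, then drop lines
# longer than 450 characters or shorter than 16 characters.
awk 'NF < 100' corpus.txt \
  | awk 'NF > 2' \
  | iconv -f UTF-8 -t ASCII//TRANSLIT \
  | sed '/^.\{450\}./d' \
  | awk 'length($0) > 15' \
  > corpus.clean.txt
```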
Hi! I wanted to train a new model on a dataset of news articles from various sources that I made. I cleaned and tokenized the dataset into one .txt file. I keep getting a segfault with the following output:
Siddharths-MacBook-Pro-2:sent2vec siddharth$ ./fasttext sent2vec -input all_sentences.txt -output my_model -epoch 9 -thread 4
Read 9M words
Number of words: 45770
Number of labels: 0
Progress: 42.2% words/sec/thread: 205950 lr: 0.115647 loss: 2.235114 eta: 0h1m
Segmentation fault: 11
I'm not sure if this is a memory issue that requires a machine with more RAM or an issue in my input data. As I understand it, the input .txt file should contain lowercase sentences separated by newlines, with no non-alphabetic characters. If this is wrong, what should the exact format of my input data be?