epfml / sent2vec

General purpose unsupervised sentence representations

Segmentation fault when running sent2vec on a Mac #63

Closed: DavidV17 closed this issue 5 years ago

DavidV17 commented 5 years ago

$ ./fasttext sent2vec -input sentences_tokenized.txt -output sent2vec_active_file_01_15_model_10 -minCount 8 -dim 10 -epoch 9 -lr 0.1 -wordNgrams 2 -loss ns -neg 10 -thread 1
Read 2M words
Number of words: 6783
Number of labels: 0
Progress: 38.2% words/sec/thread: 814458 lr: 0.061815 loss: 2.463046 eta: 0h0m
Segmentation fault: 11

vsolovyov commented 5 years ago

I tried to run sent2vec in debug mode and got this info from the crash:

[...]
Crashed Thread:        2

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000018
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Segmentation fault: 11
Termination Reason:    Namespace SIGNAL, Code 0xb
Terminating Process:   exc handler [50163]

[...]

Thread 2 Crashed:
0   libsystem_malloc.dylib          0x00007fff77a2d2f2 tiny_free_no_lock + 896
1   libsystem_malloc.dylib          0x00007fff77a2ce75 free_tiny + 480
2   fasttext                        0x000000010166bcc9 std::__1::__libcpp_deallocate(void*, unsigned long) + 25 (new:272)
3   fasttext                        0x0000000101677424 std::__1::allocator<unsigned long>::deallocate(unsigned long*, unsigned long) + 36 (memory:1816)
4   fasttext                        0x00000001016773a5 std::__1::allocator_traits<std::__1::allocator<unsigned long> >::deallocate(std::__1::allocator<unsigned long>&, unsigned long*, unsigned long) + 37 (memory:1554)
5   fasttext                        0x000000010167735b std::__1::vector<bool, std::__1::allocator<bool> >::~vector() + 75 (vector:2825)
6   fasttext                        0x0000000101669625 std::__1::vector<bool, std::__1::allocator<bool> >::~vector() + 21 (vector:2826)
7   fasttext                        0x0000000101669a2f fasttext::Dictionary::addNgrams(std::__1::vector<int, std::__1::allocator<int> >&, int, int, std::__1::linear_congruential_engine<unsigned int, 48271u, 0u, 2147483647u>&) const + 1023 (dictionary.cc:401)
8   fasttext                        0x00000001016994ab fasttext::FastText::sent2vec(fasttext::Model&, float, std::__1::vector<int, std::__1::allocator<int> > const&) + 555 (fasttext.cc:469)
9   fasttext                        0x000000010169f273 fasttext::FastText::trainThread(int) + 1331 (fasttext.cc:961)
10  fasttext                        0x00000001016bbfce fasttext::FastText::train(std::__1::shared_ptr<fasttext::Args>)::$_4::operator()() const + 30 (fasttext.cc:1138)
11  fasttext                        0x00000001016bbf6d decltype(std::__1::forward<fasttext::FastText::train(std::__1::shared_ptr<fasttext::Args>)::$_4>(fp)()) std::__1::__invoke<fasttext::FastText::train(std::__1::shared_ptr<fasttext::Args>)::$_4>(fasttext::FastText::train(std::__1::shared_ptr<fasttext::Args>)::$_4&&) + 29 (type_traits:4339)
12  fasttext                        0x00000001016bbf15 void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, fasttext::FastText::train(std::__1::shared_ptr<fasttext::Args>)::$_4>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, fasttext::FastText::train(std::__1::shared_ptr<fasttext::Args>)::$_4>&, std::__1::__tuple_indices<>) + 37 (thread:343)
13  fasttext                        0x00000001016bbbd6 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, fasttext::FastText::train(std::__1::shared_ptr<fasttext::Args>)::$_4> >(void*) + 118 (thread:352)
14  libsystem_pthread.dylib         0x00007fff77a732eb _pthread_body + 126
15  libsystem_pthread.dylib         0x00007fff77a76249 _pthread_start + 66
16  libsystem_pthread.dylib         0x00007fff77a7240d thread_start + 13

Thread 2 crashed with X86 Thread State (64-bit):
  rax: 0x0000000021fde251  rbx: 0x00000000ca10e17d  rcx: 0x00000000ca0fe392  rdx: 0x0000000021fee03c
  rdi: 0x00000001017ed080  rsi: 0x00007ffd94900000  rbp: 0x0000700007f005b0  rsp: 0x0000700007f00530
   r8: 0x0000000000000000   r9: 0x0000000000000010  r10: 0x000000000000004c  r11: 0x900007ffd949a33d
  r12: 0x0000000000000080  r13: 0x0000000000000a34  r14: 0x00007ffd94900000  r15: 0x00007ffd949a33d0
  rip: 0x00007fff77a2d2f2  rfl: 0x0000000000010283  cr2: 0x0000000000000018
[...]

I don't know C++, so I may be wrong, but to me it looks like it died on exit from the function Dictionary::addNgrams (https://github.com/epfml/sent2vec/blob/master/src/dictionary.cc#L379) while trying to deallocate the std::vector<bool> discard.

mpagli commented 5 years ago

To help debug this, could you indicate whether you have empty lines in your training data, and what the min and max lengths of your lines are? Also, some tokens, such as </s>, are special. The code should be robust to these cases, but since we trained on quite clean data (removing sentences that were too short or too long) we might have missed an edge case.
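If it helps, a standalone check along these lines (just a sketch, not part of sent2vec; the file name is the one from the report above) would show whether there are empty lines and how short or long the lines get:

// Hypothetical standalone check (not part of sent2vec): counts empty lines and
// reports the min/max number of tokens per line in the training file.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <limits>
#include <sstream>
#include <string>

int main() {
  std::ifstream in("sentences_tokenized.txt");   // file name taken from the report
  std::string line;
  std::size_t empty = 0;
  std::size_t minTokens = std::numeric_limits<std::size_t>::max(), maxTokens = 0;
  while (std::getline(in, line)) {
    std::istringstream tokens(line);
    std::size_t count = 0;
    std::string tok;
    while (tokens >> tok) ++count;               // whitespace-separated tokens
    if (count == 0) ++empty;
    minTokens = std::min(minTokens, count);
    maxTokens = std::max(maxTokens, count);
  }
  std::cout << "empty lines: " << empty
            << ", min tokens: " << minTokens
            << ", max tokens: " << maxTokens << std::endl;
  return 0;
}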

vsolovyov commented 5 years ago

@mpagli I managed to get a core dump, and, digging through it, this is what I found:

It died with line_size = 940, so it's not an empty line. The training data also doesn't contain any weird tokens; it's all cleaned up.

It looks like the culprit is the way the std::uniform_int_distribution<> uniform gets used. In this function it is defined as uniform(1, line_size). I checked the C++ reference and it says:

Produces random integer values i, uniformly distributed on the closed interval [a, b], that is, distributed according to the discrete probability function

So what happens, I think, is that sometimes uniform produces token_to_discard == line_size. I'll insert an assert and see whether it triggers. And if that's the case, then discard gets written to out of range. Here is what I found about out-of-range access through operator[]:

std::vector's operator[] won't throw any out-of-range error. It's undefined behaviour to access an element beyond the vector's size using operator[].

So basically on Linux it overwrites some memory and it goes unchecked, and on OSX it dies because of some security measures (maybe it's the System Integrity Protection? I don't know much about it).
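To make that concrete, here is a minimal sketch of the pattern I suspect, using the names from above (the real addNgrams code may of course differ in the details):

// Minimal sketch of the suspected pattern; the names uniform, line_size,
// token_to_discard and discard come from my reading of addNgrams, not copied
// verbatim from dictionary.cc.
#include <random>
#include <vector>

int main() {
  const int line_size = 940;                               // line size from the core dump
  std::vector<bool> discard(line_size, false);             // one flag per token in the line
  std::minstd_rand rng(42);                                // same engine family as in the trace
  std::uniform_int_distribution<> uniform(1, line_size);   // upper bound is inclusive!
  int token_to_discard = uniform(rng);                     // can be equal to line_size
  discard[token_to_discard] = true;                        // out of bounds when it is (undefined behaviour)
  return 0;
}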

vsolovyov commented 5 years ago

Okay, I tried to run it with the assertion and it promptly died:

Assertion failed: (token_to_discard < line_size), function addNgrams, file src/dictionary.cc, line 387.

I think it uses the inclusive range [1, line_size] instead of [0, line_size - 1] because whoever wrote this code was more used to a random library with an exclusive upper bound, like [1, line_size).

If I change the upper bound to line_size - 1, it runs fine for a long time. I'm not sure whether the lower limit should also be changed from 1 to 0? It looks like it should, since the algorithm tries to uniformly discard some tokens, and the first token isn't exactly special.
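Roughly what I have in mind (the same hypothetical sketch as above, not the actual patch):

// Sketch of the proposed fix: sample over the valid index range
// [0, line_size - 1] instead of [1, line_size].
#include <random>
#include <vector>

int main() {
  const int line_size = 940;
  std::vector<bool> discard(line_size, false);
  std::minstd_rand rng(42);
  std::uniform_int_distribution<> uniform(0, line_size - 1);
  discard[uniform(rng)] = true;                            // always within bounds now
  return 0;
}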

I'll fix this in a PR

vsolovyov commented 5 years ago

@DavidV17 my fix for this issue got merged; did it fix your problem as well?