epfml / sent2vec

General purpose unsupervised sentence representations
Other
1.19k stars 256 forks source link

Fix segfaults on OSX #73

Closed vsolovyov closed 5 years ago

vsolovyov commented 5 years ago

As investigated in https://github.com/epfml/sent2vec/issues/63:

uniform_int_distribution produces random integer values i, uniformly distributed on the closed interval [a, b]. Sometimes it produced a value that was equal to line_size, hence the out of bounds access on vector<bool> discard, which produced segfaults on OSX on the exit from the function.

mastak commented 5 years ago

What situation with this PR?

I faced with same issue and it looks like solution. This PR fixed error, at least with my data.

mpagli commented 5 years ago

The PR looks good, I'm just puzzled by how something that big was able to slip in the cracks. Especially after running the code on TBs of sentences without any issue. Anyhow, LGTM...

vsolovyov commented 5 years ago

@mpagli I think it slipped through because you need a really unique combination for it to segfault in Linux. This loop runs only dropoutK times per line, which is 2 by default. Then, discard[token_to_discard] will go beyond allocated memory only if the end of vector<bool> discard is lined up with a 4k page, the next page is not allocated and uniform_int_distribution returned max value.

Quick search says that vector<bool> is a special one and can store values in individual bits, hence it's really-really rare for its end to line up with an end of the page. How long of a sentence you need? 32 768 words exactly?

Also, I think this vector is allocated on stack (no new), and not on the heap, and it could mean it must align with not just a random memory page, but with the end of the stack for it to segfault, which is pretty much unreal. I could be wrong here, as I don't know C++.