cocreature / thrill

Thrill - An EXPERIMENTAL Algorithmic Distributed Big Data Batch Processing Framework in C++
http://project-thrill.org

Precisions > 14 #6

Closed: cocreature closed this issue 7 years ago

cocreature commented 7 years ago

For some reason the tests do not seem to terminate in a reasonable timeframe if a precision > 14 is used, while they terminate in a few milliseconds otherwise. I don’t see why a precision of 15 should take more than twice as long as 14, so there is probably a bug somewhere.

cocreature commented 7 years ago

This is really weird. If I put a print statement before the call to HyperLogLog<15> in the test, it is never executed. Thrill uses a ton of CPU, but my actual code does not seem to run. Maybe some memory allocation fails and this is not handled correctly?

cocreature commented 7 years ago

The culprit seems to be the call to allreduce in Execute; removing this line makes it work. So maybe Thrill can’t handle big arrays?

cocreature commented 7 years ago

After implementing the sparse representation I’m seeing the same problem, i.e. the lambda we pass to Thrill is never executed, but now it happens even for a precision of 4. There is something really weird going on here.

@TiFu Do you have any idea what could be causing this? Otherwise we should probably ask Timo, since digging through the Thrill internals myself to figure out why the function is not called is probably not going to be very productive.

TiFu commented 7 years ago

I'll finish the second improvement (90% done) and then take a look at this issue.

TiFu commented 7 years ago

The only thing that changes when increasing p is the number of registers (and the number of hash bits used for the register index).
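For concreteness, a tiny sketch of how that scales (illustrative only, not our actual implementation): the register array doubles with every increment of p.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative only: with the usual HyperLogLog layout and a 64-bit hash,
    // the top p bits of the hash select a register and the remaining bits feed
    // the leading-zero count, so the register array doubles with every +1 in p.
    for (unsigned p = 14; p <= 16; ++p) {
        std::uint64_t registers = std::uint64_t(1) << p;  // 2^p registers
        unsigned value_bits = 64 - p;                      // bits left for the value
        std::printf("p = %u -> %llu registers, %u hash bits for the value\n",
                    p, static_cast<unsigned long long>(registers), value_bits);
    }
}
```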

Does the sparse representation create large arrays? (>= 2^15)

If the call to allreduce is really what breaks, large arrays could be the cause. It might be easier to talk to Timo about limitations on the array size in Thrill than to dig through the Thrill internals.

Can you send him an e-mail and ask whether he knows of any such limitation?

cocreature commented 7 years ago

Does the sparse representation create large arrays? (>= 2^15)

No, it doesn’t. Also, the call to allreduce does not seem to be the culprit in that case. I’ll try to come up with a minimal example and then send it to Timo.

cocreature commented 7 years ago

So here are the results of my investigation so far: one thread seems to be stuck in the malloc tracker, at https://github.com/cocreature/thrill/blob/thrill-bug/thrill/mem/malloc_tracker.cpp#L213. GDB claims that tl_stats points to 0, which would explain why we crash. I don’t know why that should be the case, but it gets even weirder:

When I try to insert a printf statement at this point that prints the address of tl_stats, I suddenly get the error message malloc_tracker ### init heap full !!! and the program immediately stops. This happens for all register sizes, even 0.

The next step is disabling HAVE_THREAD_LOCAL. I haven’t yet figured out what the problem is in that case, but I’ll continue digging into this tomorrow.

cocreature commented 7 years ago

It looks like I was wrong: the problem is not caused by the malloc tracker but by the call to the send syscall here. This call doesn’t seem to return, and I have yet to figure out why that’s the case.

cocreature commented 7 years ago

Alright, I found the problem: the hypercube reduce implementation proceeds by having both nodes first call send on their data and only then call recv. This works fine as long as the socket buffers are large enough for the send calls to return. However, if the buffers fill up, send blocks until the other side calls recv, and since both sides are still stuck in send, this results in a deadlock.
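For illustration, here is a minimal standalone sketch of the pattern (plain POSIX sockets, not the actual Thrill code): both ends first send a payload larger than the socket buffers and only then receive, so both block in send and the program hangs.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// Each peer first sends its whole payload and only then receives. With a
// payload larger than the socket buffers, both peers block inside send()
// and neither ever reaches recv(): the same deadlock as in the issue.
static void exchange(int fd, std::size_t bytes) {
    std::vector<char> out(bytes, 'x'), in(bytes);
    std::size_t sent = 0;
    while (sent < bytes) {                       // step 1: send everything
        ssize_t r = send(fd, out.data() + sent, bytes - sent, 0);
        if (r <= 0) return;
        sent += static_cast<std::size_t>(r);
    }
    std::size_t got = 0;
    while (got < bytes) {                        // step 2: only now receive
        ssize_t r = recv(fd, in.data() + got, bytes - got, 0);
        if (r <= 0) return;
        got += static_cast<std::size_t>(r);
    }
    std::printf("exchange of %zu bytes finished\n", bytes);
}

int main() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return 1;
    // A few MiB comfortably exceeds the default socket buffers, so this
    // program hangs; a payload of a few KiB would complete normally.
    std::thread a(exchange, fds[0], std::size_t(8) << 20);
    std::thread b(exchange, fds[1], std::size_t(8) << 20);
    a.join();
    b.join();
}
```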

An easy workaround is to just increase the buffer size, e.g. by removing the if around https://github.com/cocreature/thrill/blob/master/thrill/net/tcp/socket.hpp#L61
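For reference, the kind of buffer enlargement meant here looks roughly like this when sketched with plain setsockopt calls (illustrative only, not the code behind the linked line):

```cpp
#include <sys/socket.h>

// Ask the kernel for bigger send/receive buffers on a socket. The kernel may
// clamp the values, so this only shrinks the window for the deadlock above;
// it does not remove it.
static void enlarge_buffers(int fd, int bytes) {
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes));
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}
```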

A proper fix would be to swap the order of the send and recv calls on one of the two nodes, but for now we can at least continue working until I get around to implementing that.
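Roughly what I have in mind, as an illustrative sketch; send_block and recv_block are placeholders for whatever blocking transport the real implementation uses, not Thrill’s actual interface:

```cpp
#include <cstddef>

// Hypothetical blocking transport helpers, stand-ins for the real connection
// object; their bodies are not the point here.
void send_block(std::size_t peer, const void* data, std::size_t bytes);
void recv_block(std::size_t peer, void* data, std::size_t bytes);

// One dimension of a hypercube exchange: partners are the ranks differing in
// bit `dimension`. The two partners must not both send first, so the calls
// are ordered by rank: the lower rank sends then receives, the higher rank
// receives then sends. No pair can then deadlock inside send().
void hypercube_step(std::size_t my_rank, std::size_t dimension,
                    void* send_buf, void* recv_buf, std::size_t bytes) {
    std::size_t peer = my_rank ^ (std::size_t(1) << dimension);
    if (my_rank < peer) {
        send_block(peer, send_buf, bytes);
        recv_block(peer, recv_buf, bytes);
    } else {
        recv_block(peer, recv_buf, bytes);
        send_block(peer, send_buf, bytes);
    }
}
```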

cocreature commented 7 years ago

I’ve pushed a really hacky fix that swaps the order of send and recv calls and thereby avoids the deadlock.

TiFu commented 7 years ago

Nice work! Did you send Timo an e-mail about this issue?

cocreature commented 7 years ago

Yep, he confirmed that this is a bug.