IntelligentSoftwareSystems / Galois

Galois: C++ library for multi-core and multi-node parallelization
http://iss.ices.utexas.edu/?p=projects/galois
Other
314 stars 133 forks source link

Locking issue with AdaptiveObim (PMOD) #349

Closed bozhiyou closed 4 years ago

bozhiyou commented 4 years ago
#0  0x00007ffff64e8387 in raise () from /lib64/libc.so.6
#1  0x00007ffff64e9a78 in abort () from /lib64/libc.so.6
#2  0x00007ffff64e11a6 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ffff64e1252 in __assert_fail () from /lib64/libc.so.6
#4  0x0000000000411bff in galois::substrate::SimpleLock::unlock (this=0x2aaaab201080)
    at /workspace/GaloisCpp/libgalois/include/galois/substrate/SimpleLock.h:69
#5  0x0000000000411f4c in galois::substrate::PaddedLock<true>::unlock (this=0x2aaaab201080)
    at /workspace/GaloisCpp/libgalois/include/galois/substrate/PaddedLock.h:41
#6  0x0000000000446f7a in galois::worklists::AdaptiveOrderedByIntegerMetric<void FastMarching<true, galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>, galois::InsertBag<std::pair<double, unsigned int>, 0u>, unsigned int, std::pair<double, unsigned int> >(galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>&, galois::InsertBag<std::pair<double, unsigned int>, 0u>&)::{lambda(std::pair<double, unsigned int> const&)#2}, galois::worklists::internal::ChunkMaster<std::pair<double, unsigned int>, galois::worklists::ConExtLinkedStack, true, true, 32, true>, 0, true, false, 64, std::pair<double, unsigned int>, unsigned int, false, false, true>::slowPop(galois::worklists::internal::ChunkMaster<std::pair<double, unsigned int>, galois::worklists::ConExtLinkedStack, true, true, 32, true>::ThreadData&) (this=0x7fffffff8a00, p=...)
    at /workspace/GaloisCpp/libgalois/include/galois/worklists/AdaptiveObim.h:417
#7  0x0000000000445598 in galois::worklists::AdaptiveOrderedByIntegerMetric<void FastMarching<true, galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>, galois::InsertBag<std::pair<double, unsigned int>, 0u>, unsigned int, std::pair<double, unsigned int> >(galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>&, galois::InsertBag<std::pair<double, unsigned int>, 0u>&)::{lambda(std::pair<double, unsigned int> const&)#2}, galois::worklists::internal::ChunkMaster<std::pair<double, unsigned int>, galois::worklists::ConExtLinkedStack, true, true, 32, true>, 0, true, false, 64, std::pair<double, unsigned int>, unsigned int, false, false, true>::pop() (this=0x7fffffff8a00)
    at /workspace/GaloisCpp/libgalois/include/galois/worklists/AdaptiveObim.h:551
#8  0x00000000004441df in galois::runtime::ForEachExecutor<galois::worklists::AdaptiveOrderedByIntegerMetric<void FastMarching<true, galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>, galois::InsertBag<std::pair<double, unsigned int>, 0u>, unsigned int, std::pair<double, unsigned int> >(galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>&, galois::InsertBag<std::pair<double, unsigned int>, 0u>&)::{lambda(std::pair<double, unsigned int> const&)#2}, galois::worklists::internal::ChunkMaster<std::pair<double, unsigned int>, galois::worklists::ConExtLinkedStack, true, true, 32, true>, 0, true, false, 64, std::pair<double, unsigned int>, unsigned int, false, false, true>, void FastMarching<true, galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>, galois::InsertBag<std::pair<double, unsigned int>, 0u>, unsigned int, std::pair<double, unsigned int> >(galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>&, galois::InsertBag<std::pair<double, unsigned int>, 0u>&)::{lambda(auto:1 const&, auto:2&)#1}&, std::tuple<galois::disable_conflict_detection, galois::s_wl<galois::worklists::AdaptiveOrderedByIntegerMetric<{lambda(std::pair<double, unsigned int> const&)#2}, galois::worklists::internal<int, galois::worklists::internal::ChunkMaster, true, true, 32, true>, 0, true, false, 64, int, int, false, false, true>, {lambda(std::pair<double, unsigned int> const&)#2}&>, galois::loopname> >::runQueueSimple(galois::loopname::ThreadLocalData&) (this=0x7fffffff8980,
    tld=...) at /workspace/GaloisCpp/libgalois/include/galois/runtime/Executor_ForEach.h:249
#9  0x0000000000443363 in galois::runtime::ForEachExecutor<galois::worklists::AdaptiveOrderedByIntegerMetric<void FastMarching<true, galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>, galois::InsertBag<std::pair<double, unsigned int>, 0u>, unsigned int, std::pair<double, unsigned int> >(galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>&, galois::InsertBag<std::pair<double, unsigned int>, 0u>&)::{lambda(std::pair<double, unsigned int> const&)#2}, galois::worklists::internal::ChunkMaster<std::pair<double, unsigned int>, galois::worklists::ConExtLinkedStack, true, true, 32, true>, 0, true, false, 64, std::pair<double, unsigned int>, unsigned int, false, false, true>, void FastMarching<true, galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>, galois::InsertBag<std::pair<double, unsigned int>, 0u>, unsigned int, std::pair<double, unsigned int> >(galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>&, galois::InsertBag<std::pair<double, unsigned int>, 0u>&)::{lambda(auto:1 const&, auto:2&)#1}&, std::tuple<galois::disable_conflict_detection, galois::s_wl<galois::worklists::AdaptiveOrderedByIntegerMetric<{lambda(std::pair<double, unsigned int> const&)#2}, galois::worklists::internal<int, galois::worklists::internal::ChunkMaster, true, true, 32, true>, 0, true, false, 64, int, int, false, false, true>, {lambda(std::pair<double, unsigned int> const&)#2}&>, galois::loopname> >::go<false, false>() (this=0x7fffffff8980)
    at /workspace/GaloisCpp/libgalois/include/galois/runtime/Executor_ForEach.h:336
#10 0x0000000000441aa0 in galois::runtime::ForEachExecutor<galois::worklists::AdaptiveOrderedByIntegerMetric<void FastMarching<true, galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>, galois::InsertBag<std::pair<double, unsigned int>, 0u>, unsigned int, std::pair<double, unsigned int> >(galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>&, galois::InsertBag<std::pair<double, unsigned int>, 0u>&)::{lambda(std::pair<double, unsigned int> const&)#2}, galois::worklists::internal::ChunkMaster<std::pair<double, unsigned int>, galois::worklists::ConExtLinkedStack, true, true, 32, true>, 0, true, false, 64, std::pair<double, unsigned int>, unsigned int, false, false, true>, void FastMarching<true, galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>, galois::InsertBag<std::pair<double, unsigned int>, 0u>, unsigned int, std::pair<double, unsigned int> >(galois::graphs::LC_CSR_Graph<NodeData, void, true, false, false, void>&, galois::InsertBag<std::pair<double, unsigned int>, 0u>&)::{lambda(auto:1 const&, auto:2&)#1}&, std::tuple<galois::disable_conflict_detection, galois::s_wl<galois::worklists::AdaptiveOrderedByIntegerMetric<{lambda(std::pair<double, unsigned int> const&)#2}, galois::worklists::internal<int, galois::worklists::internal::ChunkMaster, true, true, 32, true>, 0, true, false, 64, int, int, false, false, true>, {lambda(std::pair<double, unsigned int> const&)#2}&>, galois::loopname> >::operator()() (this=0x7fffffff8980)
    at /workspace/GaloisCpp/libgalois/include/galois/runtime/Executor_ForEach.h:414

The corresponding try_lock() is at AdaptiveObim.h#L529 WRT the unlock() at AdaptiveObim.h#L417.

bozhiyou commented 4 years ago

This issue only exists under release build. Debug build works.

bozhiyou commented 4 years ago

Possible cause: the lock-unlock pair at L299 and L336 may violate other locks. A bitset should help. Working on this.

insertinterestingnamehere commented 4 years ago

Thanks for looking in to this. Did you mean to say that this only shows up in a Debug build? The traceback is for an assertion failure, and those can only happen in debug builds.

insertinterestingnamehere commented 4 years ago

My current best guess for what's going on is that there may be some double acquiring/releasing of a given lock going on and that the current locks in Galois aren't recursive. Depending on when both release calls happen, this could be harmless, but still trigger the assertion failure that we're seeing.

bozhiyou commented 4 years ago

Did you mean to say that this only shows up in a Debug build? The traceback is for an assertion failure, and those can only happen in debug builds.

Yes, my mistake.

My current best guess for what's going on is that there may be some double acquiring/releasing of a given lock going on and that the current locks in Galois aren't recursive. Depending on when both release calls happen, this could be harmless, but still trigger the assertion failure that we're seeing.

Most likely. It's fine as long as harmless.

insertinterestingnamehere commented 4 years ago

FWIW, it looks like this probably isn't harmless since there are a bunch of correctness errors that still show up in release mode. I'm trying out logging the locks/unlocks to see what's going on.

bozhiyou commented 4 years ago

If the correctness errors you've seen are just overflow of error bounds, that should not be caused by the unlock issue and you can just tune the option -e to a higher bound. This bound is not adaptive for now (2e-6) but should really be input dependent, and I'm not sure what's the reasonable bound in terms of algebra (max distance between cells or something?).

On Tue, Jul 21, 2020 at 16:47 Ian Henriksen notifications@github.com wrote:

FWIW, it looks like this probably isn't harmless since there are a bunch of correctness errors that still show up in release mode. I'm trying out logging the locks/unlocks to see what's going on.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/IntelligentSoftwareSystems/Galois/issues/349#issuecomment-662123400, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE6H577PIM4TYQ5G2GZRRJLR4YEGVANCNFSM4PCYEZFA .

insertinterestingnamehere commented 4 years ago

Even for a really huge input I'd be surprised if there were any correctness errors above 1E-6 and we're still seeing errors above 0.1. The error goes way down in the sequential case too. I'm surprised there's any error at all. Is the error reporting just from the local check at each node?

This is also concerning because the error is on the order of 1E-8 in the sequential case. The result should be completely deterministic regardless of the scheduling strategy. I think we're either seeing some kind of memory corruption or the adaptive OBIM scheduler is dropping work items somehow.

insertinterestingnamehere commented 4 years ago

New info: logging the lock and unlock operations shows that the corresponding lock operation did actually occur, but somehow the lock is getting unlocked before it should anyway.

bozhiyou commented 4 years ago

Yes, the sanity check is local at each node. Just put the final result back into the upwind formulation and compute the error.

Let me check if anything goes wrong elsewhere. For tests you can remove the prefix Adaptive from OderedByIntegerMetric in fmm.cpp (only one match).

On Tue, Jul 21, 2020 at 18:01 Ian Henriksen notifications@github.com wrote:

Even for a really huge input I'd be surprised if there were any correctness errors above 1E-6 and we're still seeing errors above 0.1. The error goes way down in the sequential case too. I'm surprised there's any error at all. Is the error reporting just from the local check at each node?

This is also concerning because the error is on the order of 1E-8 in the sequential case. The result should be completely deterministic regardless of the scheduling strategy. I think we're either seeing some kind of memory corruption or the adaptive OBIM scheduler is dropping work items somehow.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/IntelligentSoftwareSystems/Galois/issues/349#issuecomment-662149396, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE6H57ZM2N5L6EFM6X3TVILR4YM35ANCNFSM4PCYEZFA .

insertinterestingnamehere commented 4 years ago

Yah, I'm seeing similar errors with just ordinary OBIM, so it looks like there's still some sort of lingering correctness issue with the fast marching code that's distinct from the assertion failure we're seeing here. That's nice since we can debug the two problems separately. Nothing about this assertion failure seemed likely to cause correctness errors.

insertinterestingnamehere commented 4 years ago

Okay, I've been mulling this over while doing other things. Here's another idea on what could be causing the unlock assertion failure: threads are unlocking locks they never held to begin with and we don't see the failure until the thread that actually should have kept ownership of the lock runs its unlock operation. I just checked the logs of which thread acquired and released which lock and they seem to support this theory.

Interestingly enough, the existing Galois SimpleLock doesn't actually include any kind of assertion to check that this hasn't happened either. I'm not sure why that is.

bozhiyou commented 4 years ago

Interesting, but how can threads be "unlocking locks they never held"? I've never looked deep into the threadpool code though. Thanks for figuring that out!

On Tue, Jul 21, 2020 at 22:21 Ian Henriksen notifications@github.com wrote:

Okay, I've been mulling this over while doing other things. Here's another idea on what could be causing the unlock assertion failure: threads are unlocking locks they never held to begin with and we don't see the failure until the thread that actually should have kept ownership of the lock runs its unlock operation. I just checked the logs of which thread acquired and released which lock and they seem to support this theory.

Interestingly enough, the existing Galois SimpleLock doesn't actually include any kind of assertion to check that this hasn't happened either. I'm not sure why that is.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/IntelligentSoftwareSystems/Galois/issues/349#issuecomment-662218744, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE6H5773EDZ3OZP7TQM2QO3R4ZLKDANCNFSM4PCYEZFA .

insertinterestingnamehere commented 4 years ago

My cursory reading of the SimpleLock code indicates that, at least for now, it doesn't track which thread acquired the lock, so any thread can come along and call "unlock" and it the call will run "successfully". All the assertion does is check that the lock is held, not that the thread unlocking it actually holds it.