Closed bozhiyou closed 4 years ago
This issue only exists under release build. Debug build works.
Thanks for looking in to this. Did you mean to say that this only shows up in a Debug build? The traceback is for an assertion failure, and those can only happen in debug builds.
My current best guess for what's going on is that there may be some double acquiring/releasing of a given lock going on and that the current locks in Galois aren't recursive. Depending on when both release calls happen, this could be harmless, but still trigger the assertion failure that we're seeing.
Did you mean to say that this only shows up in a Debug build? The traceback is for an assertion failure, and those can only happen in debug builds.
Yes, my mistake.
My current best guess for what's going on is that there may be some double acquiring/releasing of a given lock going on and that the current locks in Galois aren't recursive. Depending on when both release calls happen, this could be harmless, but still trigger the assertion failure that we're seeing.
Most likely. It's fine as long as harmless.
FWIW, it looks like this probably isn't harmless since there are a bunch of correctness errors that still show up in release mode. I'm trying out logging the locks/unlocks to see what's going on.
If the correctness errors you've seen are just overflow of error bounds, that should not be caused by the unlock issue and you can just tune the option -e to a higher bound. This bound is not adaptive for now (2e-6) but should really be input dependent, and I'm not sure what's the reasonable bound in terms of algebra (max distance between cells or something?).
On Tue, Jul 21, 2020 at 16:47 Ian Henriksen notifications@github.com wrote:
FWIW, it looks like this probably isn't harmless since there are a bunch of correctness errors that still show up in release mode. I'm trying out logging the locks/unlocks to see what's going on.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/IntelligentSoftwareSystems/Galois/issues/349#issuecomment-662123400, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE6H577PIM4TYQ5G2GZRRJLR4YEGVANCNFSM4PCYEZFA .
Even for a really huge input I'd be surprised if there were any correctness errors above 1E-6 and we're still seeing errors above 0.1. The error goes way down in the sequential case too. I'm surprised there's any error at all. Is the error reporting just from the local check at each node?
This is also concerning because the error is on the order of 1E-8 in the sequential case. The result should be completely deterministic regardless of the scheduling strategy. I think we're either seeing some kind of memory corruption or the adaptive OBIM scheduler is dropping work items somehow.
New info: logging the lock and unlock operations shows that the corresponding lock operation did actually occur, but somehow the lock is getting unlocked before it should anyway.
Yes, the sanity check is local at each node. Just put the final result back into the upwind formulation and compute the error.
Let me check if anything goes wrong elsewhere. For tests you can remove the prefix Adaptive from OderedByIntegerMetric in fmm.cpp (only one match).
On Tue, Jul 21, 2020 at 18:01 Ian Henriksen notifications@github.com wrote:
Even for a really huge input I'd be surprised if there were any correctness errors above 1E-6 and we're still seeing errors above 0.1. The error goes way down in the sequential case too. I'm surprised there's any error at all. Is the error reporting just from the local check at each node?
This is also concerning because the error is on the order of 1E-8 in the sequential case. The result should be completely deterministic regardless of the scheduling strategy. I think we're either seeing some kind of memory corruption or the adaptive OBIM scheduler is dropping work items somehow.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/IntelligentSoftwareSystems/Galois/issues/349#issuecomment-662149396, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE6H57ZM2N5L6EFM6X3TVILR4YM35ANCNFSM4PCYEZFA .
Yah, I'm seeing similar errors with just ordinary OBIM, so it looks like there's still some sort of lingering correctness issue with the fast marching code that's distinct from the assertion failure we're seeing here. That's nice since we can debug the two problems separately. Nothing about this assertion failure seemed likely to cause correctness errors.
Okay, I've been mulling this over while doing other things. Here's another idea on what could be causing the unlock assertion failure: threads are unlocking locks they never held to begin with and we don't see the failure until the thread that actually should have kept ownership of the lock runs its unlock operation. I just checked the logs of which thread acquired and released which lock and they seem to support this theory.
Interestingly enough, the existing Galois SimpleLock doesn't actually include any kind of assertion to check that this hasn't happened either. I'm not sure why that is.
Interesting, but how can threads be "unlocking locks they never held"? I've never looked deep into the threadpool code though. Thanks for figuring that out!
On Tue, Jul 21, 2020 at 22:21 Ian Henriksen notifications@github.com wrote:
Okay, I've been mulling this over while doing other things. Here's another idea on what could be causing the unlock assertion failure: threads are unlocking locks they never held to begin with and we don't see the failure until the thread that actually should have kept ownership of the lock runs its unlock operation. I just checked the logs of which thread acquired and released which lock and they seem to support this theory.
Interestingly enough, the existing Galois SimpleLock doesn't actually include any kind of assertion to check that this hasn't happened either. I'm not sure why that is.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/IntelligentSoftwareSystems/Galois/issues/349#issuecomment-662218744, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE6H5773EDZ3OZP7TQM2QO3R4ZLKDANCNFSM4PCYEZFA .
My cursory reading of the SimpleLock code indicates that, at least for now, it doesn't track which thread acquired the lock, so any thread can come along and call "unlock" and it the call will run "successfully". All the assertion does is check that the lock is held, not that the thread unlocking it actually holds it.
The corresponding
try_lock()
is at AdaptiveObim.h#L529 WRT theunlock()
at AdaptiveObim.h#L417.