HPCE / hpce-2017-cw6

2 stars 17 forks source link

ClosePairs snpair_mNP2S_Flag #32

Open ENZO-F opened 6 years ago

ENZO-F commented 6 years ago

The _snpair_mNP2SFlag flag is marked as safe for concurrency in snpair.cpp:91, but if the tests for the Crush battery are executed such that each j2 value is done in parallel, can we not end up in a situation where the single test that requires _snpair_mNP2SFlag=FALSE is executed first then another thread sets _snpair_mNP2SFlag=TRUE before the first test is finished?

m8pple commented 6 years ago

Yes, you're right. I'd noticed that the flag only affected printing, so (incorrectly) assumed it would only be flipped on a per-program basis, not within a battery. I should have more carefully grepped out the uses.

It never showed up on the long term multi-threaded stress tests, as the printing was always disabled for those, so the p-values at the end always matched up exactly. However, if people flip the flags, it could result in output that looks different (though due to the garbling of stdout during multi-threaded execution, it is difficult to tell).

So this is a real bug, and is actually on the live battery path taken in the tests. It is also a genuine concurrency bug, that will cause non-determinism within an otherwise safe task-based program.

On the other hand, it does not affect the actual p-values, so is less a correctness issue. It also requires people to flip the display flag (though I know that people are trying that to explore), and then to go further and somehow isolate the stdout from separate tests (though I've seen some people play with ways of doing that).

So I'll assign a group bounty of 1% to this - it's not important for correctness, but it is intimately linked with the parts of code that people will need to look at.

I've prepared a fix on another branch, but will need to think carefully before merging, as there is a danger that it confuses people while fixing a bug that will never be exercised.

Thanks!