Larofeticus opened 6 years ago
@Larofeticus Thanks! It is really great to nab these pesky problems, and this one has been long-standing. You have two commits, but only need the second one. Please do this on your local branch:
git fetch mpas
git reset --hard mpas/develop
git cherry-pick ca205c9
git push Larofeticus timer_race_fix --force
Here, mpas is the alias for the MPAS-Dev remote and Larofeticus is the alias for your remote fork; you probably have different alias names.
I did something...? Maybe that was it?
I've entered a weird new stage of Git ignorance where: I know the nouns, I know what my state is now, and I know exactly what I want to change it to...
but I have no idea what sort of black magic commands actually do exactly and only those changes, so I end up with something that's close but with side effects smeared all over.
I did a reset and cherry-pick to clean up the commit, as described above at https://github.com/MPAS-Dev/MPAS/pull/1462#issuecomment-350324560
@Larofeticus could you verify that this still occurs on edison? I just tested on cori-knl and cori-haswell with two threads, and the timer output all appears OK. You can test the head of ocean/develop at commit fe6f219e8 with this file:
/global/cscratch1/sd/mpeterse/runs/tar_files/c71j_171212_QU240_performance_MPASO_hash_fe6f219e8.tar
then:
cd c71j/ocean/global_ocean/QU240/performance_test/forward/
link your executable, and run.
I just ran a 2 thread job on edison with fe6f219e8089af402441bc258cceb46b66f31d4b using the input from that tar and the corrupted timer data is present.
Perfect. Can you add this thread barrier to that last test and retest, just to be pedantic?
Added barrier, problem not present.
Don't forget this was a randomized race condition, so a single run doesn't guarantee correctness, but yeah, I agree it's probably fine considering all the tests from other platforms too.
I did some before/after tests on two threads on my local cluster and did not see any effect on overall time. @Larofeticus how careful do you think we need to be about the possibility of this slowing down the code?
There are timers in essentially every subroutine, but they are required to be outside loops over cells or edges (sketched below), so probably no harm done. Could you test for slowdown with this EC60to30 file on cori-knl, using multiple threads and multiple nodes?
/global/cscratch1/sd/mpeterse/runs/tar_files/c71k_171212_EC60to30_MPASO_hash_fe6f219e8.tar
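For context on where those timer calls sit, here is a minimal sketch of that convention: the timers bracket a whole cell loop rather than sitting inside it, so the extra barrier is paid once per subroutine call, not once per cell. The subroutine, field names, and loop bounds below are made up for illustration; only the mpas_timer_start/mpas_timer_stop placement is the point, and it assumes an enclosing OpenMP parallel region as in MPAS threading.

! Illustrative only: hypothetical tendency routine showing the timer
! placement convention. compute_tendency, div, tend, and nCellsOwned
! are placeholder names, not actual MPAS-O code.
subroutine compute_tendency(nCellsOwned, div, tend)
   use mpas_timer, only : mpas_timer_start, mpas_timer_stop
   implicit none
   integer, intent(in) :: nCellsOwned
   real (kind=8), dimension(:), intent(in) :: div
   real (kind=8), dimension(:), intent(inout) :: tend
   integer :: iCell

   ! the timer calls wrap the whole loop, so the added barrier in the
   ! start timer costs one synchronization per call, not one per cell
   call mpas_timer_start('hypothetical tendency')
   !$omp do schedule(runtime)
   do iCell = 1, nCellsOwned
      tend(iCell) = tend(iCell) - div(iCell)
   end do
   !$omp end do
   call mpas_timer_stop('hypothetical tendency')
end subroutine compute_tendency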
Thanks. I'm just hesitant to put in an explicit thread barrier that is called in every subroutine without performance testing.
The timers need this barrier to be correct, though. Logically, if these extra calls were a performance priority, there would already be a mechanism to disable them.
This is probably not measurable above noise without absurdly strong scaling, like 20 cells per rank or something like that.
This is about the second issue mentioned by @mark-petersen in #1289.
I noticed around 6 months ago that when running with 2+ threads, certain timer phases in the timer report have obviously incorrect results: max/total/average times orders of magnitude larger than the actual run time, often overflowing the format and printing just ***'s. I took a quick look, didn't see an obvious problem, and moved on because the timers are just a convenience.
I returned to this problem recently while I wait for my more important PRs to worm through the sausage factory.
It's not a problem with the reduction; the corruption happens when a very large time value is added, and only the first call to start/stop('noun') has the corrupt value. The master thread is always correct; other threads may have the problem. Running 30to10 with 8k ranks, each rank randomly shows the corruption with a probability of ~0.02.
It's fixed by adding a thread barrier before the master-only code block in the start-timer routine that looks up timer tree nodes and creates new ones if they are not found. I can't explain exactly why. It's related to the master thread falling behind the others, such as when two timer calls are separated by only a block of master-only code, which explains why only certain report lines can have the problem.
Non-master threads call wtime to fill start_time(:), and somehow the master thread's setup_timer conditional then overwrites it to zero. Because times are relative to an epoch, 1.5 billion - 0 = 1.5 billion, and that's where the absurd value comes from. I was able to confirm that setup_timer(noun) correctly triggers once for each noun.
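To make the shape of the fix concrete, here is a stripped-down sketch of the start-timer pattern, not the actual mpas_timer code; setup_timer, start_time, and the tree walk are placeholder names standing in for the real internals. The only change from the current behavior is the barrier in front of the master-only setup block.

! Sketch only: simplified stand-in for the threaded start-timer routine.
subroutine timer_start(timer_name)
   use omp_lib
   implicit none
   character(len=*), intent(in) :: timer_name
   integer :: tid

   ! the fix: a master thread still busy in an earlier master-only
   ! block can no longer race the start_time writes below
   !$omp barrier

   !$omp master
   ! walk the timer tree for timer_name; if no node exists yet, allocate
   ! one and zero its fields (the write that was clobbering start_time)
   call setup_timer(timer_name)
   !$omp end master
   !$omp barrier

   ! each thread records its own start time relative to the epoch; a
   ! zeroed entry later shows up as stop_time - 0 ~= 1.5 billion seconds
   tid = omp_get_thread_num() + 1
   start_time(tid) = omp_get_wtime()
end subroutine timer_start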
This problem itself only affects a convenience feature, but it suggests there could be similar threading problems lurking in other systems that are just as difficult to analyze but without obvious *** warning flags. A cell field incorrectly being initialized to 0 when it should be 0.048 at the first time step is not going to be obvious when looking at the outcome.
Another concern: just as surrounding code affects the race in the timers, the thread barriers inside timer calls could be necessary for correct behavior in the surrounding code. Running with timers turned off could expose new race conditions.
This is only a bug fix. It will perform a tiny bit slower because there's one more thread barrier per start_timer call.