MPAS-Dev / MPAS

Repository for private MPAS development prior to the MPAS v6.0 release.

Fix thread race condition corrupting output of mpas native timer #1462

Open Larofeticus opened 6 years ago

Larofeticus commented 6 years ago

This is about the second issue mentioned by @mark-petersen in #1289.

I noticed around 6 months ago that when running with 2+ threads, certain timer phases in the timer report have obviously incorrect results: max/total/average times orders of magnitude larger than the actual run time, often overflowing the format and printing just ***'s. I took a quick look, didn't see an obvious problem, and moved on because the timers are just a convenience.

I returned to this problem recently while waiting for my more important PRs to worm through the sausage factory.

It's not a problem with the reduction; the corruption happens when a very large time value is added, and only the first call to start/stop('noun') has the corrupt value. The master thread is always correct; other threads may have the problem. Running 30to10 with 8k ranks, individual ranks randomly show the corruption with a probability of ~0.02.

It's fixed by adding a thread barrier before the master-only code block in the start timer that looks up timer tree nodes and creates new ones if not found. I can't explicitly explain why. It's related to the master thread falling behind the others, such as when two timer calls are separated by only a block of master-only code, which explains why only certain report lines can have the problem.

Non-master threads call wtime to fill start_time(:), and somehow the master thread's setup_timer conditional then overwrites it to zero. Because times are relative to an epoch, 1.5 billion - 0 = 1.5 billion, and that's where the absurd value comes from. I was able to confirm that setup_timer(noun) correctly triggers exactly once for each noun.
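
To make the failure mode concrete, here is a minimal, self-contained OpenMP sketch of the hazard described above. This is hypothetical illustration code, not the MPAS timer source: the zeroing inside the master block stands in for what setup_timer does when it first creates a node, and the barrier placement mirrors the spirit of the fix (in the actual change the barrier goes before the master-only node-lookup block in the timer start routine).

! Hypothetical sketch of the race, not the MPAS source.
program timer_race_sketch
   use omp_lib
   implicit none
   double precision, allocatable :: start_time(:)
   integer :: nthreads, tid

   nthreads = omp_get_max_threads()
   allocate(start_time(0:nthreads-1))
   start_time(:) = -1.0d0   ! sentinel for "never written"

   !$omp parallel private(tid)
   tid = omp_get_thread_num()

   !$omp master
   ! Stand-in for setup_timer: the first encounter of a timer name
   ! creates the tree node and zero-initializes its fields. Without a
   ! barrier, this can clobber a start time that a faster non-master
   ! thread has already recorded below.
   start_time(:) = 0.0d0
   !$omp end master

   ! The fix: force every thread to wait until the master-only setup
   ! is done before recording a start time. Remove this barrier and
   ! some entries can end up 0, so a later "stop - start" yields
   ! roughly 1.5e9 seconds (the time since the epoch).
   !$omp barrier

   ! Every thread records its own start time, relative to an epoch.
   start_time(tid) = omp_get_wtime()
   !$omp end parallel

   print *, 'per-thread start times:', start_time
end program timer_race_sketch

Built with, say, gfortran -fopenmp and a handful of threads, removing the barrier may only occasionally leave a zeroed entry, which matches the randomized, low-probability nature of the corruption seen in the report.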

This problem itself is still only a convenience issue, but it suggests there could be similar threading problems lurking in other systems that are just as difficult to analyze but lack the obvious *** warning flags. A cell field incorrectly being initialized to 0 when it should be 0.048 at the first time step is not going to be obvious when looking at the outcome.

Another concern is that, just as the surrounding code affects the race in the timers, the thread barriers inside timer calls could be necessary for correct behavior in the surrounding code. Running with timers turned off could expose new race conditions.

This is only a bug fix. It will perform a tiny bit slower because there's one more thread barrier per start_timer call.

mark-petersen commented 6 years ago

@Larofeticus Thanks! It is really great to nab these pesky problems, and this one has been long-standing. You have two commits, but only need the second one. Please do this on your local branch:

git fetch mpas
git reset --hard mpas/develop
git cherry-pick ca205c9
git push Larofeticus timer_race_fix --force

Here mpas is the alias for the MPAS-Dev remote, and Larofeticus is the alias for your remote fork. You probably have different alias names.

Larofeticus commented 6 years ago

I did something...? Maybe that was it?

I've entered a weird new stage of Git ignorance where I know the nouns, I know what my state is now, and I know exactly what I want to change it to...

but I have no idea what sort of black magic commands actually make exactly and only those changes, so I end up with something that's close but with side effects smeared all over.

mark-petersen commented 6 years ago

I did a reset and cherry-pick to clean up the commit, as described above at https://github.com/MPAS-Dev/MPAS/pull/1462#issuecomment-350324560

mark-petersen commented 6 years ago

@Larofeticus could you verify that this still occurs on edison? I just tested on cori-knl and cori-haswell with two threads, and the timer output all appears OK. You can test the head of ocean/develop at commit fe6f219e8 using this file:

/global/cscratch1/sd/mpeterse/runs/tar_files/c71j_171212_QU240_performance_MPASO_hash_fe6f219e8.tar
then:
cd c71j/ocean/global_ocean/QU240/performance_test/forward/

link your executable, and run.

Larofeticus commented 6 years ago

I just ran a 2-thread job on edison with fe6f219e8089af402441bc258cceb46b66f31d4b using the input from that tar, and the corrupted timer data is present.

mark-petersen commented 6 years ago

Perfect. Can you add this thread barrier to that last test and retest, just to be pedantic?

Larofeticus commented 6 years ago

Added barrier, problem not present.

Don't forget this was a randomized race condition, so a single run doesn't guarantee correctness, but yeah, I agree it's probably fine considering all the tests from other platforms too.

mark-petersen commented 6 years ago

I did some before/after tests on two threads on my local cluster and did not see any effect on overall time. @Larofeticus how careful do you think we need to be about the possibility of this slowing down the code?

There are essentially timers in every subroutine, but they are required to be outside loops over cells or edges, so probably no harm done. Could you test for slowdown with this EC60to30 file on cori-knl, using multiple threads and multiple nodes?

/global/cscratch1/sd/mpeterse/runs/tar_files/c71k_171212_EC60to30_MPASO_hash_fe6f219e8.tar

Thanks. I'm just hesitant to put in an explicit thread barrier that is called in every subroutine without performance testing.
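
For context, here is a rough, self-contained sketch of the usage pattern mentioned above: the timer start/stop calls wrap a whole code section and sit outside the loop over cells or edges, so the added barrier is paid once per timed section, not once per cell. The timer name, loop bounds, and stub timer routines below are purely illustrative stand-ins, not the MPAS native timer entry points.

! Illustrative sketch of the timer usage pattern, not MPAS code.
program timer_usage_sketch
   implicit none
   integer :: iCell
   integer, parameter :: nCellsSolve = 1000
   double precision :: work(nCellsSolve)

   ! Timer calls bracket the whole section; nothing inside the cell loop.
   call fake_timer_start('velocity tendency')
   do iCell = 1, nCellsSolve
      work(iCell) = dble(iCell) * 0.5d0   ! per-cell work stand-in
   end do
   call fake_timer_stop('velocity tendency')

   print *, 'sum of work:', sum(work)

contains

   subroutine fake_timer_start(name)
      character(len=*), intent(in) :: name
      print *, 'start timer: ', name
   end subroutine fake_timer_start

   subroutine fake_timer_stop(name)
      character(len=*), intent(in) :: name
      print *, 'stop timer:  ', name
   end subroutine fake_timer_stop

end program timer_usage_sketch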

Larofeticus commented 6 years ago

The timers need this barrier to be correct, though. Logically, if these extra calls were a performance priority, then there would already be a mechanism to disable them.

This is probably not measurable above the noise without going well beyond absurd scaling, like 20 cells per rank or something like that.