In subsequent tests, only the diagnostic output is affected by the bug, so perhaps this is innocuous, if annoying. I'm still worried that something serious might be tickled in future versions of the model if this is not fixed for real. Being able to spawn more threads than there is work in the dynamics is likely to continue to be useful for performance reasons, though nested OpenMP might require rewriting this logic anyway, so perhaps it will be fixed as a side effect of this other work.
I don't think diagnostics are negotiable, FWIW. I also hate the O(NThreads) critical path for the present implementation. As for solutions, it seems like we want a concept like a "subcommunicator" so we can identify which threads are participating and perform a scalable reduction. Does this issue really not appear anywhere else in the code?
Does this issue really not appear anywhere else in the code?
I was worried, but this is so basic I can't imagine that it would not be obvious if the code was broken elsewhere in the same way. That said, I have no way of proving this. The HOMME developers surely (?) would have noticed this during their own development and testing. Using more threads than there is work in the dynamics has been common practice for the other dycores (FV and Eulerian) for a long time, and I would have assumed that it would have shown up before now for SE as well, though it is not as important for SE.
Waving my arms here. @mt5555 has a couple of different suggestions for fixing this. I'd prefer that something trickle down from the HOMME repository, but perhaps we need to do something ourselves in the meantime. I can still add the error test, but will only do so if someone in the atmosphere group requests that I do this.
I think once David fixes: https://github.com/ACME-Climate/transport_se/issues/10, that fix can be backported to HOMME
FYI, the version of HOMME in cesm1_2_2 has very different logic. In prim_driver_mod.F90:
! =================================================================
! Set number of domains (for 'decompose') equal to number of threads
! for OpenMP across elements, equal to 1 for OpenMP within element
! =================================================================
n_domains = Nthreads
#if (defined ELEMENT_OPENMP)
n_domains = 1
#endif
and
call omp_set_num_threads(NThreads)
Also, I just looked again, and the critical section in the min and max routines also has an OMP_BARRIER surrounding it. Not sure what the motivation was for this implementation.
In reduction_mod.F90, I propose we use a formulation like this:
!$OMP MASTER
! reduction array initialization code
!$OMP END MASTER
!$OMP BARRIER

!$OMP CRITICAL
! compute max over values contained by all threads
!$OMP END CRITICAL
!$OMP BARRIER
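A minimal fleshed-out version might look like the following (par_max, red_buf, and the MPI details here are illustrative sketches, not the actual reduction_mod.F90 interfaces):

! Illustrative sketch only; names do not match the real reduction_mod.F90.
! Must be called by every thread of a single-level parallel team.
subroutine par_max(local_val, global_max, comm)
  use mpi
  implicit none
  real(8), intent(in)  :: local_val   ! this thread's contribution
  real(8), intent(out) :: global_max
  integer, intent(in)  :: comm
  real(8), save :: red_buf            ! saved => shared across the team
  real(8) :: tmp
  integer :: ierr

  !$OMP MASTER
  red_buf = local_val                 ! initialize from real data, no -9999 sentinel
  !$OMP END MASTER
  !$OMP BARRIER

  !$OMP CRITICAL (par_max_crit)
  red_buf = max(red_buf, local_val)   ! each thread folds in its value
  !$OMP END CRITICAL (par_max_crit)
  !$OMP BARRIER

  !$OMP MASTER
  call MPI_Allreduce(red_buf, tmp, 1, MPI_REAL8, MPI_MAX, comm, ierr)
  red_buf = tmp                       ! publish the global result
  !$OMP END MASTER
  !$OMP BARRIER

  global_max = red_buf                ! every thread reads the same result
  !$OMP BARRIER                       ! keep a later call's init from racing ahead
end subroutine par_max

Note that nothing here depends on the number of threads in the team: whoever shows up participates, and the buffer is initialized from real data.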
Such a structure avoids any knowledge of the number of threads, and thus would fix our current bug in a general way (with the same number of barriers).
And to ensure this is fixed throughout the code, I propose we remove "nthreads" from the hybrid struct.
I'll have to see your code first :-), but looks reasonable.
I just looked at the second logic bug that I pointed out in my original issue description, and I must have misread the code (or something has changed since I looked at it). ipe is 0:n_domains-1, not 0:NThreads-1, in the call to decompose, so this should be fine?
@mt5555 omp critical is non-deterministic because the order is unspecified and all threads compete for a lock (so likely worse performance). I would prefer omp ordered, which is deterministic and passes a token instead of competing for a lock. Alternatively, we could get better parallelism using a log-depth reduction without locks. (I can implement that if you like. It's insane that OpenMP still doesn't have native support for such a fundamental operation.) Anyway, when using omp ordered, you can elide the first omp barrier because the master thread is always thread id 0 and thus the first to execute.
Note that if the highest-level omp parallel region contains more threads than are active here, we must use a nested region so that the omp barrier operates only on the active threads. I assume this will be no great hardship, but I consider scoping of parallel constructs to be a serious weakness of OpenMP and am thus wary about the assumptions that end up being made.
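Roughly, that nesting would look like this (dyn_work_and_reduce is a placeholder for whatever the active threads do, including the reduction; the point is that barriers bind to the innermost enclosing team):

! Illustrative only; needs omp_lib and nested parallelism (OpenMP >= 3.0).
call omp_set_max_active_levels(2)

!$OMP PARALLEL NUM_THREADS(NThreads)
! ... code that genuinely uses all NThreads threads ...
!$OMP MASTER
! The inner team contains only the n_active working threads, so an
! omp barrier inside dyn_work_and_reduce() never waits on idle threads.
!$OMP PARALLEL NUM_THREADS(n_active)
call dyn_work_and_reduce()
!$OMP END PARALLEL
!$OMP END MASTER
!$OMP BARRIER
!$OMP END PARALLEL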
This particular code is only doing a MAX(), and only for log file output. So I think it will be BFB in any order, hence I thought omp critical would be slightly better?
CAM's global reductions (sums) are much more performance critical, but that is a complex piece of code developed over many years that I'm not familiar with - I think Pat was one of the key developers. It's unrelated to this bug, but perhaps Pat could comment on whether that could benefit from a log-depth reduction without locks?
True about max being order-independent, but I would prefer being able to use the same pattern for all reductions and would expect omp ordered to perform better because passing a token is generally cheaper than competing for a centralized lock.
So something like this? Can we assume the MASTER is always the first thread allowed into an OMP ORDERED block?
!$OMP MASTER
! reduction array initialization code
!$OMP END MASTER

!$OMP ORDERED
! compute max over values contained by all threads
!$OMP END ORDERED
!$OMP BARRIER
Yes, that's what I had in mind.
It's not guaranteed that the master thread will execute the first iteration of the ordered sequence.
So we have a contradiction here:
"...because the master thread is always thread id 0 and thus the first to execute."
and
"It's not guaranteed that the master thread will execute the first iteration of the ordered sequence."
Should we get to the bottom of this, or just throw in another OMP BARRIER to be sure?
If it doesn't matter which thread performs initialization, omp single has an implicit barrier at the end, which can then be followed by omp critical.
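In sketch form (same illustrative names as in the earlier sketch):

!$OMP SINGLE
red_buf = local_val                 ! whichever thread arrives first initializes
!$OMP END SINGLE
! implicit barrier at END SINGLE: red_buf is set before any thread proceeds

!$OMP CRITICAL (par_max_crit)
red_buf = max(red_buf, local_val)
!$OMP END CRITICAL (par_max_crit)
!$OMP BARRIER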
Back to the reproducibility question ... anything we do here will not be reproducible (for operators such as SUM) with respect to changes in process or thread count. The "ordered" option (if it works the way that Jed proposes, and not Az) would only support repeatability, and then only if the MPI_Allreduce used the reproducible MPI_Allreduce option (which is specified as an environment variable, and we have no control of this inside the code?) or if we wrote our own allreduce. This is the reason for the complexity of the existing reproducible sum code.
Because of this I do not believe that it is worthwhile jumping through too many hoops to come up with a general solution if something simple is sufficient for MAX/MIN and for diagnostics. There are plenty of other issues in the code that it would be more profitable for us to focus on.
@amametjanov omp single has to be implemented using a lock whereas omp master does not, thus omp master should be faster than omp single nowait unless the master thread happens to be lagging behind. This is a load balancing question.
omp critical is bad because it involves competition for a lock and because the pattern is nondeterministic if copied/generalized to a non-associative operation like FP addition. Meanwhile, one could use omp ordered and merely have the first loop index copy instead of combine. That actually involves fewer traversals of memory and avoids the nonsensical initialization to -9999 (which will give wrong results if used to compute the max of suitably negative numbers, so I would prefer it).
@worleyph Note that the MPI standard strongly recommends that MPI_Allreduce be implemented to be repeatable even if that has a performance cost. That doesn't prohibit implementations from taking shortcuts, but the wording is very explicit. I believe that a single omp ordered loop with copy on the first loop iteration is the shortest and simplest code for this purpose.
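Concretely, since omp ordered must sit inside a loop carrying the ordered clause, I am picturing something like this (illustrative names; assumes the team consists exactly of the participating threads):

! Illustrative sketch; assumes we are inside the parallel region,
! red_buf is shared, and local_val is each thread's private value.
! schedule(static,1) hands thread t iteration t+1; ORDERED serializes
! the bodies in iteration order, so the combine is deterministic.
!$OMP DO ORDERED SCHEDULE(STATIC,1)
do i = 1, omp_get_num_threads()
   !$OMP ORDERED
   if (i == 1) then
      red_buf = local_val            ! first iteration copies: no sentinel init
   else
      red_buf = max(red_buf, local_val)
   end if
   !$OMP END ORDERED
end do
!$OMP END DO    ! implicit barrier: red_buf now holds the team-wide max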
@jedbrown
MPI standard strongly recommends that MPI_Allreduce be implemented to be repeatable
Not the default on Titan I don't believe. The shared memory optimizations tend to play fast and loose with this. Sounds like your proposal is as simple as anything else being proposed, so I am not arguing against it.
If we want a deterministic pattern, then yes, I agree with omp master + omp ordered. For performance, IMHO, critical is better. There is a lot more that goes into ordered scheduling than obtaining a lock for a critical region.
@amametjanov I figured this was something worth testing, so I measured, and unfortunately I see wins of up to a factor of 2 in either direction depending on compiler, array size, and number of threads. Different compilers sometimes show opposite dependence of performance on parameters.
My summary: whatever we do here, it will never be an appropriate template for an ACME global sum, so we don't have to worry about setting a good example in that respect. We should instead choose the one that is conceptually the simplest. To me, that is the OMP CRITICAL version, because it avoids the ambiguity (and conflicting statements) around how OMP ORDERED works.
In either case, the initialization to -9999 is crazy and can be replaced with any value from the input data.
Is this still an issue?
The bugs (array bounds violations) have been fixed when we brought in the latest version of HOMME.
There is some messy & pointless code that counts the number of threads that participate in the max, and initialization with magic numbers that should be removed. Changing this to "minor".
Fixed by #1289.
I've identified what may be a serious bug in HOMME. (It is obviously causing errors in diagnostic output, but the issue could have much wider consequences.) There is a simple workaround however, so we can avoid this particular problem in our development and production runs.
The problem showed up on Mira, where it is useful to have more threads in the atmosphere dynamics than there are grid cells, so some threads will be idle during the dynamics. (This has been the case on other systems in the past, but Mira is high thread count friendly, so we bumped into this scenario here most recently. I have verified identical behavior on Titan.)
The global variable NThreads is set to the thread count specified in env_mach_pes.xml (I assume). In one particular example (ne30_g16, FAMIPC5, 900 processes, 8 threads per process):
Main:NThreads = 8
Main:n_domains= 6
This breaks the logic in the max and min functions in reduction_mod.F90, where threads are let into the critical section until a counter exceeds NThreads, at which point thread 0 calls an MPI_Allreduce. In this case, only 6 threads are active (have assigned work), so the calls to min and max are overlapping, e.g. 6 data from a call to min and the next two data from a call to max are "reduced" together before the call to the Allreduce. (I'm surprised that this does not cause a hang, but print statements have verified this behavior.)
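Schematically (paraphrasing from memory of the source, not quoting it), the pattern is the usual shared-counter idiom:

! PARAPHRASED sketch of the failing logic, not the literal source.
! red_buf and red_ctr are shared; local_val is per-thread.
!$OMP CRITICAL
red_buf = max(red_buf, local_val)   ! fold in this thread's value
red_ctr = red_ctr + 1               ! count participating threads
!$OMP END CRITICAL
! intended: once red_ctr reaches NThreads, thread 0 calls MPI_Allreduce

With NThreads = 8 but only 6 threads entering, red_ctr reaches 8 only after two values from the next min/max call have already been folded into red_buf, which is the overlap described above.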
The max/min operators are the only ones with this particular logic, but there is also code in, for example, decompose in domain_mod.F90 where
...
domain%start=beg(ipe)
domain%end =beg(ipe+1)-1
and ipe is 0:NThreads-1. So this looks to be reading from random memory for thread 7, and could cause problems if thread 7 decides that it has work to do? A HOMME expert will have to comment.
In any case, it LOOKS like we are safe if we do not overdecompose the model. I can add an error abort in case we do that by accident until this is fixed. Comments?
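For concreteness, the guard I have in mind would go in initialization, something like the following (abortmp is HOMME's abort routine; I am assuming nelemd is the right per-process element count to compare against):

! Sketch of the proposed guard; the exact variable names may differ.
if (NThreads > nelemd) then
   call abortmp('Error: more dynamics threads than elements per process; '// &
                'reduce the thread count until this bug is fixed for real')
end if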