E3SM-Project / transport_se

Atmosphere transport mini app

BUG: miniapp crashes when running with more threads than elements #10

Closed · mt5555 closed this issue 9 years ago

mt5555 commented 9 years ago

mini-app should only launch threads that have a non-zero number of elements.

halldm2000 commented 9 years ago

this bug is not confined to the mini-app. my tests show that HOMME also exits without producing output when the number of threads exceeds the number of elements.

worleyph commented 9 years ago

See also ACME GitHub issue #231. While I only saw this affecting diagnostic output, the issue could have wider implications. Perhaps it is related.

mt5555 commented 9 years ago

I think there are 3 solutions to this. The first, which is my preference, is to just change HOMME's "NThreads". This is set by the namelist, or passed in from CAM, and can be thought of as the maximum number of threads allowed (i.e. OMP_NUM_THREADS). We can just lower it to the correct value during initialization.
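
A minimal sketch of this first option, assuming NThreads and nelemd are both available during initialization:

```fortran
! Sketch only: clamp the thread count to the element count during
! initialization, so no thread is ever launched without elements.
if (NThreads > nelemd) NThreads = nelemd
call omp_set_num_threads(NThreads)
```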

The second option follows recent changes in HOMME to keep NThreads and allow for a second variable, n_domains (I haven't looked at the code, but I'm deducing this from Pat's comments in https://github.com/ACME-Climate/ACME/issues/231). But when n_domains < NThreads, we have some problems, because a lot of code (such as the diagnostics) may still assume NThreads threads and hence break (as in issue #231). So this option would be to go through the code looking for NThreads, understand whether it should be using n_domains instead, and fix it.

A third option: in code like the diagnostics, I think Nthreads is being used in a way that would be better done with OMP directives. I.e., counting threads that enter a region until the count reaches NThreads; shouldn't that instead be done with an OMP BARRIER?
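
A hypothetical reconstruction of the two patterns being contrasted here (the counter and finalize_diagnostic are stand-ins, not the actual diagnostics code):

```fortran
! Existing pattern (assumed): each thread bumps a shared counter in a
! critical section; the last arrival finalizes the diagnostic.
!$OMP CRITICAL
count = count + 1
if (count == NThreads) call finalize_diagnostic()   ! hypothetical helper
!$OMP END CRITICAL

! Suggested alternative: let the runtime synchronize the team instead.
!$OMP BARRIER
if (omp_get_thread_num() == 0) call finalize_diagnostic()
```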

worleyph commented 9 years ago

> I.e., counting threads that enter a region until the count reaches NThreads; shouldn't that instead be done with an OMP BARRIER?

I don't think that this works. Any idle threads would never enter this section of the code (I believe), so the OMP_BARRIER would cause a hang.

mt5555 commented 9 years ago

HOMME calls "omp_set_num_threads()". So I think even if OMP_NUM_THREADS is set to 16 (for example), HOMME does not have to use all 16; it can activate a smaller number.
So I was thinking OMP BARRIER would only apply to the number of threads defined by omp_set_num_threads()?
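
For reference, a minimal sketch of the OpenMP semantics in question: omp_set_num_threads() affects only parallel regions started after the call, and a barrier synchronizes only the team of its enclosing region.

```fortran
call omp_set_num_threads(n_domains)   ! e.g. n_domains = min(NThreads, nelemd)
!$OMP PARALLEL
! this team has n_domains threads; the barrier waits for exactly those
!$OMP BARRIER
!$OMP END PARALLEL
```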

worleyph commented 9 years ago

I think the issue is that any thread with id >= ndomains would never get into this section of the code, so OMP_BARRIER would still hang, because the error occurs when NThreads > ndomains. This is what causes the error in the diagnostic routine now: there are only ndomains threads with assigned work, yet the logic assumes there are NThreads. The threads that have no work never reach this code, so the "count" is wrong and the answer is screwed up. With OMP_BARRIER, the threads with id >= ndomains still would not reach this code, causing the hang.

worleyph commented 9 years ago

Perhaps we are not talking about the same thing. Are you talking about combining OMP_BARRIER with using omp_set_num_threads(ndomains)? I thought that using OMP_BARRIER was one of your solutions instead of changing the number of active threads.

mt5555 commented 9 years ago

The code looks like this:

```fortran
n_domains = min(Nthreads, nelemd)
call omp_set_num_threads(n_domains)
```

So I think an OMP BARRIER would work (as opposed to counting threads until the count gets to Nthreads).

worleyph commented 9 years ago

Thanks. I didn't realize that "call omp_set_num_threads(n_domains)" was already in there. I guess that a critical section was considered less of a performance bottleneck than a hard synchronization.

mt5555 commented 9 years ago

looking some more at the code: I see a lot of problems in the standalone HOMME driver (that would not be present when running in CAM). There are lots of uses of "Nthreads" that need to be replaced by omp_get_num_threads() (or n_domains, but it looks like that is a local variable and not available outside prim_init1).

update: there is a global variable, "nthreadshoriz". The introduction of #ifdef's to allow for threading in the horizontal or vertical (but not both at the same time), and the creation of nthreadshoriz, is probably what broke our ability to run with more threads than elements.

mt5555 commented 9 years ago

here's my vote for the solution:

search for all uses of "nthreads". If we are in a threaded region, change it to omp_get_num_threads(). If we are using "nthreads" to start threads over elements, use "nthreadsHoriz" (see the sketch below).

Any other code that uses "nthreads" should be recoded to use OMP directives instead.
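
A hedged sketch of what those replacements might look like (the decompose_elements interface is a hypothetical stand-in, not the actual HOMME code):

```fortran
! Starting threads over elements: use the horizontal thread count.
!$OMP PARALLEL NUM_THREADS(nthreadsHoriz), PRIVATE(ithr, nthr, nets, nete)
ithr = omp_get_thread_num()

! Inside the threaded region: ask the runtime for the team size
! instead of trusting the namelist value "nthreads".
nthr = omp_get_num_threads()
call decompose_elements(1, nelemd, nthr, ithr, nets, nete)   ! hypothetical helper
!$OMP END PARALLEL
```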

jedbrown commented 9 years ago

So to be clear, it sounds like you envision that the dynamics will always be within omp parallel num_threads(num_active_dynamics_threads). There will never be a broader omp parallel num_threads(total_num_threads) with only some of those threads working on dynamics (the others idle or doing something else). Note that there is potentially synchronization and parallelism overhead to this strategy, but it may well be tiny compared to the overhead we're currently bearing with critical sections and the like.

mt5555 commented 9 years ago

That's correct, for our "horizontal openMP" implementation.

We also support a loop-level openMP option (set at compile time, mutually exclusive with the horizontal openMP option). For that option it might make sense to allow different regions to use different numbers of threads, but currently it always uses the maximum number of threads.

jedbrown commented 9 years ago

My concern is that if we assume the semantics of the "horizontal OpenMP implementation", we'll need to revisit this if we do something more flexible. My understanding of this code is that it's not an issue for loop-level OpenMP because this code is executed at a higher level.

mt5555 commented 9 years ago

In that case, I think the proposal in my comment two back from this one is the correct way to go: for horizontal openMP, we need to always start nthreadsHoriz threads, as that is the only thing that makes sense. But for all other OMP operations we rely on OMP directives and assume nothing about how many threads are running. This should be the most flexible.

halldm2000 commented 9 years ago

decompose code was simplified and is robust to having more threads than elements:

```fortran
l = end - start + 1          ! get length
q = l / nthr                 ! get quotient
r = mod(l, nthr)             ! get remainder
n(:)   = q                   ! set q elements per thread
n(1:r) = q + 1               ! add extra element to first r threads
domain%start = start + sum( n(1:i) )
domain%end   = domain%start + n(i+1) - 1
```
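
A self-contained sketch of this decomposition (driver and variable names assumed, not the actual HOMME interface), showing that threads beyond the element count simply receive an empty range with end < start:

```fortran
program decompose_demo
  implicit none
  integer, parameter :: first = 1, last = 5, nthr = 8   ! 5 elements, 8 threads
  integer :: n(nthr), l, q, r, i, dstart, dend

  l = last - first + 1        ! number of elements
  q = l / nthr                ! quotient
  r = mod(l, nthr)            ! remainder
  n(:)   = q                  ! q elements per thread
  n(1:r) = q + 1              ! first r threads get one extra

  do i = 0, nthr - 1
     dstart = first + sum(n(1:i))
     dend   = dstart + n(i+1) - 1   ! dend < dstart means no work
     print '(a,i2,a,i3,a,i3)', 'thread ', i, ': elements ', dstart, ' to ', dend
  end do
end program decompose_demo
```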

halldm2000 commented 9 years ago

prim_driver: the old code n_domains = min(nthreads, nelemd) caused havoc when using more threads than elems. I got crashes all over the place in normal MPI calls (mpi_barrier, mpi_type_contiguous, etc.), probably stack corruption.

determined that n_domains is no longer necessary. it used to be a switch between omp in the vertical and omp in the horizontal. that's not how the code works anymore.

eliminating n_domains and using nthreads throughout fixes the code for all horizontal and vertical thread decompositions. tested on ne8 case with mppwidth=768.

this is how the code is currently used in prim_main:

```fortran
!$OMP PARALLEL NUM_THREADS(nthreads), DEFAULT(SHARED), PRIVATE(ithr,nets,nete,hybrid)
call omp_set_num_threads(vert_num_threads)
```

NUM_THREADS sets the number of horizontal threads and omp_set_num_threads assigns several nested threads to each of those.
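
A minimal sketch of this nested scheme, assuming nested parallelism has been enabled (names taken from the snippet above):

```fortran
call omp_set_nested(.true.)       ! nested parallelism must be on
!$OMP PARALLEL NUM_THREADS(nthreads), DEFAULT(SHARED), PRIVATE(ithr)
ithr = omp_get_thread_num()       ! one of nthreads horizontal threads
call omp_set_num_threads(vert_num_threads)
!$OMP PARALLEL
! vertical work: nthreads * vert_num_threads threads are active here
!$OMP END PARALLEL
!$OMP END PARALLEL
```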

namelist variable Nthreads should probably be changed to n_threads_horizontal to reflect its current usage...