Open amametjanov opened 2 years ago
This is a threading problem. Test runs without threading pass without any problem, which suggests the code itself is correct when run serially.
@amametjanov I could reproduce this error. Let me know if you have already filed this issue with Cray. Otherwise, I will create one through the OLCF help desk.
@grnydawn Not yet, please do. Thanks.
It seems that the direct cause of this issue is an uninitialized variable on a particular thread. Someone who knows this code well may need to look at it.
The value of "self%accum_method" in the following code should be either accum_null (0) or accum_mean (1), but on a certain thread the value was "-1" (an uninitialized value).
In "E3SM/components/eam/src/physics/cam/micro_mg_data.F90":

    subroutine MGFieldPostProc_accumulate(self)
      class(MGFieldPostProc), intent(inout) :: self

      select case (self%accum_method)
      case (accum_null)
        ...
      case (accum_mean)
        ...
      case default
        call shr_sys_abort(errMsg(__FILE__, __LINE__) // &
             " Unrecognized MGFieldPostProc accumulation method.")
      end select
@singhbalwinder and @wlin7 Who would be the right contact to look into fixing this threading issue in micro_mg_data.F90?
It may not be related, but I had been tracking down a problem on a different machine with runtime errors in this same source file. I had tests that failed, but only in DEBUG and only with threads in ATM. A work-around in my case was to change the flavor of Fortran ASSERT to avoid the need for a temporary error message string.
Actually, re-reading the original comment, I see your error happens in NON-DEBUG builds, so this is surely not the issue.
@singhbalwinder mentioned this bug to me today. I wrote this module when I worked at NCAR about a decade ago; at that point it was part of an experiment in writing more object-oriented/generic codes using Fortran 2003 features and preprocessing, e.g. to create and use containers kind of like those in the C++ standard library. In some cases that worked out OK, but `micro_mg_data` is probably the least popular piece of code I've ever written. I hear that CESM removed it entirely in recent years.
One big problem with the module is that not all compilers have implemented Fortran 2003 completely or correctly, and in particular I don't think we were supporting the Cray compiler at all when this code was written. Also, I think that the unit test suites that were originally used to check for such compiler issues never made it into E3SM at all, possibly because they only covered a handful of modules like this one, and relied on a particular version of pFUnit that no one wanted to have as a dependency.
Anyway, I don't know for sure what is happening here, but my guess is one of two things:

1. The `post_proc` variable defined here is for some reason being given the `save` attribute, which means that OpenMP treats it as a shared variable. In that case, you can try a couple of fixes:
   a) Add a directive like `!$omp threadprivate(post_proc)` where the variable is declared, which would be the easiest thing to try.
   b) Remove all default initializations from the `micro_mg_data` types, since these may be causing the compiler to incorrectly infer that the `save` attribute applies to every instance of these types. So in particular, you should remove all the default values (including `null()` pointer initializations) from these lines, and probably also the zero size initialization from the container "template" file here.
2. ... `SUMMITDEV_PGI` is defined in the preprocessor.

I think the plan is to replace this with P3. So maybe we will eventually just delete all this micromg code? I'll ask.
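Option (a) from the list of suggested fixes can be sketched in isolation as follows. This is only a minimal sketch, not the actual E3SM change: the module `post_proc_sketch` and the program `threadprivate_demo` are hypothetical names, and a plain integer stands in for the `post_proc` object. It assumes the variable really does carry the `save` attribute, and that the build has OpenMP enabled (for example `gfortran -fopenmp`).

```fortran
! Hypothetical stand-in for the state that micro_mg_data keeps per field.
module post_proc_sketch
  implicit none
  integer, parameter :: accum_null = 0, accum_mean = 1
  ! An initialized module variable has the save attribute, so OpenMP
  ! shares it across threads unless told otherwise.
  integer :: accum_method = accum_null
  ! Give every thread its own copy of the saved variable.
  !$omp threadprivate(accum_method)
end module post_proc_sketch

program threadprivate_demo
  use omp_lib, only: omp_get_thread_num
  use post_proc_sketch
  implicit none

  !$omp parallel
  ! Each thread writes its own copy and then reads it back.
  accum_method = merge(accum_mean, accum_null, mod(omp_get_thread_num(), 2) == 1)
  print '(a,i0,a,i0)', 'thread ', omp_get_thread_num(), ' accum_method = ', accum_method
  !$omp end parallel
end program threadprivate_demo
```

With the directive in place, each thread works on its own copy, so an assignment on one thread cannot change the value another thread reads. Whether this is the right fix for `post_proc` depends on whether the compiler is in fact treating it as saved, which is still a guess at this point.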
@rljacob That would also work fine if you only want to run P3 on this machine/compiler. It will still be a problem if you want to run v2 though, or if you wanted to get OpenMP tests working on master before P3 is actually merged.
I'm assuming that none of the MG code was ever planned to be in v4 anyway. The question at this point is just whether it's worth the effort to remove the code before the switch to EAMxx makes it a moot point anyway.
Yes it would be worth fixing at least for the maint-2.x (and maint-1.x) branches. Short term, we'll only be running P3 cases on crusher/Frontier (SCREAM or MMF) so this doesn't need to work there except to get our test suite passing.
Threaded runs with
and
are erroring out with
Back-trace of core-dumps show
in some runs and in others
Building `micro_mg_data.F90` with `-hipa0 -hzero -O0 -hvector0` is not helping.
Tagging @grnydawn @sarats @abbotts @mattdturner