E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Threading issue with Cray compiler on Crusher #5213

Open amametjanov opened 2 years ago

amametjanov commented 2 years ago

Threaded runs with

./cime/scripts/create_test SMS_P12x2.ne4_oQU240.WCYCL1850NS.crusher_crayclang.allactive-mach_mods

and

  1. PET_Ln5.ne4_oQU240.F2010.crusher_crayclang.allactive-mach-pet
  2. PET_Ln9_PS.ne30pg2_EC30to60E2r2.WCYCL1850.crusher_crayclang.allactive-mach-pet

are erroring out with

 5:  ERROR: ERROR in /gpfs/alpine/cli133/proj-shared/testing/E3SM/components/eam/src/physics/cam/micro_mg_data.F90 at line 405

Back-trace of core-dumps show

(gdb) bt
#0  0x000015554d615ed9 in omp_free () from /opt/cray/pe/cce/14.0.0/cce/x86_64/lib/libcraymp.so.1
#1  0x000015554cf2d3ff in _DEALLOC () from /opt/cray/pe/cce/14.0.0/cce/x86_64/lib/libf.so.1
#2  0x000015554ced93c0 in __single_dealloc () from /opt/cray/pe/cce/14.0.0/cce/x86_64/lib/libf.so.1
#3  0x000015554cee26f0 in alloc_cpnts.constprop () from /opt/cray/pe/cce/14.0.0/cce/x86_64/lib/libf.so.1
#4  0x000015554cee3aa5 in deep_loops.part.0.constprop () from /opt/cray/pe/cce/14.0.0/cce/x86_64/lib/libf.so.1
#5  0x000015554cee3eae in _F90_COPY_POLYMORPHIC () from /opt/cray/pe/cce/14.0.0/cce/x86_64/lib/libf.so.1
#6  0x000000000140f710 in set_single_vec$micro_mg_data_ ()
    at /gpfs/alpine/cli133/proj-shared/azamat/e3sm_scratch/crusher/PET_Ln5.ne4_oQU240.F2010.crusher_crayclang.allactive-mach-pet.JNextIntegration20220928_010621/bld/crayclang/mpich/nodebug/threads/mct/mct/noesmf/c1a1l1i1o1r1g1w1i1e1/include/dynamic_vector_procdef.inc:220
#7  0x000000000141078b in push_back_vec$micro_mg_data_ ()
    at /gpfs/alpine/cli133/proj-shared/azamat/e3sm_scratch/crusher/PET_Ln5.ne4_oQU240.F2010.crusher_crayclang.allactive-mach-pet.JNextIntegration20220928_010621/bld/crayclang/mpich/nodebug/threads/mct/mct/noesmf/c1a1l1i1o1r1g1w1i1e1/include/dynamic_vector_procdef.inc:308
#8  0x0000000001413ea3 in add_field_2d$micro_mg_data_ () at micro_mg_data.F90:507
#9  0x00000000013cdcb6 in micro_mg_cam_tend$micro_mg_cam_ () at micro_mg_cam.F90:1895
#10 0x000000000143a6c8 in microp_driver_tend$microp_driver_ () at microp_driver.F90:193
#11 0x000000000155f9a0 in tphysbc$physpkg_ () at physpkg.F90:2621
#12 0x0000000001551207 in phys_run1$physpkg.cray$mt$p0001 () at physpkg.F90:1075
#13 0x000015554d67aee7 in _cray$mt_start_one_code_parallel () from /opt/cray/pe/cce/14.0.0/cce/x86_64/lib/libcraymp.so.1
#14 0x000015554d67cec5 in __cray$mt_execute_parallel_with_proc_bind () from /opt/cray/pe/cce/14.0.0/cce/x86_64/lib/libcraymp.so.1
#15 0x000015554d67d1e0 in _cray$mt_execute_parallel () from /opt/cray/pe/cce/14.0.0/cce/x86_64/lib/libcraymp.so.1
#16 0x00000000015508d7 in phys_run1$physpkg_ () at physpkg.F90:1060
#17 0x0000000000610c22 in cam_run1$cam_comp_ () at cam_comp.F90:258
#18 0x00000000005fa977 in atm_init_mct$atm_comp_mct_ () at atm_comp_mct.F90:407
#19 0x00000000004525f1 in component_init_cc$component_mod_ () at component_mod.F90:190
#20 0x000000000042f306 in cime_init$cime_comp_mod_ () at cime_comp_mod.F90:1398
#21 0x00000000004507ca in main () at cime_driver.F90:122

in some runs and in others

(gdb) bt
#0  0x000015554c2e0ac1 in __run_exit_handlers () from /lib64/libc.so.6
#1  0x000015554c2e0c9a in exit () from /lib64/libc.so.6
#2  0x000015554b843180 in PMI2_Abort (flag=<optimized out>, error_msg=<optimized out>) at /home/jenkins/src/api/misc/pmi_abort.c:73
#3  0x00001555541e7c32 in MPID_Abort () from /opt/cray/pe/lib64/libmpi_cray.so.12
#4  0x00001555528f21c8 in PMPI_Abort () from /opt/cray/pe/lib64/libmpi_cray.so.12
#5  0x0000155551a8422d in pmpi_abort__ () from /opt/cray/pe/lib64/libmpifort_cray.so.12
#6  0x0000000004c3983f in shr_mpi_abort$shr_mpi_mod_ () at shr_mpi_mod.F90:2127
#7  0x0000000004b3b62f in shr_abort_abort$shr_abort_mod_ () at shr_abort_mod.F90:38
#8  0x0000000001443ecf in mgfieldpostproc_accumulate$micro_mg_data_ () at micro_mg_data.F90:372
#9  0x000000000144c2c1 in mgpostproc_accumulate$micro_mg_data_ () at micro_mg_data.F90:518
#10 0x0000000001414f5a in micro_mg_cam_tend$micro_mg_cam_ () at micro_mg_cam.F90:2257
#11 0x0000000001474840 in microp_driver_tend$microp_driver_ () at microp_driver.F90:193
#12 0x00000000015a56b6 in tphysbc$physpkg_ () at physpkg.F90:2621
#13 0x0000000001596813 in phys_run1$physpkg.cray$mt$p0001 () at physpkg.F90:1075
#14 0x000015554d67aee7 in _cray$mt_start_one_code_parallel () from /opt/cray/pe/cce/14.0.0/cce/x86_64/lib/libcraymp.so.1
#15 0x000015554d67cec5 in __cray$mt_execute_parallel_with_proc_bind () from /opt/cray/pe/cce/14.0.0/cce/x86_64/lib/libcraymp.so.1
#16 0x000015554d67d1e0 in _cray$mt_execute_parallel () from /opt/cray/pe/cce/14.0.0/cce/x86_64/lib/libcraymp.so.1
#17 0x0000000001595f18 in phys_run1$physpkg_ () at physpkg.F90:1060
#18 0x000000000063368d in cam_run1$cam_comp_ () at cam_comp.F90:258
#19 0x000000000061adc3 in atm_init_mct$atm_comp_mct_ () at atm_comp_mct.F90:407
#20 0x000000000045ac36 in component_init_cc$component_mod_ () at component_mod.F90:190
#21 0x0000000000433262 in cime_init$cime_comp_mod_ () at cime_comp_mod.F90:1398
#22 0x000000000045859b in main () at cime_driver.F90:122

Building micro_mg_data.F90 with -hipa0 -hzero -O0 -hvector0 is not helping.

Tagging @grnydawn @sarats @abbotts @mattdturner

xyuan commented 2 years ago

This is a threading problem: the same tests pass without any problem when run without threading, which suggests there is no problem in the code itself.

grnydawn commented 2 years ago

@amametjanov I could reproduce this error. Let me know if you already filed this issue at Cray. Otherwise, I will create one through OLCF help desk.

amametjanov commented 2 years ago

@grnydawn Not yet, please do. Thanks.

grnydawn commented 1 year ago

It seems that the direct cause of this issue is an un-initialized variable on a particular thread. Someone who knows this code well may need to look at this issue.

The value of "self%accum_method" in the following code should be either accum_null (0) or accum_mean (1), but on a certain thread the value was "-1" (an un-initialized value).

In "E3SM/components/eam/src/physics/cam/micro_mg_data.F90"

subroutine MGFieldPostProc_accumulate(self)
  class(MGFieldPostProc), intent(inout) :: self

  select case (self%accum_method)
  case (accum_null)
    ...
  case (accum_mean)
    ...
  case default
    call shr_sys_abort(errMsg(FILE, LINE) // &
         " Unrecognized MGFieldPostProc accumulation method.")
  end select

sarats commented 1 year ago

@singhbalwinder and @wlin7 Who would be the right contact to look at fixing this threading issue/fix in micro_mg_data.F90?

ndkeen commented 1 year ago

It may not be related, but I had been tracking down a problem on a different machine with runtime errors in this same source file. I had tests that failed, but only in DEBUG and only with threads in ATM. A work-around in my case was to change the flavor of the Fortran ASSERT to avoid the need for a temporary error message string.

Actually, re-reading the original comment, I see your error happens in NON-DEBUG builds, so this is surely not the issue.

https://github.com/E3SM-Project/E3SM/issues/5408

quantheory commented 1 year ago

@singhbalwinder mentioned this bug to me today. I wrote this module when I worked at NCAR about a decade ago; at that point it was part of an experiment in writing more object-oriented/generic codes using Fortran 2003 features and preprocessing, e.g. to create and use containers kind of like those in the C++ standard library. In some cases that worked out OK, but micro_mg_data is probably the least popular piece of code I've ever written. I hear that CESM removed it entirely in recent years.

One big problem with the module is that not all compilers have implemented Fortran 2003 completely or correctly, and in particular I don't think we were supporting the Cray compiler at all when this code was written. Also, I think that the unit test suites that were originally used to check for such compiler issues never made it into E3SM at all, possibly because they only covered a handful of modules like this one, and relied on a particular version of pFUnit that no one wanted to have as a dependency.

Anyway, I don't know for sure what is happening here, but my guess is one of two things:

  1. The post_proc variable defined here is for some reason being given the save attribute, which means that OpenMP treats it as a shared variable. In that case, you can try a couple of fixes:
     a) Add a directive like !$omp threadprivate(post_proc) where the variable is declared, which would be the easiest thing to try.
     b) Remove all default initializations from the micro_mg_data types, since these may be causing the compiler to incorrectly infer that the save attribute applies to every instance of these types. In particular, remove all the default values (including null() pointer initializations) from these lines, and probably also the zero-size initialization from the container "template" file here.
  2. There is a problem when copying some of the types. I don't think this is the issue because I don't see why it would cause problems only with threaded runs specifically, but it has been an issue for many compilers in the past. One possible workaround would be to use preprocessor definitions to remove the block here, similar to what's already done when the SUMMITDEV_PGI is defined in the preprocessor.

rljacob commented 1 year ago

I think the plan is to replace this with P3. So maybe we will eventually just delete all this micromg code? I'll ask.

quantheory commented 1 year ago

@rljacob That would also work fine if you only want to run P3 on this machine/compiler. It will still be a problem if you want to run v2 though, or if you wanted to get OpenMP tests working on master before P3 is actually merged.

I'm assuming that none of the MG code was ever planned to be in v4 anyway. The question at this point is just whether it's worth the effort to remove the code before the switch to EAMxx makes it a moot point anyway.

rljacob commented 1 year ago

Yes it would be worth fixing at least for the maint-2.x (and maint-1.x) branches. Short term, we'll only be running P3 cases on crusher/Frontier (SCREAM or MMF) so this doesn't need to work there except to get our test suite passing.