ESCOMP / CAM

Community Atmosphere Model

Bug in FSCAM with GNU compilers in DEBUG mode #257

Open briandobbins opened 3 years ago

briandobbins commented 3 years ago

Running SCAM with the GNU compilers with DEBUG=TRUE results in an error in CESM 2.2, but works in CESM 2.1.3.

Tested on Cheyenne:

CESM 2.1.3, Intel compiler, DEBUG=FALSE - works fine
CESM 2.1.3, Intel compiler, DEBUG=TRUE - works fine
CESM 2.1.3, GNU compiler, DEBUG=FALSE - works fine
CESM 2.1.3, GNU compiler, DEBUG=TRUE - works fine

CESM 2.2.0, Intel compiler, DEBUG=FALSE - works fine
CESM 2.2.0, Intel compiler, DEBUG=TRUE - works fine
CESM 2.2.0, GNU compiler, DEBUG=FALSE - works fine
CESM 2.2.0, GNU compiler, DEBUG=TRUE - fails, with the message below

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7efc67dc994f in ???
#1  0xf47df0 in __micro_mg3_0_MOD_micro_mg_tend
    at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/physics/pumas/micro_mg3_0.F90:1969
#2  0xc531c9 in micro_mg_cam_tend_pack
    at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/physics/cam/micro_mg_cam.F90:2517
#3  0xc710dc in __micro_mg_cam_MOD_micro_mg_cam_tend
    at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/physics/cam/micro_mg_cam.F90:1310
#4  0xc98a91 in __microp_driver_MOD_microp_driver_tend
    at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/physics/cam/microp_driver.F90:189
#5  0x662a1d in tphysbc
    at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/physics/cam/physpkg.F90:2473
#6  0x670449 in __physpkg_MOD_phys_run1
    at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/physics/cam/physpkg.F90:1073
#7  0x4fa1b3 in __cam_comp_MOD_cam_run1
    at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/control/cam_comp.F90:259
#8  0x4f4465 in __atm_comp_mct_MOD_atm_init_mct
    at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/cpl/mct/atm_comp_mct.F90:354
#9  0x427e14 in __component_mod_MOD_component_init_cc
    at /glade/scratch/bdobbins/scam/cesm2.2.0/cime/src/drivers/mct/main/component_mod.F90:248
#10  0x41e9a6 in __cime_comp_mod_MOD_cime_init
    at /glade/scratch/bdobbins/scam/cesm2.2.0/cime/src/drivers/mct/main/cime_comp_mod.F90:2209
#11  0x4243b5 in cime_driver
    at /glade/scratch/bdobbins/scam/cesm2.2.0/cime/src/drivers/mct/main/cime_driver.F90:122
#12  0x424524 in main
    at /glade/scratch/bdobbins/scam/cesm2.2.0/cime/src/drivers/mct/main/cime_driver.F90:23

To reproduce the failure in the CESM 2.2 release, do:

    export CESM22ROOT=<path to CESM 2.2 checkout>
    ${CESM22ROOT}/cime/scripts/create_newcase --compset FSCAM --res T42_T42 --compiler gnu --case foo --user-mods-dir ${CESM22ROOT}/components/cam/cime_config/usermods_dirs/scam_arm97 --run-unsupported
    cd foo
    ./xmlchange DEBUG=TRUE,PIO_TYPENAME=netcdf,STOP_N=1,STOP_OPTION=ndays
    ./case.setup
    ./case.build
    ./case.submit

I've not tested other IOPs, just arm97. I'm going to dig into this at some point, but I'm not familiar with the SCAM code base, so I thought others might have a quick solution or at least ideas.

cacraigucar commented 3 years ago

@jtruesdal @Katetc This is an error in MG3 using SCAM. I've assigned both of you since I'm not sure which code base is the one responsible for the error.

Katetc commented 3 years ago

That's a great stack trace. It points to this line in MG3:

       if (lamr(i,k) > qsmall .and. 1._r8/lamr(i,k) < Dcs) then

This is probably the same issue that Steve filed in the PUMAS repo: https://github.com/ESCOMP/PUMAS/issues/8 "Invalid code logic tripping up some compilers"

So, we are aware of the general issue in PUMAS, and glad to have a simple case that reproduces the problem here! Also tagging @andrewgettelman .

Katetc commented 3 years ago

Also, you can leave me as the main assignee. I'll fix this and add a test for it going forward when we tackle the PUMAS issue.

andrewgettelman commented 3 years ago

Thanks Brian! I mentioned this to Hugh as well.

I'm happy to try to help fix this if needed. So is it every .and. and .or. conditional? Or just those that might trigger a divide by zero error?

briandobbins commented 3 years ago

FYI, the fix Kate mentions works for this case.

Do we want to make a PR specifically for this, or allow the larger PUMAS issue to tackle it?

cacraigucar commented 3 years ago

> So is it every .and. and .or. conditional? Or just those that might trigger a divide by zero error?

The way to figure out whether an .and. or .or. needs to be split is to look at each operand and check whether it can always be evaluated safely on its own, whatever the values of the other operands. If not, that operand needs to go in its own if statement, nested inside an outer if statement that rules out the invalid condition(s).
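For readers following along, here is a minimal sketch of that split, applied to a one-dimensional stand-in for the flagged condition. Fortran does not guarantee short-circuit evaluation of .and., so the division may be evaluated even when the guard is false, and a build that traps floating-point exceptions (for example gfortran with -ffpe-trap=zero) can then stop with SIGFPE. The program name, the array values, and the values of qsmall and dcs below are assumptions made up for illustration; they are not taken from micro_mg3_0.F90, and this is not necessarily the change that went into PUMAS.

```fortran
program split_conditional
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  ! Illustrative values only (assumptions), not the constants used in MG3.
  real(r8), parameter :: qsmall = 1.e-18_r8
  real(r8), parameter :: dcs    = 500.e-6_r8
  real(r8) :: lamr(3)
  integer  :: i

  ! One zero entry to exercise the guard, two non-zero entries.
  lamr = [0._r8, 1.e3_r8, 1.e5_r8]

  do i = 1, size(lamr)
     ! Unsafe form: the compiler may evaluate both operands of .and.,
     ! so 1._r8/lamr(i) can execute even when lamr(i) is zero:
     !   if (lamr(i) > qsmall .and. 1._r8/lamr(i) < dcs) then
     !
     ! Split form: the division is only reached after the guard passes.
     if (lamr(i) > qsmall) then
        if (1._r8/lamr(i) < dcs) then
           print '(a,i0,a)', 'lamr(', i, '): both conditions hold'
        else
           print '(a,i0,a)', 'lamr(', i, '): guard holds, size test fails'
        end if
     else
        print '(a,i0,a)', 'lamr(', i, '): guard fails, division skipped'
     end if
  end do
end program split_conditional
```

One thing to watch when applying this pattern: if the original if has an else branch, that branch has to run in both failure cases (guard false, or guard true and inner test false), so it either gets duplicated or the result of the two tests is stored in a logical variable first.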

andrewgettelman commented 3 years ago

Tagging @hmorrison100 on this as well so he sees it.

gold2718 commented 3 years ago

I believe that the PUMAS issue for this is ESCOMP/PUMAS#8. Keeping this issue open so that when the fix is tagged in PUMAS, we can update the Externals_CAM.cfg file.

cacraigucar commented 1 year ago

@Katetc - Has this issue been addressed and should it be closed?

Katetc commented 1 year ago

Yes, it was fixed in pumas tag pumas_cam-release_v1.13 and cam tag 6_3_017.

cacraigucar commented 4 months ago

@Katetc - We are revisiting this, and I see that the original report was against the cesm2_2 branch. That branch uses pumas_cam-release_v1.3, so the fix is probably not in it. Should it be? If so, can we just jump to v1.13, or will taking that big a leap in PUMAS require some work from someone?