ESCOMP / SimpleLand

Simple Land Model for CESM --- *** IN DEVELOPMENT *** --- please contact for more info. See supplemental information of https://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-18-0812.1 for a description of SLIM physics. Implementation of SLIM into the main CESM trunk is ongoing. SLIM currently works with the CESM2.1 release, but must be downloaded from this repository until we finish implementing it properly into the main CESM code.

global uniform I2000 case is finding a NaN in coupler output #17

Open ekluzek opened 3 years ago

ekluzek commented 3 years ago

The following test is crashing with an error about a NaN:

SMS.f19_g17.I2000SlimRsGs.cheyenne_intel.clm-global_uniform

The output error is...

721: MML depth (negative) from surface of midpoint of each soil layer
721:
721:
721: MML the var we are trying is:
957:Image              PC                Routine            Line        Source
957:cesm.exe           000000000151B5AD  Unknown               Unknown  Unknown
957:cesm.exe           0000000000C9B2E2  shr_abort_mod_mp_         114  shr_abort_mod.F90
957:cesm.exe           00000000004F9057  abortutils_mp_end          43  abortutils.F90
957:cesm.exe           00000000004F855C  lnd_import_export         419  lnd_import_export.F90
957:cesm.exe           00000000004EE6D7  lnd_comp_mct_mp_l         482  lnd_comp_mct.F90
957:cesm.exe           00000000004249E4  component_mod_mp_         728  component_mod.F90
957:cesm.exe           000000000040A12D  cime_comp_mod_mp_        2712  cime_comp_mod.F90
957:cesm.exe           000000000042468C  MAIN__                    125  cime_driver.F90
957:cesm.exe           0000000000407E1E  Unknown               Unknown  Unknown
957:libc-2.22.so       00002B193D5046E5  __libc_start_main     Unknown  Unknown
957:cesm.exe           0000000000407D29  Unknown               Unknown  Unknown
957:MPT ERROR: Rank 957(g:957) is aborting with error code 1001.
957:    Process ID: 25969, Host: r3i2n1, Program: /glade/scratch/erik/SMS.f19_g17.I2000SlimRsGs.cheyenne_intel.clm-global_uniform.GC.slim-n1_cesm21chintelasl/bld/cesm.exe
957:    MPT Version: HPE MPT 2.19  02/23/19 05:30:09

And the line is:

call endrun( sub//' ERROR: One or more of the output from CLM to the coupler are NaN ' )

ekluzek commented 3 years ago

Running in DEBUG mode I find a problem with the following division...

          if ( snow(g) < 0.0_r8 ) then
               temp(g) = 0.0_r8
               write(iulog,*)'warning: snow<0, setting snowmasking factor to zero. (snow(g) = ',snow(g),', overwriting so snow(g)=0.0)'
               snow(g) = 0.0_r8
          else
               temp(g) = snow(g) / ( snow(g) + snowmask(g) )
          end if

It checks for snow(g) < 0, but not for (snow(g) + snowmask(g)) == 0. So it probably needs another check for that in the code.

@marysa does the above sound right to you? What do you think the best way to solve this divide-by-zero issue might be?

ekluzek commented 3 years ago

Here's the traceback from the cesm.log file:

1046:MPT: Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-49.16.x86_64
1046:MPT: (gdb) #0  0x00002b39ca0286da in waitpid ()
1046:MPT:    from /glade/u/apps/ch/os/lib64/libpthread.so.0
1046:MPT: #1  0x00002b39ca96fdb6 in mpi_sgi_system (
1046:MPT: #2  MPI_SGI_stacktraceback (
1046:MPT:     header=header@entry=0x7ffeade860c0 "MPT ERROR: Rank 1046(g:1046) received signal SIGFPE(8).\n\tProcess ID: 3297, Host: r4i4n19, Program: /glade/scratch/erik/SMS_D.f19_g17.I2000SlimRsGs.cheyenne_intel.clm-global_uniform.GC.slim-n1_cesm21ch"...) at sig.c:340
1046:MPT: #3  0x00002b39ca96ffb2 in first_arriver_handler (signo=signo@entry=8,
1046:MPT:     stack_trace_sem=stack_trace_sem@entry=0x2b39d4fc0080) at sig.c:489
1046:MPT: #4  0x00002b39ca97034b in slave_sig_handler (signo=8, siginfo=<optimized out>,
1046:MPT:     extra=<optimized out>) at sig.c:564
1046:MPT: #5  <signal handler called>
1046:MPT: #6  0x0000000000c1fccc in mml_mainmod::mml_main (bounds=..., atm2lnd_inst=...,
1046:MPT:     lnd2atm_inst=...) at /glade/work/erik/slim_cesm21/src/main/mml_main.F90:636

The traceback doesn't explicitly say divide by zero, but it does report a floating-point exception (SIGFPE) at mml_main.F90:636, so I'm assuming it's a divide by zero.
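
One detail that may help when reading the two symptoms together: if snow(g) and snowmask(g) are both exactly zero, the expression is 0/0, which under IEEE 754 arithmetic yields a NaN and signals the "invalid operation" exception rather than "divide by zero" proper. That would be consistent with a NaN reaching the coupler in the non-debug run and a SIGFPE in the debug run, assuming the debug build traps invalid operations. The standalone program below is only a sketch of that behaviour, not SLIM code; the program name and the local r8 kind are stand-ins, and it assumes a compiler that supports the ieee_arithmetic intrinsic module.

```fortran
program zero_over_zero_sketch
  ! Standalone illustration only; not taken from mml_main.F90.
  use, intrinsic :: ieee_arithmetic
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)  ! stand-in for the model's r8 kind
  ! volatile keeps the compiler from folding the division at compile time
  real(r8), volatile :: snow, snowmask
  real(r8) :: temp
  logical :: invalid_raised

  snow     = 0.0_r8
  snowmask = 0.0_r8

  ! Do not halt on invalid operations, so the NaN can be inspected instead
  call ieee_set_halting_mode(ieee_invalid, .false.)
  call ieee_set_flag(ieee_invalid, .false.)

  ! Same form as the expression in the else branch above: 0/0 here
  temp = snow / ( snow + snowmask )

  call ieee_get_flag(ieee_invalid, invalid_raised)
  print *, 'temp is NaN:         ', ieee_is_nan(temp)
  print *, 'ieee_invalid raised: ', invalid_raised
end program zero_over_zero_sketch
```

With halting on invalid operations enabled instead, as debug builds commonly do, the same division would abort the program at that line rather than produce a NaN.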

marysa commented 3 years ago

Oh! Yeah! That absolutely could be a problem and we SHOULD check to make sure we're not about to divide by zero! If snow(g)==0, temp(g) should be 0.

Here, temp(g) is the factor used to modify the surface albedo when there is snow. If there isn't much snow, temp(g) is small and more weight goes to the bare-ground albedo; if there is a lot of snow, temp(g) is large and the albedo looks more like snow albedo than bare-ground albedo. I probably thought dividing by zero could never happen because I would expect snowmask(g) (the number that controls how quickly snow makes the ground "look" like snow rather than bare ground) never to be zero. But there is nothing stopping anybody from setting snowmask(g)=0, and if that happened and there wasn't any snow on the ground, it would indeed be dividing by zero.

marysa commented 3 years ago

(also snowmask(g) should never be allowed to be negative, that could also result in weirdness)
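
To make the two conditions discussed here concrete, the following is a minimal standalone sketch of one way the guards could be combined: reject a negative snowmask, clamp negative snow to zero, and return a factor of zero whenever the denominator is not positive. It is an illustration under assumptions, not necessarily how the repository handles it; the program name, the sample values, and the choice to abort on a negative snowmask are all hypothetical.

```fortran
program snowmask_guard_sketch
  ! Standalone illustration only; names mirror the snippet quoted in this issue.
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)  ! stand-in for the model's r8 kind
  real(r8) :: snow(4), snowmask(4), temp(4)
  integer :: g

  ! Hypothetical sample values: negative snow, 0/0 case, no snow, some snow
  snow     = [  -1.0_r8, 0.0_r8,   0.0_r8,  50.0_r8 ]
  snowmask = [ 100.0_r8, 0.0_r8, 100.0_r8, 100.0_r8 ]

  do g = 1, size(snow)
     ! A negative snowmask is treated as a configuration error (assumed policy)
     if ( snowmask(g) < 0.0_r8 ) then
        error stop 'snowmask must be non-negative'
     end if

     ! Clamp negative snow to zero, as the existing code already does
     if ( snow(g) < 0.0_r8 ) snow(g) = 0.0_r8

     ! Only divide when the denominator is strictly positive
     if ( snow(g) + snowmask(g) > 0.0_r8 ) then
        temp(g) = snow(g) / ( snow(g) + snowmask(g) )
     else
        temp(g) = 0.0_r8   ! no snow and snowmask = 0: snow-masking factor is zero
     end if

     print *, g, snow(g), snowmask(g), temp(g)
  end do
end program snowmask_guard_sketch
```

Testing the denominator directly, rather than snow(g) == 0, covers the snowmask(g) = 0 case without a separate branch, since after the clamp the sum can only be zero when both terms are zero.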

ekluzek commented 3 years ago

I see the same problem with the izumi_intel test, and the izumi_nag test fails as well.

ekluzek commented 3 years ago

@marysa I checked in a fix for this in this commit...

6c8630b3c76e25e4737421e13978eea69734b930

Please look it over and make sure you approve.