CESM-Development / water-isotopes

bug tracker for cesm water isotopes

clm wiso doesn't run under pgi #6

Closed bandre-ucar closed 7 years ago

bandre-ucar commented 7 years ago

Summary of Issue:

Added an SMS_D clm4 wiso test to the clm wiso test suite. The intel and gnu versions run to completion. The pgi version dies at runtime shortly after initialization completes. There are no useful messages in the logs, and no core file.

Expected behavior and actual behavior:

clm4 wiso should run under pgi.

Steps to reproduce the problem (should include create_newcase or create_test command along with any user_nl or xml changes):

./create_test -testname SMS_D_Ld3.f10_f10.ICLM40WISO.yellowstone_pgi.clm-40default
# then build and run

What is the changeset ID of the code, and the machine you are using:

clm-betr git repo, changeset: 040b12a

Have you modified the code? If so, it must be committed and available for testing:

no

bandre-ucar commented 7 years ago

This is probably an array indexing bug or floating point issue that will affect clm5 as well.

bandre-ucar commented 7 years ago

Trying to debug with DDT failed because debugging packages are temporarily disabled on yellowstone. From the CISL notice:

Yellowstone debugging packages temporarily disabled
Issues with the GNU debugger (GDB) have resulted in a number of hung user processes on both
batch and login nodes, and CISL has temporarily disabled debugging and profiling packages as a 
result. These include Allinea DDT, MAP, and others, because they use the same broken kernel
functionality. Users also are asked to not run other debugging software until a patch can be
applied during the next Yellowstone maintenance downtime.

Users who try to load modules that have been disabled will receive the following error message:

“Lmod Error: Due to a bug in the ptrace() command, running this program currently results in user
processes which cannot be terminated. Therefore, we have disabled its usage until a patch can be
applied. Thank you for your patience.”

bandre-ucar commented 7 years ago

Tried this test again after fixing a few indexing bugs during clm5 development. This time a core file was generated indicating a signal handler was called at: HydrologyTracer.F90:2675

This is in the BalanceCheck_wiso routine, totice(2)=totice(2)+wtr_h2osoi_ice(c,j,1)

I don't see any indexing issues with a quick look at the variables. So this is probably a floating point exception...?
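
One way to test the floating point suspicion without a debugger is to query the IEEE overflow flag around an accumulation like the one above. The following is a minimal standalone sketch, not code from HydrologyTracer.F90: the program name, the vals array and its contents are made up for illustration, and it assumes a compiler that supports the Fortran 2003 ieee_exceptions module.

```fortran
program overflow_flag_sketch
  use, intrinsic :: ieee_exceptions, only: ieee_overflow, ieee_set_flag, ieee_get_flag
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)  ! 8-byte real
  real(r8) :: vals(4), total
  logical  :: overflowed
  integer  :: j

  ! Made-up data: two ordinary values and two that cannot be summed in r8.
  vals = [1.0e2_r8, 1.0e3_r8, huge(1.0_r8), huge(1.0_r8)]

  ! Clear the flag, accumulate, then ask whether overflow was raised.
  call ieee_set_flag(ieee_overflow, .false.)
  total = 0.0_r8
  do j = 1, size(vals)
     total = total + vals(j)
  end do
  call ieee_get_flag(ieee_overflow, overflowed)

  if (overflowed) then
     write(*,*) 'overflow raised while accumulating, total =', total
  else
     write(*,*) 'no overflow, total =', total
  end if
end program overflow_flag_sketch
```

Flag behavior can depend on compiler and optimization level, so a check like this is probably most reliable in a debug (SMS_D style) build.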

bandre-ucar commented 7 years ago

This is also a problem with clm5. The run dumps core files for processes 2 and 14; it looks like the offending lines are around:

WaterIsotopesMod.F90:2727
WaterIsotopesMod.F90:2728

                   write(iulog,*) j,'snoliq=',h2osoi_liq(c,j),wtr_h2osoi_liq(c,j,m)
                   write(iulog,*) j,'snoice=',h2osoi_ice(c,j),wtr_h2osoi_ice(c,j,m)

From the cesm log file:

   2: error - BalanceCheck: Tracer water balance
   2: snl=                       -4
   2:           -3 snoliq=    0.000000000000000         0.000000000000000     
   2:           -3 snoice=    5.075054739198060     INFO: 0031-251  task 2 exited: rc=-8

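This is consistent with the clm4 error described in a previous comment. The snoice line prints h2osoi_ice(c,j) but the task exits before wtr_h2osoi_ice(c,j,m) is written, so it looks like the issue is an invalid memory access in wtr_h2osoi_ice....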

bandre-ucar commented 7 years ago

It looks like clm5 and clm4 potentially have different errors. I have clm5 running, but clm4 is not. clm4 still dies around HydrologyTracer.F90:2675

This is in the BalanceCheck_wiso routine, totice(2)=totice(2)+wtr_h2osoi_ice(c,j,1)

After adding print statements, it looks like totice(2) is blowing up and overflowing, or at least reaching order 1e+200. This seems to be because wtr_h2osoi_ice holds values on the order of 1.0e2, 1.0e3, and 1.0e37. Maybe it's an issue with a bad filter...?

Since this is a) debugging output, and b) I'm not convinced it is actually doing the correct thing, I'm disabling it for now so we can add a clm4 test to the wiso test suite.
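
If the check is revisited, the print statement debugging described above could be narrowed to report only the suspect entries before they are accumulated. This is a rough standalone sketch under stated assumptions: the array bounds, the suspect threshold and the injected 1.0e37 value are hypothetical, and only the names wtr_h2osoi_ice and totice come from the code quoted above.

```fortran
program find_suspect_ice
  use, intrinsic :: ieee_arithmetic, only: ieee_is_finite
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)  ! 8-byte real
  ! Hypothetical sizes and threshold, chosen only for illustration.
  integer,  parameter :: ncol = 3, jlo = -3, jhi = 2
  real(r8), parameter :: suspect = 1.0e30_r8
  real(r8) :: wtr_h2osoi_ice(ncol, jlo:jhi, 1)
  real(r8) :: totice, val
  integer  :: c, j

  wtr_h2osoi_ice = 5.0_r8
  wtr_h2osoi_ice(2, -3, 1) = 1.0e37_r8   ! made-up bad value

  totice = 0.0_r8
  do c = 1, ncol
     do j = jlo, jhi
        val = wtr_h2osoi_ice(c, j, 1)
        ! Report and skip anything non-finite or implausibly large.
        if (.not. ieee_is_finite(val) .or. abs(val) > suspect) then
           write(*,*) 'suspect wtr_h2osoi_ice at c, j =', c, j, val
           cycle
        end if
        totice = totice + val
     end do
  end do
  write(*,*) 'totice over remaining entries =', totice
end program find_suspect_ice
```

Knowing which columns and levels hold the large values would help tell a bad filter apart from uninitialized memory, though skipping entries is only a diagnostic aid and not a fix for the balance check itself.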

bandre-ucar commented 7 years ago

New clm4 wiso test runs under pgi in git changeset . Marking this as resolved.