bandre-ucar closed 7 years ago
This is probably an array indexing bug or floating point issue that will affect clm5 as well.
Trying to debug with DDT failed because debugging packages are currently disabled on yellowstone:
Yellowstone debugging packages temporarily disabled
Issues with the GNU debugger (GDB) have resulted in a number of hung user processes on both
batch and login nodes, and CISL has temporarily disabled debugging and profiling packages as a
result. These include Allinea DDT, MAP, and others, because they use the same broken kernel
functionality. Users also are asked to not run other debugging software until a patch can be
applied during the next Yellowstone maintenance downtime.
Users who try to load modules that have been disabled will receive the following error message:
“Lmod Error: Due to a bug in the ptrace() command, running this program currently results in user
processes which cannot be terminated. Therefore, we have disabled its usage until a patch can be
applied. Thank you for your patience.”
Tried this test again after fixing a few indexing bugs during clm5 development. This time a core file was generated indicating a signal handler was called at: HydrologyTracer.F90:2675
This is in the BalanceCheck_wiso routine, totice(2)=totice(2)+wtr_h2osoi_ice(c,j,1)
I don't see any indexing issues with a quick look at the variables. So this is probably a floating point exception...?
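One way to test the floating point theory without a debugger would be to scan the array for NaN or Inf values just before the accumulation. The following is a minimal sketch, not code from the repo: the module name, the function name `count_nonfinite`, and the local `r8` kind are all made up for illustration.

```fortran
module fp_check_sketch
  ! Hypothetical helper, not part of clm: counts non-finite entries.
  use, intrinsic :: ieee_arithmetic, only: ieee_is_finite
  implicit none
  ! Local stand-in for the model's double precision kind (assumption).
  integer, parameter :: r8 = selected_real_kind(12)
contains
  ! Return the number of NaN or Inf entries in a 1-D array.
  pure function count_nonfinite(x) result(n)
    real(r8), intent(in) :: x(:)
    integer :: n
    n = count(.not. ieee_is_finite(x))
  end function count_nonfinite
end module fp_check_sketch
```

Calling this on the slice of wtr_h2osoi_ice that feeds totice(2) and writing the result to the log would show whether bad values are already present before the sum. Note that it only catches NaN and Inf; values that are finite but unphysically large would pass.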
This is also a problem with clm5. The run dumps core files for processes 2 and 14. It looks like the offending lines are around:
WaterIsotopesMod.F90:2727 WaterIsotopesMod.F90:2728
write(iulog,*) j,'snoliq=',h2osoi_liq(c,j),wtr_h2osoi_liq(c,j,m)
write(iulog,*) j,'snoice=',h2osoi_ice(c,j),wtr_h2osoi_ice(c,j,m)
From the cesm log file:
2: error - BalanceCheck: Tracer water balance
2: snl= -4
2: -3 snoliq= 0.000000000000000 0.000000000000000
2: -3 snoice= 5.075054739198060 INFO: 0031-251 task 2 exited: rc=-8
This is consistent with the clm4 error described in a previous comment. It looks like the issue is invalid memory access in wtr_h2osoi_ice....
It looks like clm5 and clm4 potentially have different errors. I have clm5 running, but clm4 is not. clm4 still dies around HydrologyTracer.F90:2675
This is in the BalanceCheck_wiso routine, totice(2)=totice(2)+wtr_h2osoi_ice(c,j,1)
Adding print statements, it looks like totice(2) is blowing up and overflowing? At least it's becoming order 1e+200. This seems to be because wtr_h2osoi_ice has values on the order of 1.0e2, 1.0e3, 1.0e37. Maybe it's an issue with a bad filter...?
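To check whether a handful of huge entries are what drives the sum to 1e+200, the accumulation could be guarded so that suspect values are skipped and counted. This is a rough sketch under stated assumptions: `guarded_sum`, `suspect_limit`, and the 1.0e30 cutoff are hypothetical and not taken from HydrologyTracer.F90, and `r8` is a local stand-in for the model's real kind.

```fortran
module tracer_sum_sketch
  implicit none
  ! Local stand-in for the model's double precision kind (assumption).
  integer, parameter :: r8 = selected_real_kind(12)
  ! Hypothetical cutoff, chosen to sit far above any physical ice mass.
  real(r8), parameter :: suspect_limit = 1.0e30_r8
contains
  ! Sum only the entries whose magnitude is below suspect_limit and
  ! report how many entries were skipped. NaN and Inf fail the
  ! comparison, so they are skipped as well.
  subroutine guarded_sum(x, total, nskipped)
    real(r8), intent(in)  :: x(:)
    real(r8), intent(out) :: total
    integer,  intent(out) :: nskipped
    logical :: ok(size(x))
    ok = abs(x) < suspect_limit
    total = sum(x, mask=ok)
    nskipped = count(.not. ok)
  end subroutine guarded_sum
end module tracer_sum_sketch
```

A nonzero nskipped for a given column would point at uninitialized or fill values reaching the loop, which fits the bad filter theory better than a genuine overflow of physical quantities.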
Since this is a) debugging output, and b) I'm not convinced it is actually doing the correct thing, I'm disabling it for now so we can add a clm4 test to the wiso test suite.
New clm4 wiso test runs under pgi in git changeset . Marking this as resolved.
Summary of Issue:
Added an SMS_D clm4 wiso test to the clm wiso test suite. The intel and gnu versions run to completion. The pgi version dies at runtime shortly after initialization completes. There are no useful messages in the logs, and no core file.
Expected behavior and actual behavior:
clm4 wiso should run under pgi.
Steps to reproduce the problem (should include create_newcase or create_test command along with any user_nl or xml changes):
What is the changeset ID of the code, and the machine you are using:
clm-betr git repo, changeset: 040b12a
Have you modified the code? If so, it must be committed and available for testing:
no