E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Floating divide by zero for DEBUG runs with pack size=1 on cori #1662

Closed ndkeen closed 2 years ago

ndkeen commented 2 years ago

Trying with pack size 1 and DEBUG on cori yielded quick divide by zeros.

Not much in stack.

123: forrtl: error (73): floating divide by zero
123: Image              PC                Routine            Line        Source             
...
123: e3sm.exe           00000000008987CF  Unknown               Unknown  Unknown
123: e3sm.exe           000000000046D82E  component_mod_mp_         257  component_mod.F90
123: e3sm.exe           0000000000415A91  cime_comp_mod_mp_        1438  cime_comp_mod.F90
123: e3sm.exe           00000000004642D5  MAIN__                    122  cime_driver.F90

/global/cscratch1/sd/ndk/e3sm_scratch/cori-knl/s53-may20/f30cpu.F2010-SCREAMv1.ne30_ne30.s53-may20.intel.24s.n011b16x1.s8.12sb.DEBUG.pack1

ambrad commented 2 years ago

@ndkeen I was just running on mappy and seeing probably the same thing. I was going to open an issue but will just add to this one if that's OK.

@bartgol this may be relevant to your PR #1656. I merged that PR into today's master, and I still get a /0.

Here are details on how to reproduce:

  1. pack size 1. I don't know how to do this through cime, so I just hard coded it.
  2. ne4_ne4 F2010-SCREAMv1
  3. DEBUG=true
  4. GNU compiler

Relevant part of stack trace:

/home/ambradl/SCREAM/components/scream/src/share/util/scream_common_physics_functions_impl.hpp:88
/home/ambradl/SCREAM/components/scream/src/diagnostics/potential_temperature.cpp:62

Print statement showing some data:

if (p_mid(icol,jpack)[0] == 0)
  fprintf(stderr,"amb> potential_temperature.cpp run_impl p_mid(icol,jpack) %d %d %e\n",icol,jpack,p_mid(icol,jpack)[0]);

yielding

amb> potential_temperature.cpp run_impl p_mid(icol,jpack) 0 0 0.000000e+00

Edit: Further printing, etc., with FPEs off shows that all of p_mid is 0 at initialization but not during time stepping. Is it possible that the ne4 IC file is bad?

Edit: No, ncdump -v p_mid /sems-data-store/ACME/inputdata/atm/scream/init/init_ne4np4.nc looks good.
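
For illustration, here is a minimal sketch of why a zero p_mid trips the FPE check in the potential temperature diagnostic. It assumes the textbook formula theta = T / (p/p0)^(Rd/cp); the function names and constants are hypothetical stand-ins, not the code in scream_common_physics_functions_impl.hpp. It inspects the floating-point status flag through <cfenv> instead of trapping, so it reports the condition rather than aborting.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Hypothetical stand-in for the diagnostic: theta = T / exner,
// with exner = (p / p0)^(Rd/cp). Constants are illustrative only.
double potential_temperature(double T_mid, double p_mid) {
  assert(T_mid > 0.0);                      // temperature in Kelvin
  const double p0 = 100000.0;               // reference pressure [Pa]
  const double kappa = 287.042 / 1004.64;   // Rd / cp
  const double exner = std::pow(p_mid / p0, kappa);
  return T_mid / exner;                     // divides by zero when p_mid == 0
}

// Returns true if evaluating the formula raised the divide-by-zero flag.
bool raises_div_by_zero(double T_mid, double p_mid) {
  std::feclearexcept(FE_ALL_EXCEPT);
  // volatile keeps the compiler from folding or reordering the arithmetic
  // around the flag clear/test calls.
  volatile double T = T_mid;
  volatile double p = p_mid;
  volatile double theta = potential_temperature(T, p);
  (void)theta;
  return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```

With trapping enabled (as in a DEBUG build with FPEs on), the same division would abort the run instead of quietly producing inf, which matches the all-zero p_mid seen at initialization.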

ndkeen commented 2 years ago

What I am doing to set pack size=1 is edit components/scream/cmake/machine-files/cori-knl.cmake

and add:

set(SCREAM_PACK_SIZE 1 CACHE STRING "")

ambrad commented 2 years ago

Thanks. But that's what I mean by "hard coding"; it modifies the repo state. The question is whether there is an xml/atmchange way of doing this so it's in one's run script rather than in a mod'ed repo.

ndkeen commented 2 years ago

Using a repo from May 19th, where the last change was 257a9d5dfb, I also see this same div-by-zero.

bartgol commented 2 years ago

Ah, p_mid is not read from the input file, since Homme is supposed to compute it at run time, and since Homme runs before anything else, the AD does not see p_mid as a requirement. We can put in an easy quick fix, namely change the field in Homme from 'Computed' to 'Updated'. That will require p_mid to be in all input files, although from what I see this is probably already the case.

I'll take a look, and if I don't see a better solution, I will just change 'Computed' to 'Updated'.
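
The Computed-versus-Updated point above can be sketched roughly as follows. This is a toy model, not the AD's actual logic: the Access enum, the Process struct, and fields_needed_from_ic are hypothetical names, and the rule "a field is read from the IC file only if some process reads it before any process has provided it" is an assumption based on the description above.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical access kinds a process can declare for a field.
enum class Access { Required, Computed, Updated };

struct FieldRequest {
  std::string name;
  Access access;
};

struct Process {
  std::string name;
  std::vector<FieldRequest> requests;
};

// Walk the processes in run order. A field must come from the IC file
// if a process reads it (Required or Updated) before anyone provided it.
std::set<std::string> fields_needed_from_ic(const std::vector<Process>& procs) {
  std::set<std::string> provided;
  std::set<std::string> from_ic;
  for (const auto& proc : procs) {
    for (const auto& req : proc.requests) {
      assert(!req.name.empty());
      const bool reads = req.access != Access::Computed;
      if (reads && provided.count(req.name) == 0) {
        from_ic.insert(req.name);
      }
    }
    // Computed and Updated fields are available to later processes.
    for (const auto& req : proc.requests) {
      if (req.access != Access::Required) {
        provided.insert(req.name);
      }
    }
  }
  return from_ic;
}
```

In this model, a first process that declares p_mid as Computed hides it from the IC requirements, while declaring it Updated makes it an IC requirement, which is the quick fix proposed above.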

I don't think we ever tried to change pack size for v1 cases. But ./xmlchange --append SCREAM_CMAKE_OPTIONS="SCREAM_PACK_SIZE 1" seems to work:

$ ./xmlquery SCREAM_CMAKE_OPTIONS
    SCREAM_CMAKE_OPTIONS: SCREAM_NP 4 SCREAM_NUM_VERTICAL_LEV 72 SCREAM_NUM_TRACERS 10
$ ./xmlchange --append SCREAM_CMAKE_OPTIONS="SCREAM_PACK_SIZE 1"
$ ./xmlquery SCREAM_CMAKE_OPTIONS
    SCREAM_CMAKE_OPTIONS: SCREAM_NP 4 SCREAM_NUM_VERTICAL_LEV 72 SCREAM_NUM_TRACERS 10 SCREAM_PACK_SIZE 1

bartgol commented 2 years ago

Actually, since this appears to happen only with packsize 1, it's probably not an IC issue, but, as pointed out before, a problem with the common phys functions implementation. I will focus on that first.

ambrad commented 2 years ago

I didn't check for pack size > 1. It might happen then, too.

bartgol commented 2 years ago

Well, we have our nightlies running packsize>1 on mappy, and they don't seem to pick up this error. That's why I speculated it was a packsize=1 issue.

ambrad commented 2 years ago

But with size > 1, FPE is off, right?

bartgol commented 2 years ago

Ah, right, this is an FPE thing. So yeah, I take it back, could be any pack size.
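
One possible reason FPEs are only enabled for pack size 1 (an assumption, not stated in this thread): when the number of levels is not a multiple of the pack size, the last pack has padding lanes whose values are arbitrary, and arithmetic on them could raise spurious FPEs. A small sketch of the counting, with hypothetical function names and the 72 levels taken from the SCREAM_CMAKE_OPTIONS output above:

```cpp
#include <cassert>

// Number of packs needed to hold num_levels scalars (ceiling division).
int num_packs(int num_levels, int pack_size) {
  assert(num_levels >= 0 && pack_size > 0);
  return (num_levels + pack_size - 1) / pack_size;
}

// Lanes in the last pack that do not correspond to a real level.
int num_padding_lanes(int num_levels, int pack_size) {
  return num_packs(num_levels, pack_size) * pack_size - num_levels;
}
```

For 72 levels, a pack size of 16 leaves padding lanes in the last pack, while a pack size of 1 leaves none, so every lane that is operated on holds real data.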

ndkeen commented 2 years ago

Oh, good to know about that xmlchange option. I can look (or test) with recent repo versions if it helps to know when this might have started happening. I did just try with a repo from May 6th and I see the same div-by-zero.

ndkeen commented 2 years ago

I don't see this error now (after #1669 was merged). Will close.

bartgol commented 2 years ago

I'm assuming you meant 1669?