E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

shoc_tests fails with certain rng seeds #1187

Closed: bartgol closed this issue 3 years ago

bartgol commented 3 years ago

Reproducer:

 ./shoc_tests --rng-seed 620239026

Relevant output:

-------------------------------------------------------------------------------
pblintd_bfb
-------------------------------------------------------------------------------
/ascldap/users/lbertag/workdir/scream/scream-src/branch/components/scream/src/physics/shoc/tests/shoc_pblintd_tests.cpp:208
...............................................................................

/ascldap/users/lbertag/workdir/scream/scream-src/branch/components/scream/src/physics/shoc/tests/shoc_pblintd_tests.cpp:208: FAILED:
due to a fatal error condition:
  SIGFPE - Floating point error signal

P.S.: This is why it's great that we print our RNG seed: whenever we hit a bad case, we can take the seed from the output and re-create the same random inputs.
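To illustrate the idea, here is a minimal sketch of seed-based reproducibility. It is not scream's actual test harness: `make_inputs` and `check_reproducible` are hypothetical names, and `std::mt19937` with a `[0, 1)` range is only a stand-in for whatever engine and ranges the tests really use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: draw n pseudo-random inputs from a seeded engine.
// The engine type and the [0, 1) range are assumptions for illustration.
std::vector<double> make_inputs(std::uint32_t seed, std::size_t n) {
  std::mt19937 engine(seed);
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  std::vector<double> values(n);
  for (auto& v : values) {
    v = dist(engine);
  }
  return values;
}

// Re-running with the seed taken from a failing run's output
// regenerates exactly the same inputs on the same platform.
void check_reproducible(std::uint32_t seed) {
  assert(make_inputs(seed, 16) == make_inputs(seed, 16));
}
```

Calling `check_reproducible(620239026u)` passes because the engine is fully determined by its seed; the concrete values can still differ between standard library implementations.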

bartgol commented 3 years ago

Note: some failures are "ok". Our way of generating inputs might produce bad inputs once in a million runs, and if fixing that would be too complicated, we might accept the 1-in-1M failure. If that's the case, then this is "fine". But we should diagnose it first.

tcclevenger commented 3 years ago

I think this is a "1-in-1M failure" situation. It comes from a division by zero in the middle of the function, where the denominator is 1 - 1.5*0.31802/0.47703 = 0. This only triggers in single precision (SP), not in double precision (DP). In the formula, 1 and 1.5 are constants, 0.47703 is obklen, a random input for the test, and 0.31802 is the computed value pblh. I guess changing the range of the input obklen to be greater than 1 would ensure this never happens again, but I don't think we'll ever see it again anyway.
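As a rough sketch of why this can show up in SP but not in DP: the function name `denominator` is hypothetical, only the shape `1 - 1.5*pblh/obklen` is taken from the formula above, and the values are chosen so the rounding is easy to verify by hand. They are not the values from the failing run.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the expression in the failing code path.
template <typename T>
T denominator(T pblh, T obklen) {
  assert(obklen != T(0));
  return T(1) - T(1.5) * pblh / obklen;
}

// Illustrative values (an assumption, not the test's inputs):
// pblh sits 2^-30 (relative) above 2.0, so 1.5*pblh/obklen is slightly
// above 1 in double. Casting pblh to float rounds it to exactly 2.0f,
// the ratio becomes exactly 1, and the denominator cancels to 0.
inline bool cancels_only_in_single() {
  const double pblh = 2.0 * (1.0 + std::ldexp(1.0, -30));
  const double obklen = 3.0;
  const float sp = denominator<float>(static_cast<float>(pblh),
                                      static_cast<float>(obklen));
  const double dp = denominator<double>(pblh, obklen);
  return sp == 0.0f && dp != 0.0;
}
```

The same mechanism applies whenever 1.5*pblh and obklen agree to within single-precision rounding: the DP result is tiny but nonzero, while the SP result is exactly zero and the subsequent division raises SIGFPE when floating point exceptions are trapped.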

bartgol commented 3 years ago

Jeez, what are the odds... Are those numbers generated independently? I'm talking about 0.31802 and 0.47703. I just want to make sure one isn't computed as a function of the other in a way that could trigger this again.

If you feel confident this is just bad luck with random (and unphysical) inputs, then go ahead and close the issue.

Thanks for checking this!

tcclevenger commented 3 years ago

Yeah, it's kind of crazy. 0.47703 is a constant input and is only used in that computation. 0.31802 comes out of a lot of other computation that doesn't involve the 0.47703 value at all.