Closed: StevePny closed this issue 10 months ago
Hi, Steve. Thank you for identifying this issue. We occasionally see a crash in this location (in gfdl_mp.F90, in one of the lookup tables), and in the past I thought it was due to numerical instability; but if it does not occur with the GNU compiler, that strongly suggests something different is going on. The routine is set up so that it can never round to an integer less than 1 no matter how low the temperature is, but apparently with Intel it is either rounding down to 0 at temperatures near t_min, or it is getting a NaN and creating bad results.
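To make the two failure modes concrete, here is a minimal Python sketch of a clamped lookup-table index. It is an illustration only, not the code in gfdl_mp.F90: the function name `table_index` and the values of `t_min`, `n_entries`, and `steps_per_k` are hypothetical. It shows that clamping keeps any finite temperature inside the table, while a NaN input passes through the clamp and has to be caught separately.

```python
import math


def table_index(temp_k, t_min=113.16, n_entries=2621, steps_per_k=10.0):
    """Return a 1-based lookup-table index for a temperature in kelvin.

    All names and default values here are hypothetical, chosen only to
    illustrate the clamping logic described above.
    """
    # A NaN compares false against everything, so the clamps below would
    # not catch it; check for it explicitly instead.
    if math.isnan(temp_k):
        raise ValueError("temperature is NaN; table index is undefined")

    # Positive difference: never negative, however cold the input is.
    offset = max(temp_k - t_min, 0.0)

    # Lower bound is 1 by construction; the upper bound is clamped.
    position = min(steps_per_k * offset + 1.0, float(n_entries))

    # Truncate to an integer index.
    return int(position)


if __name__ == "__main__":
    for t in (50.0, 113.16, 200.0, 1000.0):
        print(t, table_index(t))
```

In this sketch a finite temperature below `t_min` still maps to index 1, so an out-of-bounds index of 0 would need some other cause, such as a NaN reaching the float-to-integer conversion.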
Since it is crashing shortly after initialization, it may be worthwhile to turn on range_warn and fv_debug to help try to pinpoint at what step it is crashing. Could you try that and send along the output log?
Thanks, Lucas
Hi Lucas (@lharris4 ), I've run the intel build and a 10-minute integration of the gnu build, both with range_warn and fv_debug set to 'true'. I'm attaching the output from both.
Hi, Steve. I see immediately that the gfortran run was compiled double-precision but the intel run is single-precision. I would think that by itself this would not cause a crash, but it could be a signal of some other underlying issue. The debug output doesn't suggest anything suspicious.
Hi @lharris4, this runs ok for us with version 2023.05. We've tested builds using combinations of 32-bit, 64-bit, debug, prod, non-hydrostatic, hydrostatic. It seems to be running ok for us now.
@StevePny thanks, glad to hear it works now. Hope to get to your other issues this week.
Short issue:
We are getting the runtime error:
In more detail:
We are using the latest version of SHiELD, i.e. SHiELD_BUILD_VERSION="FV3-202204-public", FV3_VERSION="FV3-202210-public", FMS_VERSION="2022.04". We're running on an Ubuntu 22.04 Linux AWS EC2 instance, and have built and run SHiELD successfully for many months using OpenMPI/gfortran.
We are now switching our build over from OpenMPI/gfortran (MKMF_TEMPLATE=linux-ubuntu-trusty-gnu.mk) to IntelMPI/ifort (MKMF_TEMPLATE="intel.mk"). We are using intel version:
Our build is based as closely as possible on this SHiELD_build repo. We're testing a 1-hour C96 simulation with our original OpenMPI/gfortran build, and it completes successfully (~300 seconds on 24 cores). With IntelMPI/ifort, the model builds successfully, but from the same experiment directory where the GNU build runs without error, the intel build gives the following error at runtime:
For reference the traceback is pointing to intermediate_phys.F90, line 257: https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/blob/d2e5bef344b64d6a10524479b3288717239fb2a2/model/intermediate_phys.F90#L257
I checked our build logs, and we are using both USE_COND and MOIST_CAPPA, which are activated due to the 'nh' setting.
I noticed this is called from: https://github.com/NOAA-GFDL/SHiELD_physics/blob/2882fdeb429abc2349a8e881803ac67b154532c3/simple_coupler/coupler_main.F90#L146C19-L146C19
As an additional piece of information, we have also generated our own control/coupler file, and do not have this runtime error with the intel build. In our case, we comment out fms_init and fms_affinity_init since fms_init is called here twice and fms_affinity_init was removed later in https://github.com/NOAA-GFDL/FMScoupler/blob/main/SHiELD/coupler_main.F90:
I've tried building the IntelMPI/ifort build both in a Docker container and with a bash script directly on the EC2 instance, and I've tried building in both 'prod' mode and 'debug' mode, but all give the same error above.
I've tried removing "export FMS_CPPDEFS=-DHAVE_GETTID" from the build options - in that case the make FMS fails.
I found a similar issue report in E3SM following an upgrade of the Intel compiler. In their case it was related to a bug, but I'm not sure if that is true here: https://github.com/E3SM-Project/E3SM/issues/2051
Have you seen this error before, and do you have any idea what might be causing it? I recall getting a similar error in Dec 2022 and I believe the FMS version was part of the problem, and it was resolved by upgrading FMS. However, the FMS versions are the same between builds in this case.