ai2cm / fv3gfs-fortran

FV3GFS Fortran for internal development at AI2

Building the fortran model with call_py_fort in debug mode leads to crashes #365

Open spencerkclark opened 1 year ago

spencerkclark commented 1 year ago

I was exploring the possibility of addressing #340 (now that we are removing the serialize tests in #364, we might as well explore eliminating testing in Docker entirely). This requires running tests in debug mode in the nix environment. In doing so I found that the basic native regression tests crash in call_py_fort-related code: https://github.com/ai2cm/fv3gfs-fortran/blob/cb106f2eb806e8c635d28d8b76ee8e80a0e20bc3/FV3/atmos_model.F90#L463

A workaround would be to build the model without call_py_fort whenever we run the tests in debug mode, but ideally these tests would not crash in debug mode even when the model is built with call_py_fort active.

A basic way to reproduce this is to copy the configure.fv3.nix file into a new file within FV3/conf, set DEBUG=Y and REPRO= within it, configure/build the model, and run the tests:

$ cp FV3/conf/configure.fv3.nix FV3/conf/configure.fv3.nix_debug

    <edit configure.fv3.nix_debug>

$ cd FV3
$ configure nix_debug
$ cd ..
$ make build_native
$ pytest -vv -k default --native tests/pytest/test_regression.py
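The manual edit step above could also be scripted. The following is only a minimal sketch: it assumes the configure file uses make-style `NAME = value` assignments, which is not confirmed in this thread, and the helper names `set_make_variable` and `make_debug_config` are hypothetical.

```python
import re


def set_make_variable(text: str, name: str, value: str) -> str:
    """Set `name` to `value` in make-style text.

    Replaces an existing assignment if one is found, otherwise appends one.
    Assumes (unverified) one `NAME = value` assignment per line.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(name)}[ \t]*[:?]?=[ \t]*).*$", re.MULTILINE
    )
    if pattern.search(text):
        # Keep everything up to and including the "=", swap only the value.
        return pattern.sub(lambda m: m.group(1) + value, text)
    sep = "" if text.endswith("\n") or not text else "\n"
    return f"{text}{sep}{name} = {value}\n"


def make_debug_config(text: str) -> str:
    """Apply the two settings described above: DEBUG=Y and an empty REPRO."""
    text = set_make_variable(text, "DEBUG", "Y")
    return set_make_variable(text, "REPRO", "")
```

To use it, read `FV3/conf/configure.fv3.nix`, pass the text through `make_debug_config`, and write the result to `FV3/conf/configure.fv3.nix_debug`.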

The traceback for one of the failing tests can be found below:

```
===================================================================== FAILURES ======================================================================
_____________________________________________________ test_regression_native[Linux-default.yml] _____________________________________________________

run_native = .run_native at 0x7f080f93c3a0>, config_filename = 'default.yml'
tmpdir = local('/tmp/pytest-of-spencerc/pytest-0/test_regression_native_Linux_d0')
system_regtest =

    @pytest.mark.parametrize(
        "config_filename",
        [
            pytest.param("default.yml", marks=pytest.mark.basic),
            pytest.param("model-level-coarse-graining.yml", marks=pytest.mark.coarse),
            pytest.param("pressure-level-coarse-graining.yml", marks=pytest.mark.coarse),
            "baroclinic.yml",
            "restart.yml",
        ],
    )
    def test_regression_native(run_native, config_filename: str, tmpdir, system_regtest):
        config = get_config(config_filename)
        rundir = tmpdir.join("rundir")
>       run_native(config, str(rundir))

test_regression.py:123:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

config = {'data_table': 'default', 'diag_table': default 2000 1 1 0 0 0 "atmos_static", -1, "hours", 1, "hours", "time" "atmos...
"all", "none", "none", 2 , 'experiment_name': 'default', 'forcing': 'gs://vcm-fv3config/data/base_forcing/v1.1/', ...}
run_dir = '/tmp/pytest-of-spencerc/pytest-0/test_regression_native_Linux_d0/rundir', error_expected = False

    def run_native(config, run_dir: str, error_expected=False):
        fv3config.write_run_directory(config, run_dir)
        completed_process = subprocess.run(
            ["mpirun", "-n", "6", exe.absolute().as_posix()],
            cwd=run_dir,
            capture_output=True,
        )
        if completed_process.returncode != 0 and not error_expected:
            print("Tail of Stderr:")
            print(completed_process.stderr[-2000:].decode())
            print("Tail of Stdout:")
            print(completed_process.stdout[-2000:].decode())
>           pytest.fail()
E           Failed

conftest.py:77: Failed
--------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------
Tail of Stderr:
0 shoc_cld= F uni_cld= F ntot3d= 1 ntot2d= 1 shocaftcnv= F indcld= -1 shoc_parm= 7000.0000000000000 1.0000000000000000 4.2857143000000004 0.69999999999999996 -999.00000000000000 ncnvw= -999 ncnvc= -999
resetting Model%frac_grid= F

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0 0x7f6c45875b90 in ???
#1 0x7f6c45874dc5 in ???
#2 0x7f6c3b54b39f in ???
#3 0x7f6c126490cf in ???
#4 0x7f6c3b20d26a in ???
#5 0x7f6c3b1fdb38 in ???
#6 0x7f6c3b1f7ab2 in ???
#7 0x7f6c3b1f8dac in ???
#8 0x7f6c3b1ffe03 in ???
#9 0x7f6c3b3060be in ???
#10 0x7f6c3b30645d in ???
#11 0x7f6c3b30648a in ???
#12 0x7f6c3b302cc8 in ???
#13 0x7f6c3b26c372 in ???
#14 0x7f6c3b22819e in ???
#15 0x7f6c3b200d67 in ???
#16 0x7f6c3b3060be in ???
#17 0x7f6c3b2260e1 in ???
#18 0x7f6c3b1f8dac in ???
#19 0x7f6c3b1ffe03 in ???
#20 0x7f6c3b1f7ab2 in ???
#21 0x7f6c3b1f8dac in ???
#22 0x7f6c3b1fce8b in ???
#23 0x7f6c3b1f7ab2 in ???
#24 0x7f6c3b1f8dac in ???
#25 0x7f6c3b1fc326 in ???
#26 0x7f6c3b1f7ab2 in ???
#27 0x7f6c3b1f8dac in ???
#28 0x7f6c3b1fc326 in ???
#29 0x7f6c3b1f7ab2 in ???
#30 0x7f6c3b2266d4 in ???
#31 0x7f6c3b226a4b in ???
#32 0x7f6c3b32c20e in ???
#33 0x7f6c3b1ff9d5 in ???
#34 0x7f6c3b3060be in ???
#35 0x7f6c3b30645d in ???
#36 0x7f6c3b30648a in ???
#37 0x7f6c45aebd84 in ???
#38 0x7f6c45aec06f in ???
#39 0x7f6c45aeba8d in ???
#40 0x7f6c45aeb466 in ???
#41 0x43c099 in __atmos_model_mod_MOD_update_atmos_physics at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/atmos_model.F90:463
#42 0x4431b6 in __atmos_model_mod_MOD_update_atmos_radiation_physics at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/atmos_model.F90:280
#43 0x476877 in coupler_main at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/coupler_main.F90:192
#44 0x47964c in main at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/coupler_main.F90:35

Tail of Stdout:
de= 0 (0=count only, 1=replace)
performing qc of albm mode= 0 (0=count only, 1=replace)
performing qc of zorm mode= 0 (0=count only, 1=replace)
performing qc of stc1m mode= 0 (0=count only, 1=replace)
performing qc of stc2m mode= 0 (0=count only, 1=replace)
performing qc of stc3m mode= 0 (0=count only, 1=replace)
performing qc of stc4m mode= 0 (0=count only, 1=replace)
performing qc of smc1m mode= 0 (0=count only, 1=replace)
performing qc of smc2m mode= 0 (0=count only, 1=replace)
performing qc of smc3m mode= 0 (0=count only, 1=replace)
performing qc of smc4m mode= 0 (0=count only, 1=replace)
performing qc of vegm mode= 1 (0=count only, 1=replace)
performing qc of vetm mode= 1 (0=count only, 1=replace)
performing qc of sotm mode= 1 (0=count only, 1=replace)
performing qc of sihm mode= 1 (0=count only, 1=replace)
performing qc of sicm mode= 1 (0=count only, 1=replace)
performing qc of vmnm mode= 1 (0=count only, 1=replace)
performing qc of vmxm mode= 1 (0=count only, 1=replace)
performing qc of slpm mode= 1 (0=count only, 1=replace)
performing qc of absm mode= 1 (0=count only, 1=replace)
============== final results ==============
dbgx --fixratio: F F F F

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 5415 RUNNING AT spencer-vm
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
```

spencerkclark commented 1 year ago

The particular flag that leads to errors is -ffpe-trap=invalid,zero,overflow; if we change it to -ffpe-trap=invalid,zero then the errors go away, so the underlying issue is apparently some kind of overflow.
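The same one-trap-at-a-time experiment could be generated for any flag string. This is only a sketch of that bookkeeping: `drop_trap` and `single_trap_variants` are hypothetical helpers, not part of the repository's build tooling, and they only manipulate the text of gfortran's `-ffpe-trap=` option.

```python
PREFIX = "-ffpe-trap="


def drop_trap(flags: str, trap: str) -> str:
    """Return `flags` with `trap` removed from any -ffpe-trap= list."""
    kept = []
    for flag in flags.split():
        if flag.startswith(PREFIX):
            traps = [t for t in flag[len(PREFIX):].split(",") if t and t != trap]
            if traps:
                kept.append(PREFIX + ",".join(traps))
            # If no traps remain, the option is dropped entirely.
        else:
            kept.append(flag)
    return " ".join(kept)


def single_trap_variants(flags: str):
    """Yield (trap, flags_without_that_trap) for each trap currently enabled."""
    for flag in flags.split():
        if flag.startswith(PREFIX):
            for trap in flag[len(PREFIX):].split(","):
                if trap:
                    yield trap, drop_trap(flags, trap)


for trap, variant in single_trap_variants("-ffpe-trap=invalid,zero,overflow"):
    print(f"without {trap}: {variant}")
```

Rebuilding and rerunning the regression test with each variant shows which trap the crash depends on; in the case reported here that is `overflow`.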