I was exploring whether we can address #340 (now that #364 removes the serialize tests, we might as well try to eliminate testing in docker entirely). This requires running the tests in debug mode in the nix environment. In doing so I found that the basic native regression tests crash in call_py_fort-related code: https://github.com/ai2cm/fv3gfs-fortran/blob/cb106f2eb806e8c635d28d8b76ee8e80a0e20bc3/FV3/atmos_model.F90#L463
A workaround would be to build the model without call_py_fort in debug mode to exercise this functionality, but ideally these tests would not crash in debug mode even when the model is built with call_py_fort active.
A basic way to reproduce this is to copy the configure.fv3.nix file into a new file within FV3/conf, set DEBUG=Y and REPRO= (left empty) within it, configure and build the model, and run the tests:
$ cp FV3/conf/configure.fv3.nix FV3/conf/configure.fv3.nix_debug
<edit configure.fv3.nix_debug>
$ cd FV3
$ configure nix_debug
$ cd ..
$ make build_native
$ pytest -vv -k default --native tests/pytest/test_regression.py
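The manual edit step above could also be scripted. Below is a minimal sketch in Python, assuming the configure file uses Makefile-style `NAME = value` lines for `DEBUG` and `REPRO` (the exact layout of configure.fv3.nix is an assumption here, and `make_debug_config` is a hypothetical helper, not something in the repository):

```python
import re


def make_debug_config(text: str) -> str:
    """Return configure-file text with DEBUG set to Y and REPRO left empty.

    Assumes Makefile-style assignments, one per line (an assumption about
    the file layout, not verified against configure.fv3.nix).
    """
    # Replace whatever follows "DEBUG =" on its line with "Y".
    text = re.sub(r"^([ \t]*DEBUG[ \t]*=).*$", r"\1Y", text, flags=re.MULTILINE)
    # Clear whatever follows "REPRO =" on its line.
    text = re.sub(r"^([ \t]*REPRO[ \t]*=).*$", r"\1", text, flags=re.MULTILINE)
    return text


# Illustrative input only; the real file contains many more settings.
sample = "DEBUG =\nREPRO = Y\nOPENMP = Y\n"
print(make_debug_config(sample))

# To apply it to the real files, something like the following could be used:
#   from pathlib import Path
#   src = Path("FV3/conf/configure.fv3.nix")
#   dst = Path("FV3/conf/configure.fv3.nix_debug")
#   dst.write_text(make_debug_config(src.read_text()))
```

Lines other than the `DEBUG` and `REPRO` assignments pass through unchanged, so the debug configuration differs from the original only in those two settings.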
The traceback for one of the failing tests can be found below:
```
===================================================================== FAILURES ======================================================================
_____________________________________________________ test_regression_native[Linux-default.yml] _____________________________________________________
run_native = .run_native at 0x7f080f93c3a0>, config_filename = 'default.yml'
tmpdir = local('/tmp/pytest-of-spencerc/pytest-0/test_regression_native_Linux_d0')
system_regtest =
@pytest.mark.parametrize(
"config_filename",
[
pytest.param("default.yml", marks=pytest.mark.basic),
pytest.param("model-level-coarse-graining.yml", marks=pytest.mark.coarse),
pytest.param("pressure-level-coarse-graining.yml", marks=pytest.mark.coarse),
"baroclinic.yml",
"restart.yml",
],
)
def test_regression_native(run_native, config_filename: str, tmpdir, system_regtest):
config = get_config(config_filename)
rundir = tmpdir.join("rundir")
> run_native(config, str(rundir))
test_regression.py:123:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
config = {'data_table': 'default', 'diag_table': default
2000 1 1 0 0 0
"atmos_static", -1, "hours", 1, "hours", "time"
"atmos... "all", "none", "none", 2
, 'experiment_name': 'default', 'forcing': 'gs://vcm-fv3config/data/base_forcing/v1.1/', ...}
run_dir = '/tmp/pytest-of-spencerc/pytest-0/test_regression_native_Linux_d0/rundir', error_expected = False
def run_native(config, run_dir: str, error_expected=False):
fv3config.write_run_directory(config, run_dir)
completed_process = subprocess.run(
["mpirun", "-n", "6", exe.absolute().as_posix()],
cwd=run_dir,
capture_output=True,
)
if completed_process.returncode != 0 and not error_expected:
print("Tail of Stderr:")
print(completed_process.stderr[-2000:].decode())
print("Tail of Stdout:")
print(completed_process.stdout[-2000:].decode())
> pytest.fail()
E Failed
conftest.py:77: Failed
--------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------
Tail of Stderr:
0 shoc_cld= F uni_cld= F ntot3d= 1 ntot2d= 1 shocaftcnv= F indcld= -1 shoc_parm= 7000.0000000000000 1.0000000000000000 4.2857143000000004 0.69999999999999996 -999.00000000000000 ncnvw= -999 ncnvc= -999
resetting Model%frac_grid= F
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7f6c45875b90 in ???
#1 0x7f6c45874dc5 in ???
#2 0x7f6c3b54b39f in ???
#3 0x7f6c126490cf in ???
#4 0x7f6c3b20d26a in ???
#5 0x7f6c3b1fdb38 in ???
#6 0x7f6c3b1f7ab2 in ???
#7 0x7f6c3b1f8dac in ???
#8 0x7f6c3b1ffe03 in ???
#9 0x7f6c3b3060be in ???
#10 0x7f6c3b30645d in ???
#11 0x7f6c3b30648a in ???
#12 0x7f6c3b302cc8 in ???
#13 0x7f6c3b26c372 in ???
#14 0x7f6c3b22819e in ???
#15 0x7f6c3b200d67 in ???
#16 0x7f6c3b3060be in ???
#17 0x7f6c3b2260e1 in ???
#18 0x7f6c3b1f8dac in ???
#19 0x7f6c3b1ffe03 in ???
#20 0x7f6c3b1f7ab2 in ???
#21 0x7f6c3b1f8dac in ???
#22 0x7f6c3b1fce8b in ???
#23 0x7f6c3b1f7ab2 in ???
#24 0x7f6c3b1f8dac in ???
#25 0x7f6c3b1fc326 in ???
#26 0x7f6c3b1f7ab2 in ???
#27 0x7f6c3b1f8dac in ???
#28 0x7f6c3b1fc326 in ???
#29 0x7f6c3b1f7ab2 in ???
#30 0x7f6c3b2266d4 in ???
#31 0x7f6c3b226a4b in ???
#32 0x7f6c3b32c20e in ???
#33 0x7f6c3b1ff9d5 in ???
#34 0x7f6c3b3060be in ???
#35 0x7f6c3b30645d in ???
#36 0x7f6c3b30648a in ???
#37 0x7f6c45aebd84 in ???
#38 0x7f6c45aec06f in ???
#39 0x7f6c45aeba8d in ???
#40 0x7f6c45aeb466 in ???
#41 0x43c099 in __atmos_model_mod_MOD_update_atmos_physics
at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/atmos_model.F90:463
#42 0x4431b6 in __atmos_model_mod_MOD_update_atmos_radiation_physics
at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/atmos_model.F90:280
#43 0x476877 in coupler_main
at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/coupler_main.F90:192
#44 0x47964c in main
at /home/spencerc/2022-10-10/fv3gfs-fortran/FV3/coupler_main.F90:35
Tail of Stdout:
de= 0 (0=count only, 1=replace)
performing qc of albm mode= 0 (0=count only, 1=replace)
performing qc of zorm mode= 0 (0=count only, 1=replace)
performing qc of stc1m mode= 0 (0=count only, 1=replace)
performing qc of stc2m mode= 0 (0=count only, 1=replace)
performing qc of stc3m mode= 0 (0=count only, 1=replace)
performing qc of stc4m mode= 0 (0=count only, 1=replace)
performing qc of smc1m mode= 0 (0=count only, 1=replace)
performing qc of smc2m mode= 0 (0=count only, 1=replace)
performing qc of smc3m mode= 0 (0=count only, 1=replace)
performing qc of smc4m mode= 0 (0=count only, 1=replace)
performing qc of vegm mode= 1 (0=count only, 1=replace)
performing qc of vetm mode= 1 (0=count only, 1=replace)
performing qc of sotm mode= 1 (0=count only, 1=replace)
performing qc of sihm mode= 1 (0=count only, 1=replace)
performing qc of sicm mode= 1 (0=count only, 1=replace)
performing qc of vmnm mode= 1 (0=count only, 1=replace)
performing qc of vmxm mode= 1 (0=count only, 1=replace)
performing qc of slpm mode= 1 (0=count only, 1=replace)
performing qc of absm mode= 1 (0=count only, 1=replace)
==============
final results
==============
dbgx --fixratio: F F F F
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 5415 RUNNING AT spencer-vm
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
```
The particular flag that leads to errors is -ffpe-trap=invalid,zero,overflow; if we change it to -ffpe-trap=invalid,zero then the errors go away, so the underlying issue appears to be a floating-point overflow somewhere beneath the call at atmos_model.F90:463.
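For context on why an overflow might only surface in debug mode: in plain Python, a float multiplication that overflows quietly produces `inf` rather than raising, whereas a process with overflow trapping enabled would receive SIGFPE for the same arithmetic. The sketch below only illustrates that silent behavior in a standalone interpreter; it is not a diagnosis of where the overflow occurs in this case, and `overflows_silently` is a hypothetical helper for illustration:

```python
import math


def overflows_silently(x: float, y: float) -> bool:
    """Return True if x * y overflows to +/-inf from finite inputs."""
    # Python float multiplication follows IEEE 754 double precision and
    # does not raise a Python exception when the result overflows.
    product = x * y
    return math.isfinite(x) and math.isfinite(y) and math.isinf(product)


print(overflows_silently(1e308, 10.0))  # True: result exceeds the double range
print(overflows_silently(2.0, 3.0))     # False: result is finite
```

If the overflow does originate in Python code reached through call_py_fort (an assumption), a check like this on intermediate values could help narrow down which operation is responsible.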