Closed forsyth2 closed 1 year ago
I ran a representative set of the reproduction scripts. I did discover that the current iteration of patch_helper.py creates two definitions of the CASE_NAME variable and also sets PELAYOUT and WALLTIME incorrectly in the non-production run case. To run this representative set, I manually changed the reproduction scripts to fix these issues.
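A duplicate definition like the CASE_NAME one can be caught mechanically before a script is run. The following is a minimal sketch, not the logic in patch_helper.py: the function name `duplicate_definitions`, the regular expression, and the sample script text (including its values) are assumptions made for illustration.

```python
import re
from collections import Counter

# Matches shell assignments such as `readonly CASE_NAME="..."` or `CASE_NAME=...`.
ASSIGNMENT = re.compile(r'^\s*(?:readonly\s+)?([A-Za-z_][A-Za-z0-9_]*)=')


def duplicate_definitions(script_text):
    """Return the names of variables assigned more than once, sorted."""
    counts = Counter()
    for line in script_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up sample; the values are placeholders, not taken from a real script.
sample = '''readonly CASE_NAME="v2.LR.historical_0101"
readonly WALLTIME="00:20:00"
readonly CASE_NAME="v2.LR.historical_0101"
'''
print(duplicate_definitions(sample))
```

Running a check of this kind over every generated script would flag the problem without having to read each script by hand.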
Passed: v2.LR.hist-GHG_0101, v2.LR.amip_0101, v2.LR.piClim-histall_0021, v2.NARRM.piControl, v2.NARRM.historical_0101
Remaining issues:
v2.LR.historical_0101: e3sm.log.285769.230216-164230
675: MPI_ABORT was invoked on rank 675 in communicator MPI_COMM_WORLD
675: with errorcode 1001.
675:
675: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
675: You may or may not see output from other processes, depending on
675: exactly when Open MPI kills them.
v2.LR.historical_0201: e3sm.log.285811.230216-200219
648: MPI_ABORT was invoked on rank 648 in communicator MPI_COMM_WORLD
648: with errorcode 1001.
648:
648: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
648: You may or may not see output from other processes, depending on
648: exactly when Open MPI kills them.
v2.LR.piClim-control: run_tests.o286524
ERROR: Reference case directory /lcrc/group/e3sm/ac.forsyth2/E3SMv2_test/v2.LR.piClim-control/init does not exist or is not readable
v2.NARRM.amip_0101: e3sm.log.289007.230222-182740
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
0: slurmstepd: error: *** STEP 289007.0 ON chr-0498 CANCELLED AT 2023-02-22T22:28:00 DUE TO TIME LIMIT ***
(even after increasing time from 00:20:00 to 04:00:00).
There is now a large degree of automation to this process. This was necessary because ~40 reproduction scripts had to be created properly, and doing so manually is extremely error-prone.
We have to create ~40 reproduction scripts given a) the original scripts and b) the diff between the original piControl and the reproduction piControl scripts.
We use (1) generate_reproduction_script.sh to generate a single reproduction script. First, we apply (2) diff_patch (the piControl diff) to the original script. This is, however, insufficient, as the diff_patch is imperfect when applied to original scripts other than piControl. We therefore use (3) patch_helper.py to make more significant changes to the newly generated reproduction scripts. (3) goes through the reproduction script line by line, checking for inconsistencies that must be fixed.
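The line-by-line pass can be pictured roughly as follows. This is only a sketch of the general technique, since the actual rules in patch_helper.py are not shown in this thread; `patch_script`, the `overrides` mapping, and the sample lines are hypothetical.

```python
import re

# Matches lines of the form `readonly NAME=value`.
ASSIGNMENT = re.compile(r'^(\s*readonly\s+)([A-Za-z_][A-Za-z0-9_]*)=(.*)$')


def patch_script(lines, overrides):
    """Walk the script line by line, replacing values named in `overrides`
    and dropping any repeated definition of a variable already seen."""
    seen = set()
    patched = []
    for line in lines:
        match = ASSIGNMENT.match(line)
        if not match:
            patched.append(line)  # not an assignment: keep unchanged
            continue
        prefix, name, _old_value = match.groups()
        if name in seen:
            continue  # drop the duplicate definition
        seen.add(name)
        if name in overrides:
            line = f'{prefix}{name}="{overrides[name]}"'
        patched.append(line)
    return patched


# Made-up input lines for illustration only.
original = [
    'readonly CASE_NAME="example_case"',
    'readonly WALLTIME="28:00:00"',
    'echo start',
    'readonly CASE_NAME="example_case"',
]
patched = patch_script(original, {"WALLTIME": "00:20:00"})
print("\n".join(patched))
```

Keeping the fixes in one function like this means a rule change only requires regenerating the scripts, not editing each one.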
I initially ran (1) manually for each case. As I made change after change to how it works, it became necessary to re-generate the reproduction scripts in a consistent, fast manner. The script (4) update_reproduction_scripts.sh handles that.
Now, we theoretically have finished reproduction run scripts. However, we should test at least a representative sample to make sure they work as expected. Running (5) test_reproduction_scripts.bash will do just that. After checking that all jobs are finished -- a little more than 4 hours after (5) starts running -- we can check the results with (6) check_results.bash.
That's 6 files to automate this process, thus ensuring accuracy and repeatability. Documentation on running these scripts can be found at utils/README.md (https://github.com/E3SM-Project/e3sm_data_docs/tree/n23-reproduce-scripts/utils).
Output of latest check_results run:
v2.LR.piControl
Line count test passed
Checksum test passed
v2.LR.historical_0101
gzip: XS_1x10_ndays/run/atm.log.*.gz: No such file or directory
Line count test failed
0 atm_XS_1x10_ndays.txt
482 atm_XS_1x10_ndays.txt
Checksum test failed
d41d8cd98f00b204e9800998ecf8427e atm_XS_1x10_ndays.txt
61a7f492bdcc6e6cd4a2b41c92546219 atm_XS_1x10_ndays.txt
v2.LR.hist-GHG_0101
Line count test passed
Checksum test passed
v2.LR.amip_0101
Line count test passed
Checksum test passed
v2.LR.piClim-control
Line count test passed
Checksum test passed
v2.LR.piClim-histall_0021
Line count test passed
Checksum test passed
v2.NARRM.piControl
Line count test passed
Checksum test passed
v2.NARRM.historical_0101
Line count test passed
Checksum test passed
v2.NARRM.amip_0101
gzip: XS_1x10_ndays/run/atm.log.*.gz: No such file or directory
Line count test failed
0 atm_XS_1x10_ndays.txt
482 atm_XS_1x10_ndays.txt
Checksum test failed
d41d8cd98f00b204e9800998ecf8427e atm_XS_1x10_ndays.txt
930b7fc7e946910c3c8e716f733d0f31 atm_XS_1x10_ndays.txt
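The two tests can be read as a line-count comparison and an MD5 comparison of the extracted atm log text against a reference. The sketch below assumes that is what check_results.bash does (the output format resembles `wc -l` and `md5sum`, but the script itself is not shown here); the function names are hypothetical. It also shows why both failing cases report the same first checksum: d41d8cd98f00b204e9800998ecf8427e is the MD5 of empty input, which matches the 0-line file produced when gzip finds no atm.log.*.gz.

```python
import hashlib


def line_count(data: bytes) -> int:
    """Count newline characters, as `wc -l` does."""
    return data.count(b"\n")


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def compare(actual: bytes, expected: bytes) -> dict:
    """Report whether the line counts and the checksums agree."""
    return {
        "line_count": line_count(actual) == line_count(expected),
        "checksum": md5_hex(actual) == md5_hex(expected),
    }


# An empty file always hashes to the value seen in the failing cases above.
print(md5_hex(b""))  # d41d8cd98f00b204e9800998ecf8427e
```

So a first checksum of d41d8cd9... indicates that the run produced no atm log at all, rather than a log whose contents differ from the reference.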
v2.LR.historical_0101 still fails with the MPI error.
v2.LR.historical_0201 was left out since it had the same MPI error as v2.LR.historical_0101.
v2.LR.piClim-control works now, due to the f_out.write(f'readonly RUN_REFDIR="/lcrc/group/e3sm/${{USER}}/E3SMv2_test/{init_case_name}/init"\n') change in patch_helper.py, which makes the reproduction script pick up the init from the original script rather than assuming it uses its own init.
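For readers unfamiliar with the doubled braces in that f-string: `{{` and `}}` are literal braces, so `${{USER}}` is written out as the shell expansion `${USER}`, while `{init_case_name}` is substituted by Python. A small demonstration, where the value given to `init_case_name` is a placeholder chosen for illustration and not necessarily the case this script really points at:

```python
# Placeholder value for illustration only.
init_case_name = "v2.LR.piControl"

line = f'readonly RUN_REFDIR="/lcrc/group/e3sm/${{USER}}/E3SMv2_test/{init_case_name}/init"\n'
print(line, end="")
# readonly RUN_REFDIR="/lcrc/group/e3sm/${USER}/E3SMv2_test/v2.LR.piControl/init"
```

The generated script therefore leaves `${USER}` for the shell to resolve at run time, while the reference case name is fixed when the script is generated.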
v2.NARRM.amip_0101 still fails with the time limit error.
@golaz I've automated a great deal of this process and made some necessary fixes (see the 2 comments above). I still have a couple remaining errors, which I'm looking into:
v2.LR.historical_0101: lnd.log.295142.230307-123432 shows
ERROR: Initial conditions file (finidat) was generated from a different surface dataset
than the one being used for the current simulation (fsurdat).
Current fsurdat: surfdata_ne30pg2_simyr1850_c210402.nc
Surface dataset used to generate initial conditions file: surfdata_ne30np4.pg2_simyr1850_c201210.nc
Possible solutions to this problem:
(1) Make sure you are using the correct surface dataset and initial conditions file
(2) If you generated the surface dataset and/or initial conditions file yourself,
then you may need to manually change the surface_dataset global attribute on the
initial conditions file (e.g., using ncatted)
(3) If you are confident that you are using the correct surface dataset and initial conditions file,
yet are still experiencing this error, then you can bypass this check by setting:
check_finidat_fsurdat_consistency = .false.
in user_nl_elm
ENDRUN:
ERROR in restFileMod.F90 at line 1382
ERROR: Unknown error submitted to shr_abort_abort.
I recognized this error from when we ran the original simulations. Looking at an email from September 2021, possible solutions included:
# (1) Set
check_finidat_fsurdat_consistency = .false.
# or (2) set land surface file back
fsurdat = '/lcrc/group/e3sm/data/inputdata/lnd/clm2/surfdata_map/surfdata_ne30np4.pg2_simyr1850_c201210.nc'
I ran grep fsurdat run*.sh in /home/ac.forsyth2/e3sm_data_docs/run_scripts/v2/original and v2.LR.historical_0101 didn't show up. I wonder if check_finidat_fsurdat_consistency = .false. is now a necessity that should be added to all reproduction scripts. I can try running this script with that change to check.
v2.NARRM.amip_0101
$ cd /lcrc/group/e3sm/ac.forsyth2/E3SMv2_test/v2.NARRM.amip_0101/tests/XS_1x10_ndays/run
$ grep "DATE" atm.log.295219.230307-155526
# 1870/01/01 - 1870/01/04.
It seems to only get through 4 days in 20 minutes. However, I tried setting the limit as high as 4 hours (see first comment) and it still failed. /home/ac.forsyth2/E3SMv2_test/scripts/run.v2.NARRM.amip_0101.sh shows readonly run='XS_1x10_ndays', so it should only be running 10 days.
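As a rough sanity check on those numbers, assuming the throughput stays constant and ignoring initialization time (both are assumptions, not something the logs confirm):

```python
simulated_days_done = 4   # DATE lines in atm.log cover 1870/01/01 to 1870/01/04
elapsed_minutes = 20      # the original walltime limit
target_days = 10          # XS_1x10_ndays

minutes_per_day = elapsed_minutes / simulated_days_done
estimated_minutes = minutes_per_day * target_days
print(estimated_minutes)  # 50.0
```

At that rate, 10 days would need roughly 50 minutes, well under the 4-hour limit that also failed. That would suggest the run is not just uniformly slow and may be stalling at some point after day 4, though the logs shown here do not confirm this.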
Looking in /home/ac.forsyth2/e3sm_data_docs/run_scripts/v2/reproduce:
Contains fsurdat (grep -l fsurdat run.*.sh):
run.v2.LR.amip_0101.sh
run.v2.LR.piClim-control.sh
run.v2.LR.piClim-histall_0021.sh
run.v2.NARRM.amip_0101.sh
run.v2.LR.amip_0101_bonus.sh
run.v2.LR.amip_0201.sh
run.v2.LR.amip_0301.sh
run.v2.LR.hist-all-xGHG-xaer_0251.sh
run.v2.LR.historical_0101_bonus.sh
run.v2.LR.piClim-histaer_0021.sh
run.v2.LR.piClim-histaer_0041.sh
run.v2.LR.piClim-histall_0041.sh
run.v2.NARRM.amip_0201.sh
run.v2.NARRM.amip_0301.sh
run.v2.NARRM.historical_0301.sh
Does not contain fsurdat (grep -l -v fsurdat run.*.sh):
run.v2.LR.hist-GHG_0101.sh
run.v2.LR.historical_0101.sh
run.v2.LR.piControl.sh
run.v2.NARRM.amip_0101.sh
run.v2.NARRM.piControl.sh
It's interesting that out of the 5 scripts tested that don't contain fsurdat, only run.v2.LR.historical_0101.sh fails because of it.
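One caveat on the two lists: `grep -l -v fsurdat` lists files that have at least one line not matching fsurdat, which is not the same as files that never mention it (`grep -L fsurdat` does that). This may be why run.v2.NARRM.amip_0101.sh appears in both lists. A minimal sketch of an unambiguous split follows; the helper name `partition_by_substring` and the temporary demo files are made up for illustration.

```python
import tempfile
from pathlib import Path


def partition_by_substring(directory, pattern, needle):
    """Split the files matching `pattern` into those whose text contains
    `needle` anywhere and those that never mention it."""
    with_needle, without_needle = [], []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(errors="replace")
        (with_needle if needle in text else without_needle).append(path.name)
    return with_needle, without_needle


# Demo on throwaway files; real usage would point at the reproduce directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "run.a.sh").write_text("fsurdat = 'x'\necho hi\n")
    Path(tmp, "run.b.sh").write_text("echo hi\n")
    has, lacks = partition_by_substring(tmp, "run.*.sh", "fsurdat")

print(has, lacks)
```

By construction each file lands in exactly one of the two lists, so the overlap seen above cannot occur.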
@golaz I ran a test of run.v2.LR.historical_0101.sh with check_finidat_fsurdat_consistency = .false. added and it does in fact pass. Should we just add that line to every reproduction script?
I'm still not sure why v2.NARRM.amip_0101 is taking so long.
I ran the remaining LR tests. With the original representative set, we have:
Line count & checksum tests pass:
v2.LR.piControl
v2.LR.historical_0101 # With `fsurdat` line added
v2.LR.hist-GHG_0101
v2.LR.amip_0101
v2.LR.piClim-control
v2.LR.piClim-histall_0021
v2.NARRM.piControl
v2.NARRM.historical_0101
v2.LR.abrupt-4xCO2_0101
v2.LR.1pctCO2_0101
v2.LR.piClim-histaer_0021
v2.LR.piClim-histaer_0041
Checksum test fails:
v2.LR.abrupt-4xCO2_0301
v2.LR.hist-GHG_0201
v2.LR.hist-GHG_0251
v2.LR.hist-GHG_0301
v2.LR.hist-aer_0201
v2.LR.hist-aer_0251
v2.LR.hist-aer_0301
v2.LR.hist-all-xGHG-xaer_0101
v2.LR.hist-all-xGHG-xaer_0251
v2.LR.hist-all-xGHG-xaer_0301
v2.LR.amip_0201
v2.LR.amip_0301
Both tests fail:
v2.NARRM.amip_0101
v2.LR.historical_0151
v2.LR.historical_0201
v2.LR.historical_0251
v2.LR.historical_0301
v2.LR.historical_0101_bonus
v2.LR.hist-all-xGHG-xaer_0201
v2.LR.amip_0101_bonus
v2.LR.piClim-histall_0041
Also tested remaining NARRM scripts:
Checksum test fails:
v2.NARRM.abrupt-4xCO2_0101
v2.NARRM.1pctCO2_0101
v2.NARRM.historical_0301
Both tests fail:
v2.NARRM.amip_0201
v2.NARRM.amip_0301
Going to merge this PR with the passing reproduction scripts. In another PR, I will update the documentation page to include these scripts. A 3rd PR will address the currently failing reproduction scripts.
Add reproduction scripts. Resolves #23.
main branch -- https://github.com/E3SM-Project/e3sm_data_docs/blob/main/utils/generate_tables.py#L53
run_script_reproduction = f"https://github.com/E3SM-Project/e3sm_data_docs/tree/main/run_scripts/v2/reproduce/run.{name}.sh"