E3SM-Project / e3sm_data_docs

Documentation on E3SM simulations

https://e3sm-project.github.io/e3sm_data_docs/

Add reproduction scripts #25

Closed forsyth2 closed 1 year ago

forsyth2 commented 1 year ago

Add reproduction scripts. Resolves #23.

[x] Generate remaining reproduction scripts
[x] Run a representative set of the reproduction scripts to check that they work
[ ] Make any necessary fixes to the auto-generator scripts, re-run that set of reproduction scripts.
[ ] Update the table of reproduction scripts. This will be a separate pull request, since proper testing requires the scripts to be on the main branch: https://github.com/E3SM-Project/e3sm_data_docs/blob/main/utils/generate_tables.py#L53 sets run_script_reproduction = f"https://github.com/E3SM-Project/e3sm_data_docs/tree/main/run_scripts/v2/reproduce/run.{name}.sh"

forsyth2 commented 1 year ago

I ran a representative set of the reproduction scripts. In doing so, I found that the current iteration of patch_helper.py creates two definitions of the CASE_NAME variable and sets PELAYOUT and WALLTIME incorrectly for non-production runs. To run this representative set, I fixed these issues in the reproduction scripts by hand.
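
A duplicate definition like the CASE_NAME one can be caught mechanically before a script is submitted. The following is a minimal sketch, not part of patch_helper.py: the function name, the regular expression, and the sample fragment are all hypothetical, and it assumes variables are set with plain `NAME=value`, `readonly NAME=value`, or `export NAME=value` lines.

```python
import re
from collections import Counter

# Matches bash assignments such as `readonly CASE_NAME="..."` or `CASE_NAME=...`.
ASSIGNMENT = re.compile(r"^\s*(?:readonly\s+|export\s+)?([A-Za-z_][A-Za-z0-9_]*)=")


def duplicate_definitions(script_text: str) -> dict[str, int]:
    """Return the variables assigned more than once, with their counts."""
    counts = Counter()
    for line in script_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            counts[match.group(1)] += 1
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical script fragment, not copied from a real run script.
sample = '''\
readonly CASE_NAME="v2.LR.historical_0101"
readonly PELAYOUT="ML"
readonly CASE_NAME="v2.LR.historical_0101"
'''
print(duplicate_definitions(sample))  # {'CASE_NAME': 2}
```

Running a check like this over every generated script would flag the problem without having to submit a job first.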

Passed: v2.LR.hist-GHG_0101, v2.LR.amip_0101, v2.LR.piClim-histall_0021, v2.NARRM.piControl, v2.NARRM.historical_0101

Remaining issues:

v2.LR.historical_0101: e3sm.log.285769.230216-164230

675: MPI_ABORT was invoked on rank 675 in communicator MPI_COMM_WORLD
675: with errorcode 1001.
675:
675: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
675: You may or may not see output from other processes, depending on
675: exactly when Open MPI kills them.

v2.LR.historical_0201: e3sm.log.285811.230216-200219

648: MPI_ABORT was invoked on rank 648 in communicator MPI_COMM_WORLD
648: with errorcode 1001.
648:
648: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
648: You may or may not see output from other processes, depending on
648: exactly when Open MPI kills them.

v2.LR.piClim-control: run_tests.o286524

ERROR: Reference case directory /lcrc/group/e3sm/ac.forsyth2/E3SMv2_test/v2.LR.piClim-control/init does not exist or is not readable

v2.NARRM.amip_0101: e3sm.log.289007.230222-182740

srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
  0: slurmstepd: error: *** STEP 289007.0 ON chr-0498 CANCELLED AT 2023-02-22T22:28:00 DUE TO TIME LIMIT ***

(even after increasing time from 00:20:00 to 04:00:00).

forsyth2 commented 1 year ago

This process is now largely automated. This was necessary because ~40 reproduction scripts had to be created correctly. Doing so manually is extremely error-prone, notably:

Forgetting to change a particular line from the original script.
Forgetting which script I'm on and thus wasting time duplicating an already completed script, or worse, mixing together two cases in a single script.

We have to create ~40 reproduction scripts given a) the original scripts and b) the diff between the original piControl and the reproduction piControl scripts.

We use (1) generate_reproduction_script.sh to generate a single reproduction script. First, we apply (2) diff_patch (the piControl diff) to the original script. This is, however, insufficient, as the diff_patch is imperfect when applied to original scripts other than piControl. We therefore use (3) patch_helper.py to make more significant changes to the newly generated reproduction scripts. (3) goes through the reproduction script line by line, checking for inconsistencies that must be fixed.
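
The line-by-line pass in (3) can be pictured roughly as follows. This is only a sketch under stated assumptions, not the actual patch_helper.py: the function name `patch_script`, the `overrides` mapping, and the input values are hypothetical, and it assumes the settings of interest appear as `readonly NAME=value` lines.

```python
import re

# Captures the `readonly ` prefix, the variable name, and the old value.
READONLY = re.compile(r"^(\s*readonly\s+)([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def patch_script(lines, overrides):
    """Return a copy of `lines` with selected `readonly NAME=value` lines rewritten.

    `lines` are script lines without trailing newlines (e.g. from str.splitlines()).
    `overrides` maps a variable name to the literal text to place after `=`.
    """
    patched = []
    for line in lines:
        match = READONLY.match(line)
        if match and match.group(2) in overrides:
            prefix, name = match.group(1), match.group(2)
            patched.append(f"{prefix}{name}={overrides[name]}")
        else:
            patched.append(line)
    return patched


# Hypothetical input and override values, for illustration only.
original = [
    'readonly CASE_NAME="v2.LR.piControl"',
    'readonly WALLTIME="28:00:00"',
    "readonly run='production'",
]
result = patch_script(original, {"WALLTIME": '"00:20:00"'})
for line in result:
    print(line)
```

Keeping the fixes in a single name-to-value mapping means that a change to one rule is picked up by every script the next time they are regenerated.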

I initially ran (1) manually for each case. As I made change after change to how it works, it became necessary to re-generate the reproduction scripts in a consistent, fast manner. The script (4) update_reproduction_scripts.sh handles that.

In theory, we now have finished reproduction run scripts. However, we should test at least a representative sample to make sure they work as expected. Running (5) test_reproduction_scripts.bash does just that. Once all jobs have finished (a little more than 4 hours after (5) starts running), we can check the results with (6) check_results.bash.

That's 6 files to automate this process, thus ensuring accuracy and repeatability. Documentation on running these scripts can be found at utils/README.md (https://github.com/E3SM-Project/e3sm_data_docs/tree/n23-reproduce-scripts/utils).

forsyth2 commented 1 year ago

Output of latest check_results run:

v2.LR.piControl
Line count test passed
Checksum test passed

v2.LR.historical_0101
gzip: XS_1x10_ndays/run/atm.log.*.gz: No such file or directory
Line count test failed
0 atm_XS_1x10_ndays.txt
482 atm_XS_1x10_ndays.txt
Checksum test failed
d41d8cd98f00b204e9800998ecf8427e atm_XS_1x10_ndays.txt
61a7f492bdcc6e6cd4a2b41c92546219 atm_XS_1x10_ndays.txt

v2.LR.hist-GHG_0101
Line count test passed
Checksum test passed

v2.LR.amip_0101
Line count test passed
Checksum test passed

v2.LR.piClim-control
Line count test passed
Checksum test passed

v2.LR.piClim-histall_0021
Line count test passed
Checksum test passed

v2.NARRM.piControl
Line count test passed
Checksum test passed

v2.NARRM.historical_0101
Line count test passed
Checksum test passed

v2.NARRM.amip_0101
gzip: XS_1x10_ndays/run/atm.log.*.gz: No such file or directory
Line count test failed
0 atm_XS_1x10_ndays.txt
482 atm_XS_1x10_ndays.txt
Checksum test failed
d41d8cd98f00b204e9800998ecf8427e atm_XS_1x10_ndays.txt
930b7fc7e946910c3c8e716f733d0f31 atm_XS_1x10_ndays.txt
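
One detail worth noting in the two failing cases: d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of empty input, which matches the 0-line count and the gzip "No such file" message. In other words, these failures say that no atmosphere log was produced at all, not that the model output differs. The sketch below illustrates the two tests; it is not check_results.bash itself (which is not shown in this thread), and the function names are hypothetical.

```python
import hashlib


def summarize(text: bytes) -> tuple[int, str]:
    """Return (number of newline characters, MD5 hex digest) for `text`."""
    return text.count(b"\n"), hashlib.md5(text).hexdigest()


def compare(actual: bytes, expected: bytes) -> dict[str, bool]:
    """Run a line count test and a checksum test on two extracts."""
    actual_lines, actual_md5 = summarize(actual)
    expected_lines, expected_md5 = summarize(expected)
    return {
        "line_count": actual_lines == expected_lines,
        "checksum": actual_md5 == expected_md5,
    }


# An empty extract, as produced when no atm.log.*.gz file exists.
print(summarize(b""))  # (0, 'd41d8cd98f00b204e9800998ecf8427e')
```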

v2.LR.historical_0101 still fails with the MPI error.
I left v2.LR.historical_0201 out since it had the same MPI error as v2.LR.historical_0101.
v2.LR.piClim-control works now, due to the f_out.write(f'readonly RUN_REFDIR="/lcrc/group/e3sm/${{USER}}/E3SMv2_test/{init_case_name}/init"\n') change in patch_helper.py, which makes the reproduction script use the init directory of the case it branches from rather than assuming it has its own init.
v2.NARRM.amip_0101 still fails with the time limit error.
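
The doubled braces in that f_out.write call matter: inside a Python f-string, `{{` and `}}` produce literal braces, so `${{USER}}` is written to the script as `${USER}` for bash to expand at run time, while `{init_case_name}` is filled in by Python when the script is generated. A small illustration, where the value of init_case_name is a hypothetical example and not necessarily what patch_helper.py uses for this case:

```python
init_case_name = "v2.LR.piControl"  # hypothetical value, for illustration only

line = f'readonly RUN_REFDIR="/lcrc/group/e3sm/${{USER}}/E3SMv2_test/{init_case_name}/init"\n'
print(line, end="")
# readonly RUN_REFDIR="/lcrc/group/e3sm/${USER}/E3SMv2_test/v2.LR.piControl/init"
```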

forsyth2 commented 1 year ago

@golaz I've automated a great deal of this process and made some necessary fixes (see the 2 comments above). I still have a couple of remaining errors, which I'm looking into:

`v2.LR.historical_0101`

lnd.log.295142.230307-123432 shows

 ERROR: Initial conditions file (finidat) was generated from a different surface
  dataset
 than the one being used for the current simulation (fsurdat).
 Current fsurdat: surfdata_ne30pg2_simyr1850_c210402.nc
 Surface dataset used to generate initial conditions file:
 surfdata_ne30np4.pg2_simyr1850_c201210.nc

 Possible solutions to this problem:
 (1) Make sure you are using the correct surface dataset and initial conditions
 file
 (2) If you generated the surface dataset and/or initial conditions file yoursel
 f,
     then you may need to manually change the surface_dataset global attribute o
 n the
     initial conditions file (e.g., using ncatted)
 (3) If you are confident that you are using the correct surface dataset and ini
 tial conditions file,
     yet are still experiencing this error, then you can bypass this check by se
 tting:
       check_finidat_fsurdat_consistency = .false.
     in user_nl_elm

 ENDRUN:
 ERROR in restFileMod.F90 at line 1382

 ERROR: Unknown error submitted to shr_abort_abort.

I recognized this error from when we ran the original simulations. Looking at an email from September 2021, possible solutions included:

# (1) Set
check_finidat_fsurdat_consistency = .false.

# or (2) set land surface file back
fsurdat = '/lcrc/group/e3sm/data/inputdata/lnd/clm2/surfdata_map/surfdata_ne30np4.pg2_simyr1850_c201210.nc'

I ran grep fsurdat run*.sh in /home/ac.forsyth2/e3sm_data_docs/run_scripts/v2/original and v2.LR.historical_0101 didn't show up. I wonder if check_finidat_fsurdat_consistency = .false. is now a necessity that should be added to all reproduction scripts. I can try running this script with that change to check.

`v2.NARRM.amip_0101`

$ cd /lcrc/group/e3sm/ac.forsyth2/E3SMv2_test/v2.NARRM.amip_0101/tests/XS_1x10_ndays/run
$ grep "DATE" atm.log.295219.230307-155526
# 1870/01/01 - 1870/01/04.

It seems to only get through 4 days in 20 minutes. However, I tried setting the limit as high as 4 hours (see first comment) and it still failed. /home/ac.forsyth2/E3SMv2_test/scripts/run.v2.NARRM.amip_0101.sh shows readonly run='XS_1x10_ndays', so it should only be running 10 days.
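
As a rough check of the timing argument: assuming throughput stays constant and the dates in the log reflect completed simulated days (both are assumptions), 4 days in 20 minutes extrapolates to about 50 minutes for the 10-day test. That is well under the 4-hour limit already tried, which would point to the job stalling rather than merely running slowly.

```python
def estimated_minutes(days_done: float, minutes_elapsed: float, days_target: float) -> float:
    """Linear extrapolation of wall-clock time, assuming constant throughput."""
    return minutes_elapsed / days_done * days_target


print(estimated_minutes(days_done=4, minutes_elapsed=20, days_target=10))  # 50.0
```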

forsyth2 commented 1 year ago

Looking in /home/ac.forsyth2/e3sm_data_docs/run_scripts/v2/reproduce:

Contains fsurdat (grep -l fsurdat run.*.sh):

In tested set: 4 scripts

run.v2.LR.amip_0101.sh
run.v2.LR.piClim-control.sh
run.v2.LR.piClim-histall_0021.sh
run.v2.NARRM.amip_0101.sh

Not tested: 11 scripts

run.v2.LR.amip_0101_bonus.sh
run.v2.LR.amip_0201.sh
run.v2.LR.amip_0301.sh
run.v2.LR.hist-all-xGHG-xaer_0251.sh
run.v2.LR.historical_0101_bonus.sh
run.v2.LR.piClim-histaer_0021.sh
run.v2.LR.piClim-histaer_0041.sh
run.v2.LR.piClim-histall_0041.sh
run.v2.NARRM.amip_0201.sh
run.v2.NARRM.amip_0301.sh
run.v2.NARRM.historical_0301.sh

Does not contain fsurdat (grep -L fsurdat run.*.sh):

In tested set: 5 scripts

run.v2.LR.hist-GHG_0101.sh
run.v2.LR.historical_0101.sh
run.v2.LR.piControl.sh
run.v2.NARRM.historical_0101.sh
run.v2.NARRM.piControl.sh

Not tested: the remaining 21 scripts

It's interesting that out of the 5 scripts tested that don't contain fsurdat, only run.v2.LR.historical_0101.sh fails because of it.
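
To make sure each script lands in exactly one of the two lists, the split can be done with a per-file substring test, the equivalent of `grep -l` (files with a match) and `grep -L` (files without a match). A minimal sketch, where the function name is hypothetical and the demo files are throwaway stand-ins rather than real run scripts:

```python
import tempfile
from pathlib import Path


def split_by_pattern(directory, pattern="fsurdat", glob="run.*.sh"):
    """Return (names containing `pattern`, names not containing it)."""
    with_pattern, without_pattern = [], []
    for path in sorted(Path(directory).glob(glob)):
        text = path.read_text(errors="replace")
        (with_pattern if pattern in text else without_pattern).append(path.name)
    return with_pattern, without_pattern


# Demo on throwaway files with hypothetical contents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "run.a.sh").write_text("fsurdat = 'x.nc'\n")
    Path(tmp, "run.b.sh").write_text("echo hello\n")
    demo = split_by_pattern(tmp)

print(demo)  # (['run.a.sh'], ['run.b.sh'])
```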

forsyth2 commented 1 year ago

@golaz I ran a test of run.v2.LR.historical_0101.sh with check_finidat_fsurdat_consistency = .false. added and it does in fact pass. Should we just add that line to every reproduction script?
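
If the answer is yes, the change could be folded into the generator rather than made by hand. The sketch below is one possible way to do that, not the existing patch_helper.py logic. It assumes each script appends land namelist settings through a `cat << EOF >> user_nl_elm` heredoc; that marker, the function name, and the sample fragment are assumptions to verify against the real scripts.

```python
SETTING = " check_finidat_fsurdat_consistency = .false."
HEREDOC_START = "cat << EOF >> user_nl_elm"  # assumed marker, verify against real scripts


def add_consistency_bypass(script_text: str) -> str:
    """Insert the bypass setting at the top of the user_nl_elm heredoc, once."""
    if "check_finidat_fsurdat_consistency" in script_text:
        return script_text  # already present, leave the script unchanged
    out = []
    inserted = False
    for line in script_text.splitlines(keepends=True):
        out.append(line)
        if not inserted and line.strip() == HEREDOC_START:
            out.append(SETTING + "\n")
            inserted = True
    if not inserted:
        raise ValueError("no user_nl_elm heredoc found")
    return "".join(out)


# Hypothetical script fragment, for illustration only.
sample_script = (
    "cat << EOF >> user_nl_elm\n"
    " some_existing_setting = 1\n"
    "EOF\n"
)
patched = add_consistency_bypass(sample_script)
print(patched, end="")
```

The function is idempotent, so re-running the generator does not add the setting twice, and it raises an error instead of silently skipping a script that lacks the expected heredoc.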

I'm still not sure why v2.NARRM.amip_0101 is taking so long.

forsyth2 commented 1 year ago

I ran the remaining LR tests. With the original representative set, we have:

Line count & checksum tests pass:

v2.LR.piControl
v2.LR.historical_0101 # With `check_finidat_fsurdat_consistency = .false.` added
v2.LR.hist-GHG_0101
v2.LR.amip_0101
v2.LR.piClim-control
v2.LR.piClim-histall_0021
v2.NARRM.piControl
v2.NARRM.historical_0101
v2.LR.abrupt-4xCO2_0101
v2.LR.1pctCO2_0101
v2.LR.piClim-histaer_0021
v2.LR.piClim-histaer_0041

Checksum test fails:

v2.LR.abrupt-4xCO2_0301
v2.LR.hist-GHG_0201
v2.LR.hist-GHG_0251
v2.LR.hist-GHG_0301
v2.LR.hist-aer_0201
v2.LR.hist-aer_0251
v2.LR.hist-aer_0301
v2.LR.hist-all-xGHG-xaer_0101
v2.LR.hist-all-xGHG-xaer_0251
v2.LR.hist-all-xGHG-xaer_0301
v2.LR.amip_0201
v2.LR.amip_0301

Both tests fail:

v2.NARRM.amip_0101
v2.LR.historical_0151
v2.LR.historical_0201
v2.LR.historical_0251
v2.LR.historical_0301
v2.LR.historical_0101_bonus
v2.LR.hist-all-xGHG-xaer_0201
v2.LR.amip_0101_bonus
v2.LR.piClim-histall_0041

forsyth2 commented 1 year ago

I also tested the remaining NARRM scripts:

Checksum test fails:

v2.NARRM.abrupt-4xCO2_0101
v2.NARRM.1pctCO2_0101
v2.NARRM.historical_0301

Both tests fail:

v2.NARRM.amip_0201
v2.NARRM.amip_0301

forsyth2 commented 1 year ago

I'm going to merge this PR with the passing reproduction scripts. In a second PR, I will update the documentation page to include these scripts. A third PR will address the currently failing reproduction scripts.