Closed forsyth2 closed 4 months ago
@czender Hi Charlie, could you help to review if the logic here for get prc_typ
still good for latest nco? https://github.com/E3SM-Project/zppy/blob/08838055a8ff26fc9404f87f7c73466ff04f8506/zppy/utils.py#L143-L169
The needed change is to make prc_typ = clm
for clm2
output.
Yes, I expect this will work fine.
Thank you, Charlie!
@forsyth2 The new commit https://github.com/E3SM-Project/zppy/pull/602/commits/48988df73a1072f4a38fe89e3b7e6c76aef8c137 looks good. I don't think we need https://github.com/E3SM-Project/zppy/pull/602/commits/d936b854d32b889e112c8b5a3ed274a1c870e7df with this change.
$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-600-both-commits/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
# No results => everything worked
$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-600-1st-commit-only/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851.status:WAITING 548863
e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851.status:WAITING 548862
e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851.status:WAITING 548864
tc_analysis_1850-1851.status:RUNNING 548854
tc_analysis_1852-1853.status:WAITING 548855
$ sq
USER ACCOUNT NODE PARTITION JOBID ST REASON TIME TIME_LIMIT NAME
ac.forsy e3sm 1 debug 548855 PD Dependency 0:00 30:00 tc_analysis_1852-1853
ac.forsy e3sm 1 debug 548862 PD Dependency 0:00 30:00 e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851
ac.forsy e3sm 1 compute 548863 PD Dependency 0:00 5:00:00 e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851
ac.forsy e3sm 1 compute 548864 PD Dependency 0:00 30:00 e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851
The other commit is in fact necessary. Without it, 3 e3sm_diags
jobs don't run because they are waiting on jobs that don't exist. TC Analysis also seems to fail on 1850-1851, meaning 1852-1853 can't run.
It appears the both-commits test was done without the #611 commit, (meaning the global time series plots don't look right), so I need to re-test with that. But it does look like we need both commits.
@chengzhuzhang This is my testing summary so far. I'm hoping to get back to this later in the afternoon or tomorrow.
Also, I think there was some confusion over my debugging/testing process. To clarify, when I say "test off latest main" I mean rebasing the code off the latest main
branch. This was important because the latest commit on main
, #611, changed testing results, so it was important to test off the latest main
. The main
branch itself is unaffected -- there is no merging from this branch yet.
Furthermore, the issues I describe below specifically come up in the context of the complete-run test, so it is only possible to narrow scope/runtime so much. (The dry-run parameter is helpful, for instance, for checking pre-runtime things like dependencies). At this point, I am confident that the both-commits version is working.
But it seems that what you were getting at in yesterday's meeting was that we really shouldn't need the 1st commit (the ts.bash
edits). In that case, see the below testing log for dropping the first commit. Again, I'm planning to do further debugging myself later, but so far all I can see is that something is happening with the TC analysis.
For reference, the respective changes:
1st commit -- ts.bash
edits:
--prc_typ={{ prc_typ }}
=>
{%- if input_files.split(".")[0] == 'cam' or input_files.split(".")[0] == 'eam' or input_files.split(".")[0] == 'elm' or input_files.split(".")[0] == 'clm2' %}
--prc_typ={{ input_files.split(".")[0][:3] }}
{%- else %}
--prc_typ={{ prc_typ }}
{%- endif %}
2nd commit -- utils.py
edits:
elif tmp in ("clm2", "elm"):
=>
elif tmp in ("clm2",):
component = "lnd"
prc_typ = "clm"
elif tmp in ("elm",):
test output dir | branch | base commit | conda env | Ran pip install . && python tests/integration/utils.py ? |
lessons learned |
---|---|---|---|---|---|
/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-611-both-commits/v2.LR.historical_0201/post/scripts/ |
issue-421-fix | Add center times (611) | zppy_dev_n600 | y | No errors. Can proceed with bundles run and integration testing. |
With both the zppy/templates/ts.bash
changes and the zppy/utils.py
changes, the output is ready for final testing. (The branch with both commits actually did previously pass the tests aside from the global-time-series plot test -- and I'm almost certain that was failing because I hadn't yet rebased to include #611 -- hence why rebasing to include #611 was important).
ts.bash
edits is dropped, leaving only the utils.py
edits)test output dir | branch | base commit | conda env | Ran pip install . && python tests/integration/utils.py ? |
lessons learned |
---|---|---|---|---|---|
/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-600-1st-commit-only/v2.LR.historical_0201/post/scripts/ |
issue-421-fix-drop-1st-commit | Add center times (611) | zppy_dev_n600 | y | Note the unique_id is a misnomer. It should really be drop-1st-commit . grep -v "OK" *status shows e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851 , e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851 , e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851 , tc_analysis_1852-1853 waiting, and their jobs were still in the squeue output marked as waiting for a dependency. There's also tc_analysis_1850-1851.status:RUNNING 548854 . So we have a couple problems:1) e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851 and e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851 are waiting on jobs that apparently don't exist.2) e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851 and tc_analysis_1852-1853 are waiting on tc_analysis_1850-1851 , which is apparently silently failing. The .o file indicates /var/spool/slurmd/job548854/slurm_script: line 114: /lcrc/globalscratch/ac.forsyth2//tc-analysis_1850_1851/aew_v2.LR.historical_0201_1850_1851.txt: No such file or directory . This doesn't make any sense though because tc_analysis doesn't depend on the ts task, so dropping that commit should not have affected this.Going to try re-running. |
(Rerunning incomplete tasks) | issue-421-fix-drop-1st-commit | Add center times (611) | zppy_dev_n600 | y | Nothing actually re-ran because zppy won't re-run tasks that say "WAITING" or "RUNNING". Need to delete those status files. |
(Rerunning incomplete tasks) | issue-421-fix-drop-1st-commit | Add center times (611) | zppy_dev_n600 | y | It doesn't seem like it was a one-time issue. The bad status files still occur. Turning on dry-run to take a look at dependencies. |
(Rerunning incomplete tasks) | issue-421-fix-drop-1st-commit | Add center times (611) | zppy_dev_n600 | y | Dependency file paths are long. Changing print statement to print([os.path.basename(d) for d in dependencies]) |
(Rerunning incomplete tasks) | issue-421-fix-drop-1st-commit | Add center times (611) | zppy_dev_n600 | y | e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851 and e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851 have identical dependenices: ['tc_analysis_1850-1851.status', 'climo_atm_monthly_180x360_aave_1852-1853.status', 'climo_atm_monthly_diurnal_8xdaily_180x360_aave_1852-1853.status', 'tc_analysis_1852-1853.status', 'ts_atm_monthly_180x360_aave_1852-1853-0002.status', 'ts_rof_monthly_1852-1853-0002.status'] . It looks like everything else is blocked because TC Analysis 1850-1851 is failing. I'm not sure why tc_analysis is showing up as a dependecy for the land model-vs-model when it has sets = "lat_lon_land", only. In the tc_analysis .o file, I see NetCDF: HDF error /var/spool/slurmd/job552637/slurm_script: line 80: 331340 Killed GenerateCSMesh --res $res --alt --file ${result_dir}outCSne$res.g , but that isn't particularly helpful. |
@chengzhuzhang I just discovered the same TC analysis error testing a completely different PR (#612), so that leads me to believe it's a problem on my end, perhaps with my scratch space (since I know TC Analysis makes use of that). What I don't understand is why the both-commits version didn't run into that issue when running the complete-run test.
@forsyth2 I reviewed code change in first commit, and second commit. If I understand correctly both should take identical effect. I don't know why keeping both commits works, while drop the first commit doesn't...
Also in the /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-600-1st-commit-only/v2.LR.historical_0201/post/scripts/
, i saw all stutus files show okay
. Am I in the wrong output directory?
i saw all stutus files show okay.
I deleted the WAITING/RUNNING status files so I could re-run. I was re-running with dry-run so they haven't been reproduced yet.
I suspect the error is actually unrelated to this PR, being as I ran into a similar issue elsewhere (https://github.com/E3SM-Project/zppy/pull/602#issuecomment-2231678092). That will require more debugging to see what's wrong on my end.
@forsyth2 Got it. Thank you for clarifying. I agree there might be a separate issue, which is irrelevant to code change in the PR.
For reference, code changes summary:
zppy/templates/ts.bash
:--prc_typ={{ input_files.split(".")[0] }}
=> --prc_typ={{ input_files.split(".")[0][:3] }}
{% if input_files == 'elm.h0' -%}
=> {% if input_files.split(".")[0] == 'clm2' or input_files.split(".")[0] == 'elm' -%}
{% if input_files == 'eam.h0' -%}
=> {% if input_files.split(".")[0] == 'cam' or input_files.split(".")[0] == 'eam' -%}
zppy/templates/ts.bash
:{%- if input_files.split(".")[0] == 'cam' or input_files.split(".")[0] == 'eam' or input_files.split(".")[0] == 'elm' or input_files.split(".")[0] == 'clm2' %}
--prc_typ={{ input_files.split(".")[0][:3] }}
{%- else %}
--prc_typ=sgs
{%- endif %}
=> --prc_typ={{ prc_typ }}
where prc_typ
is defined in zppy/ts.py
:
c["component"], c["prc_typ"] = getComponent(
c["input_component"], c["input_files"]
)
The zppy/utils.py
change in this PR applies the following in that function:
elif tmp in ("clm2", "elm"):
=>
elif tmp in ("clm2",):
component = "lnd"
prc_typ = "clm"
elif tmp in ("elm",):
Confirmed the issue is #613, doing final checks now.
Fix clm2 support. Follow-up fix to #421, addressing an accidental reversion in #424. (See https://github.com/E3SM-Project/zppy/pull/421#issuecomment-2168953021 for details)