E3SM-Project / zppy

E3SM post-processing toolchain
BSD 3-Clause "New" or "Revised" License
6 stars 15 forks source link

Fix clm2 support #602

Closed forsyth2 closed 4 months ago

forsyth2 commented 5 months ago

Fix clm2 support. Follow-up fix to #421, addressing an accidental reversion in #424. (See https://github.com/E3SM-Project/zppy/pull/421#issuecomment-2168953021 for details)

chengzhuzhang commented 5 months ago

@czender Hi Charlie, could you help to review if the logic here for get prc_typ still good for latest nco? https://github.com/E3SM-Project/zppy/blob/08838055a8ff26fc9404f87f7c73466ff04f8506/zppy/utils.py#L143-L169 The needed change is to make prc_typ = clm for clm2 output.

chengzhuzhang commented 5 months ago

Yes, I expect this will work fine.

Thank you, Charlie!

chengzhuzhang commented 5 months ago

@forsyth2 The new commit https://github.com/E3SM-Project/zppy/pull/602/commits/48988df73a1072f4a38fe89e3b7e6c76aef8c137 looks good. I don't think we need https://github.com/E3SM-Project/zppy/pull/602/commits/d936b854d32b889e112c8b5a3ed274a1c870e7df with this change.

forsyth2 commented 4 months ago
$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-600-both-commits/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
# No results => everything worked

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-600-1st-commit-only/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851.status:WAITING 548863
e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851.status:WAITING 548862
e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851.status:WAITING 548864
tc_analysis_1850-1851.status:RUNNING 548854
tc_analysis_1852-1853.status:WAITING 548855
$ sq
USER     ACCOUNT NODE PARTITION JOBID   ST     REASON       TIME TIME_LIMIT NAME
ac.forsy    e3sm    1     debug 548855  PD Dependency       0:00      30:00 tc_analysis_1852-1853
ac.forsy    e3sm    1     debug 548862  PD Dependency       0:00      30:00 e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851
ac.forsy    e3sm    1   compute 548863  PD Dependency       0:00    5:00:00 e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851
ac.forsy    e3sm    1   compute 548864  PD Dependency       0:00      30:00 e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851

The other commit is in fact necessary. Without it, 3 e3sm_diags jobs don't run because they are waiting on jobs that don't exist. TC Analysis also seems to fail on 1850-1851, meaning 1852-1853 can't run.

forsyth2 commented 4 months ago

It appears the both-commits test was done without the #611 commit, (meaning the global time series plots don't look right), so I need to re-test with that. But it does look like we need both commits.

forsyth2 commented 4 months ago

@chengzhuzhang This is my testing summary so far. I'm hoping to get back to this later in the afternoon or tomorrow.

Also, I think there was some confusion over my debugging/testing process. To clarify, when I say "test off latest main" I mean rebasing the code off the latest main branch. This was important because the latest commit on main, #611, changed testing results, so it was important to test off the latest main. The main branch itself is unaffected -- there is no merging from this branch yet.

Furthermore, the issues I describe below specifically come up in the context of the complete-run test, so it is only possible to narrow scope/runtime so much. (The dry-run parameter is helpful, for instance, for checking pre-runtime things like dependencies). At this point, I am confident that the both-commits version is working.

But it seems that what you were getting at in yesterday's meeting was that we really shouldn't need the 1st commit (the ts.bash edits). In that case, see the below testing log for dropping the first commit. Again, I'm planning to do further debugging myself later, but so far all I can see is that something is happening with the TC analysis.

For reference, the respective changes:

1st commit -- ts.bash edits:

--prc_typ={{ prc_typ }}

=>

{%- if input_files.split(".")[0] == 'cam' or input_files.split(".")[0] == 'eam' or input_files.split(".")[0] == 'elm' or input_files.split(".")[0] == 'clm2'  %}
--prc_typ={{ input_files.split(".")[0][:3] }}
{%- else %}
--prc_typ={{ prc_typ }}
{%- endif %}

2nd commit -- utils.py edits:

elif tmp in ("clm2", "elm"):

=>

    elif tmp in ("clm2",):
        component = "lnd"
        prc_typ = "clm"
    elif tmp in ("elm",):

Both commits

test output dir branch base commit conda env Ran pip install . && python tests/integration/utils.py? lessons learned
/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-611-both-commits/v2.LR.historical_0201/post/scripts/ issue-421-fix Add center times (611) zppy_dev_n600 y No errors. Can proceed with bundles run and integration testing.

With both the zppy/templates/ts.bash changes and the zppy/utils.py changes, the output is ready for final testing. (The branch with both commits actually did previously pass the tests aside from the global-time-series plot test -- and I'm almost certain that was failing because I hadn't yet rebased to include #611 -- hence why rebasing to include #611 was important).

Dropping the first commit (ts.bash edits is dropped, leaving only the utils.py edits)

test output dir branch base commit conda env Ran pip install . && python tests/integration/utils.py? lessons learned
/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-600-1st-commit-only/v2.LR.historical_0201/post/scripts/ issue-421-fix-drop-1st-commit Add center times (611) zppy_dev_n600 y Note the unique_id is a misnomer. It should really be drop-1st-commit. grep -v "OK" *status shows e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851, e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851, e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851, tc_analysis_1852-1853 waiting, and their jobs were still in the squeue output marked as waiting for a dependency. There's also tc_analysis_1850-1851.status:RUNNING 548854. So we have a couple problems:

1) e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851 and e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851 are waiting on jobs that apparently don't exist.

2) e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851 and tc_analysis_1852-1853 are waiting on tc_analysis_1850-1851, which is apparently silently failing. The .o file indicates /var/spool/slurmd/job548854/slurm_script: line 114: /lcrc/globalscratch/ac.forsyth2//tc-analysis_1850_1851/aew_v2.LR.historical_0201_1850_1851.txt: No such file or directory. This doesn't make any sense though because tc_analysis doesn't depend on the ts task, so dropping that commit should not have affected this.

Going to try re-running.
(Rerunning incomplete tasks) issue-421-fix-drop-1st-commit Add center times (611) zppy_dev_n600 y Nothing actually re-ran because zppy won't re-run tasks that say "WAITING" or "RUNNING". Need to delete those status files.
(Rerunning incomplete tasks) issue-421-fix-drop-1st-commit Add center times (611) zppy_dev_n600 y It doesn't seem like it was a one-time issue. The bad status files still occur. Turning on dry-run to take a look at dependencies.
(Rerunning incomplete tasks) issue-421-fix-drop-1st-commit Add center times (611) zppy_dev_n600 y Dependency file paths are long. Changing print statement to print([os.path.basename(d) for d in dependencies])
(Rerunning incomplete tasks) issue-421-fix-drop-1st-commit Add center times (611) zppy_dev_n600 y e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851 and e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851 have identical dependenices: ['tc_analysis_1850-1851.status', 'climo_atm_monthly_180x360_aave_1852-1853.status', 'climo_atm_monthly_diurnal_8xdaily_180x360_aave_1852-1853.status', 'tc_analysis_1852-1853.status', 'ts_atm_monthly_180x360_aave_1852-1853-0002.status', 'ts_rof_monthly_1852-1853-0002.status']. It looks like everything else is blocked because TC Analysis 1850-1851 is failing. I'm not sure why tc_analysis is showing up as a dependecy for the land model-vs-model when it has sets = "lat_lon_land", only. In the tc_analysis .o file, I see NetCDF: HDF error /var/spool/slurmd/job552637/slurm_script: line 80: 331340 Killed GenerateCSMesh --res $res --alt --file ${result_dir}outCSne$res.g, but that isn't particularly helpful.
forsyth2 commented 4 months ago

@chengzhuzhang I just discovered the same TC analysis error testing a completely different PR (#612), so that leads me to believe it's a problem on my end, perhaps with my scratch space (since I know TC Analysis makes use of that). What I don't understand is why the both-commits version didn't run into that issue when running the complete-run test.

chengzhuzhang commented 4 months ago

@forsyth2 I reviewed code change in first commit, and second commit. If I understand correctly both should take identical effect. I don't know why keeping both commits works, while drop the first commit doesn't... Also in the /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-600-1st-commit-only/v2.LR.historical_0201/post/scripts/, i saw all stutus files show okay. Am I in the wrong output directory?

forsyth2 commented 4 months ago

i saw all stutus files show okay.

I deleted the WAITING/RUNNING status files so I could re-run. I was re-running with dry-run so they haven't been reproduced yet.

I suspect the error is actually unrelated to this PR, being as I ran into a similar issue elsewhere (https://github.com/E3SM-Project/zppy/pull/602#issuecomment-2231678092). That will require more debugging to see what's wrong on my end.

chengzhuzhang commented 4 months ago

@forsyth2 Got it. Thank you for clarifying. I agree there might be a separate issue, which is irrelevant to code change in the PR.

forsyth2 commented 4 months ago

For reference, code changes summary:

421 zppy/templates/ts.bash:

424 zppy/templates/ts.bash:

{%- if input_files.split(".")[0] == 'cam' or input_files.split(".")[0] == 'eam' or input_files.split(".")[0] == 'elm' or input_files.split(".")[0] == 'clm2'  %}
--prc_typ={{ input_files.split(".")[0][:3] }}
{%- else %}
--prc_typ=sgs
{%- endif %}

=> --prc_typ={{ prc_typ }} where prc_typ is defined in zppy/ts.py:

c["component"], c["prc_typ"] = getComponent(
            c["input_component"], c["input_files"]
        )

The zppy/utils.py change in this PR applies the following in that function: elif tmp in ("clm2", "elm"): =>

    elif tmp in ("clm2",):
        component = "lnd"
        prc_typ = "clm"
    elif tmp in ("elm",):
forsyth2 commented 4 months ago

Confirmed the issue is #613, doing final checks now.