E3SM-Project / zppy

E3SM post-processing toolchain
BSD 3-Clause "New" or "Revised" License
6 stars 14 forks source link

Address "flaky" `ts` subtask in `complete_run` test #549

Closed forsyth2 closed 6 months ago

forsyth2 commented 6 months ago

Request criteria

Issue description

The complete_run integration test is "flaky" (e.g., as in https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html). Specifically, the [[ ts ]] > [[ land_monthly ]] subtask.

I was able to get the test passing while testing #424. However, it appears I never ran the test straight-through and instead did several re-runs, which actually had the effect of creating a different final output than if it had been run straight through.

I came across this while testing #548. The subtask was suddenly failing despite no relevant changes being made. I then tested on the main branch (specifically at a9025491014b241449a14818a1c9bed98fdd8233) and encountered the same errors. I found the following.

Run 1

Relevant sections from tests/integration/generated/test_complete_run_chrysalis.cfg:

[default]
case = v2.LR.historical_0201
constraint = ""
dry_run = "False"
environment_commands = ""
input = "/lcrc/group/e3sm/ac.forsyth2//E3SMv2/v2.LR.historical_0201"
input_subdir = archive/atm/hist
mapping_file = "map_ne30pg2_to_cmip6_180x360_aave.20200201.nc"
# To run this test, edit `output` and `www` in this file, along with `actual_images_dir` in test_complete_run.py
output = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201"
partition = "debug"
qos = "regular"
www = "/lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_complete_run_www/test-main2-20240216"

[ts]
active = True
walltime = "00:30:00"
years = "1850:1854:2",

  [[ land_monthly ]]
  extra_vars = "landfrac"
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  vars = "FSH,RH2M"
  ts_fmt = "cmip"

[ilamb]
active = True
grid = '180x360_aave'
short_name = 'v2.LR.historical_0201'
ts_num_years = 2
years = "1850:1854:2",

Then:

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
ilamb_1850-1851.status:WAITING 472320
ilamb_1852-1853.status:WAITING 472321
ts_land_monthly_1850-1851-0002.status:ERROR (4)
ts_land_monthly_1852-1853-0002.status:ERROR (4)
# The ILAMB tasks are waiting for the failed `ts_land_monthly` jobs.
$ tail -n 16 ts_land_monthly_1850-1851-0002.o472316 
2024-02-17 00:43:34,975 [WARNING]: utils.py(derive_handlers:218) >> No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-17 00:43:34,975 [WARNING]: utils.py(derive_handlers:218) >> No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-17 00:43:34,975_975:WARNING:derive_handlers:No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:220) >> --------------------------------------
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:220) >> --------------------------------------
2024-02-17 00:43:34,975_975:INFO:_get_handlers:--------------------------------------
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:221) >> | Derived CMIP6 Variable Handlers
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:221) >> | Derived CMIP6 Variable Handlers
2024-02-17 00:43:34,975_975:INFO:_get_handlers:| Derived CMIP6 Variable Handlers
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:222) >> --------------------------------------
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:222) >> --------------------------------------
2024-02-17 00:43:34,975_975:INFO:_get_handlers:--------------------------------------
2024-02-17 00:43:34,975 [ERROR]: __main__.py(_get_handlers:230) >> No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
2024-02-17 00:43:34,975 [ERROR]: __main__.py(_get_handlers:230) >> No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
2024-02-17 00:43:34,975_975:ERROR:_get_handlers:No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
srun: error: chr-0503: task 0: Exited with exit code 1
$ tail -n 16 ts_land_monthly_1852-1853-0002.o472317 
2024-02-17 00:43:34,975 [WARNING]: utils.py(derive_handlers:218) >> No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-17 00:43:34,975 [WARNING]: utils.py(derive_handlers:218) >> No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-17 00:43:34,975_975:WARNING:derive_handlers:No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:220) >> --------------------------------------
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:220) >> --------------------------------------
2024-02-17 00:43:34,975_975:INFO:_get_handlers:--------------------------------------
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:221) >> | Derived CMIP6 Variable Handlers
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:221) >> | Derived CMIP6 Variable Handlers
2024-02-17 00:43:34,975_975:INFO:_get_handlers:| Derived CMIP6 Variable Handlers
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:222) >> --------------------------------------
2024-02-17 00:43:34,975 [INFO]: __main__.py(_get_handlers:222) >> --------------------------------------
2024-02-17 00:43:34,975_975:INFO:_get_handlers:--------------------------------------
2024-02-17 00:43:34,975 [ERROR]: __main__.py(_get_handlers:230) >> No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
2024-02-17 00:43:34,975 [ERROR]: __main__.py(_get_handlers:230) >> No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
2024-02-17 00:43:34,975_975:ERROR:_get_handlers:No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
srun: error: chr-0504: task 0: Exited with exit code 1

Run 2

I changed vars = "FSH,RH2M" to vars = "LAISHA,LAISUN" and re-ran (without deleting the output directory, meaning only failing jobs were re-ran).

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
ilamb_1850-1851.status:WAITING 472320
ilamb_1852-1853.status:WAITING 472321
ts_land_monthly_1850-1851-0002.status:ERROR (5)
# The ILAMB tasks didn't run because `zppy` ignores "WAITING" tasks.
# ts_land_monthly_1850-1851-0002 succeeds now!
$ tail -n 30 ts_land_monthly_1850-1851-0002.o472322
2024-02-17 00:51:37,787 [INFO]: handler.py(cmorize:209) >> lai: Starting CMORizing
2024-02-17 00:51:37,787 [INFO]: handler.py(cmorize:209) >> lai: Starting CMORizing
2024-02-17 00:51:37,787_787:INFO:cmorize:lai: Starting CMORizing
  0%|          | 0/1 [00:00<?, ?it/s]2024-02-17 00:51:37,974 [INFO]: handler.py(_setup_cmor_module:333) >> lai: CMOR setup complete
2024-02-17 00:51:37,974 [INFO]: handler.py(_setup_cmor_module:333) >> lai: CMOR setup complete
2024-02-17 00:51:37,974_974:INFO:_setup_cmor_module:lai: CMOR setup complete
2024-02-17 00:51:37,975 [INFO]: handler.py(cmorize:239) >> lai: loading E3SM variables dict_keys(['LAISHA', 'LAISUN'])
2024-02-17 00:51:37,975 [INFO]: handler.py(cmorize:239) >> lai: loading E3SM variables dict_keys(['LAISHA', 'LAISUN'])
2024-02-17 00:51:37,975_975:INFO:cmorize:lai: loading E3SM variables dict_keys(['LAISHA', 'LAISUN'])
2024-02-17 00:51:38,180 [INFO]: handler.py(cmorize:247) >> lai: creating CMOR variable with CMOR axis objects.
2024-02-17 00:51:38,180 [INFO]: handler.py(cmorize:247) >> lai: creating CMOR variable with CMOR axis objects.
2024-02-17 00:51:38,180_180:INFO:cmorize:lai: creating CMOR variable with CMOR axis objects.
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/site-packages/e3sm_to_cmip/__main__.py", line 912, in _run_parallel
    out = res.result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
100%|██████████| 1/1 [00:00<00:00,  2.03it/s]
2024-02-17 00:51:38,329 [INFO]: __main__.py(_run_parallel:930) >> 0 of 1 handlers complete
2024-02-17 00:51:38,329 [INFO]: __main__.py(_run_parallel:930) >> 0 of 1 handlers complete
2024-02-17 00:51:38,329_329:INFO:_run_parallel:0 of 1 handlers complete
2024-02-17 00:51:38,329 [ERROR]: __main__.py(_run_parallel:934) >> lai failed to complete
2024-02-17 00:51:38,329 [ERROR]: __main__.py(_run_parallel:934) >> lai failed to complete
2024-02-17 00:51:38,329_329:ERROR:_run_parallel:lai failed to complete
2024-02-17 00:51:38,329 [ERROR]: __main__.py(_run_parallel:935) >> 0 of 1 handlers complete
2024-02-17 00:51:38,329 [ERROR]: __main__.py(_run_parallel:935) >> 0 of 1 handlers complete
2024-02-17 00:51:38,329_329:ERROR:_run_parallel:0 of 1 handlers complete
'LAISUN'
mv: cannot stat '/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002/CMIP6/CMIP/*/*/*/*/*/*/*/*/*.nc': No such file or directory

A similar cannot stat with CMIP6/CMIP line appears in issues https://github.com/E3SM-Project/zppy/issues/523#issuecomment-1805025507, #543 and in PRs https://github.com/E3SM-Project/zppy/pull/424#issuecomment-1945009855, #533, https://github.com/E3SM-Project/zppy/pull/534#issuecomment-1906552473.

Run 3

I re-ran (without deleting the output directory, meaning only failing jobs were re-ran), without changing anything.

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
ilamb_1850-1851.status:WAITING 472320
ilamb_1852-1853.status:WAITING 472321
# The second `ts_land_monthly` subtask worked this time!

Run 4

I deleted the ilamb status files and re-ran (without deleting the output directory, meaning only the status-file-missing ILAMB jobs were re-ran)

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
# No failures! ILAMB worked too, now that the ts_land_monthly dependencies succeeded.

Conclusions

The ts_land_monthly subtask is flaky. Looking at https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html, it is flaky in the sense that there is a state leak.

=> While it is useful to have zppy only re-run the failed jobs, this does not in fact guarantee that all jobs would have succeeded had the latest cfg been used from the start!

=> All pre-PR-merging zppy runs should use a clean output directory, in order to produce valid integration test results.

Remaining questions

1) So, what went wrong with testing #424? Setting vars = "FSH,LAISHA,LAISUN,RH2M" caused an error (https://github.com/E3SM-Project/zppy/pull/424#issuecomment-1945009855) which was actually equivalent to the error in Run 2 above. Rather than re-running, I removed the erroring variables LAISHA,LAISUN (https://github.com/E3SM-Project/zppy/pull/424#issuecomment-1946804983) and got the job to work. But that doesn't make sense because that means #424 testing worked with vars = "FSH,RH2M", which clearly causes errors in Run 1 above.

2) Why did Run 3 work simply by re-running without changing anything? I have a vague memory of there potentially being some sort of race or lock condition in accessing the files in post/.... Could that be what's going on? Could the two ts_land_monthly jobs possibly be interacting with each other in any way on Run 1?

forsyth2 commented 6 months ago

@tomvothecoder Do you happen to have any thoughts on this? Unfortunately, integration tests are more state-dependent than unit tests, and on top of that, zppy itself is a highly state-dependent tool.

forsyth2 commented 6 months ago

Re: remaining question (2), seeing concurrent/futures/_base.py in the Run 2 stack trace indeed seems to point to signs of a race condition somehow...

chengzhuzhang commented 6 months ago

@forsyth2 I'm trying to study what is a flaky test. It is for repeating a same test many times, but get different results. In your cases, among the 4 runs, which are the same tests? I might be a little confused.

forsyth2 commented 6 months ago

I suppose I'm using the term somewhat loosely. My point is that the complete_run test can be made to complete successfully if the cfg is run multiple times, but will fail if run only once. Notably, nothing changes in the cfg at all for run 2->3 and run 3->4.

chengzhuzhang commented 6 months ago

@forsyth2 I think I found a bug when testing the configuration file copied from https://github.com/E3SM-Project/zppy/issues/549#issue-2139727555. With this configuration file, I got ilamb jobs skipped, because ts_atm_monthly_180x360_aave_1850-1851-0002.status status files are missing. By including an atm ts task would bypass this error. But Ilamb run should not require atm variables.

Configuration file validation passed.
ts_land_monthly_1850-1851-0002
...Submitted batch job 474065
ts_land_monthly_1852-1853-0002
...Submitted batch job 474066
ilamb_1850-1851
...skipping because of dependency status file missing
   /lcrc/group/e3sm/ac.zhang40/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201_try4/post/scripts/ts_atm_monthly_180x360_aave_1850-1851-0002.status
URL: https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.zhang40/zppy_test_complete_run_www/try4/v2.LR.historical_0201/ilamb
ilamb_1852-1853
...skipping because of dependency status file missing
   /lcrc/group/e3sm/ac.zhang40/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201_try4/post/scripts/ts_atm_monthly_180x360_aave_1850-1851-0002.status
URL: https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.zhang40/zppy_test_complete_run_www/try4/v2.LR.historical_0201/ilamb
chengzhuzhang commented 6 months ago

@forsyth2 I think I figured what caused the flaky test documented here and potentially the bundle run test documented in https://github.com/E3SM-Project/zppy/issues/543.

This is a bug in setting up e3sm_to_cmip runs in zppy. For e3sm_to_cmip it expects a single set of time series variables. While in this test, two sets of time series are presented in one file directory, EX

LAISUN_185201_185312.nc
LAISHA_185201_185312.nc
LAISUN_185001_185112.nc
LAISHA_185001_185112.nc

Whether the e3sm_to_cmip tasks fail or not, depends on the timing of the time series being generated. This bug is only revealed in the complete tests with this type of setting: years = "1850:1854:2". Running with a single time series will work just fine. exyears = "1850:1854:4"`.

forsyth2 commented 6 months ago

@chengzhuzhang Thank you for identifying the problem. We'll have to set up separate subdirectories to avoid this.

$ git grep -n "cmip_ts" *
zppy/templates/ilamb.bash:49:lnd_ts_for_ilamb={{ output }}/post/lnd/180x360_aave/cmip_ts/monthly/
zppy/templates/ilamb.bash:50:atm_ts_for_ilamb={{ output }}/post/atm/180x360_aave/cmip_ts/monthly/
zppy/templates/ts.bash:145:  dest_cmip={{ output }}/post/{{ component }}/{{ grid }}/cmip_ts/{{ frequency }}

I think we would add a directory after cmip_ts, maybe the years, e.g., "1850-1851". Then the ILAMB task would be able to pick out the correct subdirectory too.

Note: another option is to force ts tasks using ts_fmt = "cmip" to run sequentially. But I don't think that's an ideal option. It's what we did with tc_analysis in https://github.com/E3SM-Project/zppy/blob/a9025491014b241449a14818a1c9bed98fdd8233/zppy/tc_analysis.py#L88-L91 and https://github.com/E3SM-Project/zppy/blob/a9025491014b241449a14818a1c9bed98fdd8233/zppy/e3sm_diags.py#L244-L250

chengzhuzhang commented 6 months ago

I tried out the solution to save data separately, unfortunately the problem persist...

chengzhuzhang commented 6 months ago

And I can provide a minimum example that runs e3sm_to_cmip standalone, but won't be able to reproduce the errors.

e3sm_to_cmip --output-path ~/test -v 'lai' --realm lnd --input-path /lcrc/group/e3sm/ac.zhang40/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201_try7/post/lnd/180x360_aave/ts/monthly/2yr --user-metadata ~/zppy/zppy/templates/e3sm_to_cmip/default_metadata.json --tables-path /lcrc/group/e3sm/diagnostics/cmip6-cmor-tables/Tables --num-proc 12
forsyth2 commented 6 months ago

I tried out the solution to save data separately, unfortunately the problem persist...

I tried as well and indeed it persists.

Unfortunately, the concurrency error message from e3sm_to_cmip is not terribly helpful. Would we be ok with forcing the ts tasks to run sequentially? The ts tasks are not a significant time sink, unlike say E3SM Diags, so I don't think it would greatly increase zppy's run time.

forsyth2 commented 6 months ago

Actually, I'm not even sure running sequentially will help. I tried rerunning zppy and still get the same error. Then, I tried running with sbatch directly (which worked in https://github.com/E3SM-Project/zppy/pull/548#issuecomment-1955411123), but it failed this time.

Is it possible the error is in e3sm_to_cmip's handling of concurrency, rather than how zppy calls e3sm_to_cmip?

chengzhuzhang commented 6 months ago

@tomvothecoder could you please help look at this issue. At this point, we think the zppy issue is caused by how e3sm_to_cmip handles concurrency. The error can't be reproduced with single e3sm_to_cmip runs.

tomvothecoder commented 6 months ago

@tomvothecoder could you please help look at this issue. At this point, we think the zppy issue is caused by how e3sm_to_cmip handles concurrency. The error can't be reproduced with single e3sm_to_cmip runs.

Sure I can take a look. What is the priority for this issue and did we want to get it fixed before E3SM-Unified 1.9.3? If I understand correctly, E3SM-Unified 1.9.3 will probably be delayed for further zppy testing and debugging.

chengzhuzhang commented 6 months ago

@tomvothecoder thanks. Yes, addressing this issue would be a high priority for the E3SM-Unified 1.9.3 release. Ryan and I have minimal examples to test e3sm_to_cmip standalone or through zppy. We can chat off line for speeding up testing.

forsyth2 commented 6 months ago

Thanks @tomvothecoder!!

tomvothecoder commented 6 months ago

@tomvothecoder thanks. Yes, addressing this issue would be a high priority for the E3SM-Unified 1.9.3 release. Ryan and I have minimal examples to test e3sm_to_cmip standalone or through zppy. We can chat off line for speeding up testing.

Got it, I'll focus on it tomorrow and let you know how it goes. I do have AWS on Friday, so hopefully I make some good progress by then.

forsyth2 commented 6 months ago

@tomvothecoder If it's useful, I tried doing some more debugging in my testing-updates PR (#554), but it's mostly just giving the same errors I've already described in this issue.

tomvothecoder commented 6 months ago

And I can provide a minimum example that runs e3sm_to_cmip standalone, but won't be able to reproduce the errors.

e3sm_to_cmip --output-path ~/test -v 'lai' --realm lnd --input-path /lcrc/group/e3sm/ac.zhang40/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201_try7/post/lnd/180x360_aave/ts/monthly/2yr --user-metadata ~/zppy/zppy/templates/e3sm_to_cmip/default_metadata.json --tables-path /lcrc/group/e3sm/diagnostics/cmip6-cmor-tables/Tables --num-proc 12

@chengzhuzhang What is the absolute directory for ~/zppy/zppy/templates/e3sm_to_cmip/default_metadata.json? I'm making an e3sm_to_cmip .py script for debugging.

EDIT: Nevermind that's just the file in the zppy repo: https://github.com/E3SM-Project/zppy/blob/main/zppy/templates/e3sm_to_cmip/default_metadata.json, which looks like the same thing as what is in the e3sm_to_cmip repo https://github.com/E3SM-Project/e3sm_to_cmip/blob/master/e3sm_to_cmip/resources/default_metadata.json

chengzhuzhang commented 6 months ago

@tomvothecoder yes! It's my home directory where I git cloned zppy. And the copy was originally taken from e3sm_to_cmip.

tomvothecoder commented 6 months ago

And I can provide a minimum example that runs e3sm_to_cmip standalone, but won't be able to reproduce the errors.

e3sm_to_cmip --output-path ~/test -v 'lai' --realm lnd --input-path /lcrc/group/e3sm/ac.zhang40/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201_try7/post/lnd/180x360_aave/ts/monthly/2yr --user-metadata ~/zppy/zppy/templates/e3sm_to_cmip/default_metadata.json --tables-path /lcrc/group/e3sm/diagnostics/cmip6-cmor-tables/Tables --num-proc 12

@chengzhuzhang just like you, I can't reproduce this issue running e3sm_to_cmip directly. Both the serial and parallel runs of my test script succeed. I can't find the root cause since this issue happens through zppy. I might have a theory and will push something in e3sm_to_cmip to see if it fixes anything.

My Analysis

I noticed that e3sm_to_cmip failures in zppy occur after this log line: 2024-02-20 22:04:45,724_724:INFO:cmorize:lai: creating CMOR variable with CMOR axis objects.. This specific log line occurs after opening multi-file input datasets (.nc files). Here are the related lines of code in e3sm_to_cmip.

The errors mentioned below are related to I/O. Specifically, one of them is raised by Xarray.

RE: NetCDF: Not a valid ID error from this comment

2024-02-20 22:04:45,965_965:ERROR:_run_parallel:0 of 1 handlers complete
NetCDF: Not a valid ID
mv: cannot stat '/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/pr-548-rebased-20240220/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002/CMIP6/CMIP/*/*/*/*/*/*/*/*/*.nc': No such file or directory

I've seen this error before while testing xCDAT by looping over large multi-file datasets and applying computations on them. This error seemed to appear at random (similar to how this test is "flaky"). Jason mentioned it could be Xarray not behaving well with the filesystem.

Possible cause: Maybe there are multiple processes trying to access the same input .nc file(s) at the same time and/or the input file(s) are closed too early or not at all?

Related Xarray GitHub comment about this issue:

When you use a with block with an xarray object, it closes any associated netCDF file when you leave the block. When you type dataset.mean(dim='round') the file is already closed.

For interactive use, it's usually preferable to just write dataset = xarray.open_dataset('snellman.nc') and close it explicitly later when/if desired.

Possible solution: Explicitly closing the xr.Dataset object after each loop in this code block

RE: "Run 2" in the PR description

2024-02-17 00:51:38,329_329:ERROR:_run_parallel:0 of 1 handlers complete
'LAISUN'
mv: cannot stat '/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002/CMIP6/CMIP/*/*/*/*/*/*/*/*/*.nc': No such file or directory

Possible cause: I think this error happens because e3sm_to_cmip fails to call cmor.write() to write the output .nc file(s) needed for this command.

Possible solution: Maybe same as above solution

tomvothecoder commented 6 months ago

I just merged https://github.com/E3SM-Project/e3sm_to_cmip/pull/244. Is there any way to test zppy with the latest master version of e3sm_to_cmip? That way we don't need to make a release for an unconfirmed fix.

forsyth2 commented 6 months ago

Yes, it's possible but annoying. (Because e3sm_to_cmip is not called directly as a task, but rather through the ts task, environment_commands can't be used directly to just run a later version of e3sm_to_cmip. Rather, the bash files have to be manually edited. This tech debt is tracked as #467, but unfortunately I don't expect that to be addressed any time soon).

In any case, I will try to test it out. Thanks for working on this.

forsyth2 commented 6 months ago

Actually looking at https://github.com/E3SM-Project/zppy/pull/466/commits/c939071a3adafb3efb1db5ee889cab52a9c27452, I suppose the easier-to-implement alternative is to just keep e3sm_to_cmip_environment_commands as a parameter. That somewhat pollutes the parameter space, but I guess that's ok.

forsyth2 commented 6 months ago

@tomvothecoder I'm testing your changes over on #556, but I'm getting the following cdms2 error.

$ tail -n 40 ts_land_monthly_1850-1851-0002.o475528 
The pfull variable handler requires both cdms2 and cdutil
Traceback (most recent call last):
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/bin/e3sm_to_cmip", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/__main__.py", line 964, in main
    app = E3SMtoCMIP(args)
          ^^^^^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/__main__.py", line 154, in __init__
    self.handlers = self._get_handlers()
                    ^^^^^^^^^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/__main__.py", line 210, in _get_handlers
    handlers = derive_handlers(
               ^^^^^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/cmor_handlers/utils.py", line 176, in derive_handlers
    handlers_by_var: Dict[str, List[Dict[str, Any]]] = _get_handlers_by_var()
                                                       ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/cmor_handlers/utils.py", line 269, in _get_handlers_by_var
    handlers_from_modules = _get_handlers_from_modules(LEGACY_HANDLER_DIR_PATH)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/cmor_handlers/utils.py", line 365, in _get_handlers_from_modules
    module = _get_handler_module(var, filepath)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/cmor_handlers/utils.py", line 404, in _get_handler_module
    module = SourceFileLoader(module_name, module_path).load_module()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap_external>", line 605, in _check_name_wrapper
  File "<frozen importlib._bootstrap_external>", line 1120, in load_module
  File "<frozen importlib._bootstrap_external>", line 945, in load_module
  File "<frozen importlib._bootstrap>", line 290, in _load_module_shim
  File "<frozen importlib._bootstrap>", line 721, in _load
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/cmor_handlers/vars/pfull.py", line 20, in <module>
    raise e
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240222/lib/python3.11/site-packages/e3sm_to_cmip/cmor_handlers/vars/pfull.py", line 16, in <module>
    import cdms2
ModuleNotFoundError: No module named 'cdms2'
srun: error: chr-0497: task 0: Exited with exit code 1
chengzhuzhang commented 6 months ago

@forsyth2 I believe cdms2 is no longer a e3sm_to_cmip dependency. By looking at the e3sm_to_cmip source code, i think somehow you are using an old version. Could you maybe try install from fresh?

I found some relevant issues: https://github.com/E3SM-Project/e3sm_to_cmip/issues/115

forsyth2 commented 6 months ago

@chengzhuzhang Yes, that's what I thought as well, but I just created a new dev environment off e3sm_to_cmip's latest master. I'll check again, maybe something went wrong with the environment creation.

forsyth2 commented 6 months ago

@chengzhuzhang, No I'm still definitely getting the The pfull variable handler requires both cdms2 and cdutil error.

$ git checkout master
$ git fetch upstream
$ git reset --hard upstream/master
$ git log # Check Tom's latest commit is present
$ mamba clean --all
$ conda env create -f conda-env/dev.yml -n e3sm_to_cmip_20240223 # Did this with `mamba` yesterday
$ conda activate e3sm_to_cmip_20240223 
$ python -m pip install . # Did this with `pip install .` yesterday

That seems to match the directions at https://github.com/E3SM-Project/e3sm_to_cmip?tab=readme-ov-file#2-conda-development-environment-and-source-code. I was reading the thread on #115, but shouldn't that all be resolved now?

chengzhuzhang commented 6 months ago

@forsyth2 I also tried to create a new env and install e3sm_to_cmip. By testing a standalone run, can't reproduce the error:

e3sm_to_cmip --output-path ~/test -v 'lai' --realm lnd --input-path /lcrc/group/e3sm/ac.zhang40/zppy_test_complete_run_output/test-main2-20240216/v2.LR.historical_0201_try7/post/lnd/180x360_aave/ts/monthly/2yr --user-metadata ~/zppy/zppy/templates/e3sm_to_cmip/default_metadata.json --tables-path /lcrc/group/e3sm/diagnostics/cmip6-cmor-tables/Tables --num-proc 12
forsyth2 commented 6 months ago

@chengzhuzhang Interesting, that command also causes the cdms2 error for me.

forsyth2 commented 6 months ago

I wonder if it's possibly related to #490. That's the only thing I can think of re: things working for other people but not me.

chengzhuzhang commented 6 months ago

@tomvothecoder added a Makefile earlier to help make sure a clean installation that make install will clean up cache files and build files before running python -m pip install . – ensures you are installing the latest version of the branch into your dev env. Can you try make install before running python -m pip install .?

forsyth2 commented 6 months ago

I wonder if it's possibly related to https://github.com/E3SM-Project/zppy/issues/490.

I don't think it's that, since that seems to be related to the zppy templateDir, and the problem exists for me using e3sm_to_cmip stand-alone.

Can you try make install before running python -m pip install .?

Yes, let me try that.

forsyth2 commented 6 months ago

@chengzhuzhang thanks, the make install was indeed the missing step. That stand-alone command and the ts task both now work for me.

forsyth2 commented 6 months ago

Resolved by https://github.com/E3SM-Project/e3sm_to_cmip/pull/244 (and by #556 which allowed for testing of the previous PR)

tomvothecoder commented 6 months ago

Resolved by E3SM-Project/e3sm_to_cmip#244 (and by #556 which allowed for testing of the previous PR)

I'm happy to hear that my PR actually fixed the issue. I'll release a new RC for e3sm_to_cmip to get that fix into E3SM-Unified.

forsyth2 commented 6 months ago

Awesome, thanks @tomvothecoder!

forsyth2 commented 4 months ago

@chengzhuzhang @tomvothecoder I ran into the e3sm_to_cmip concurrent/futures/_base.py problem again using E3SM Unified 1.10.0rc1 for a different cfg (not the test suite).

cfg:

[default]
input = /lcrc/group/e3sm2/ac.golaz/E3SMv3/v3.LR.piControl
output = /lcrc/group/e3sm/ac.forsyth2/zppy_p1_output/v3.LR.piControl
case = v3.LR.piControl
www = /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_p1_www
partition = compute
environment_commands = "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.10.0rc1_chrysalis.sh"

[ts]
active = True
walltime = "00:50:00"

  [[ atm_monthly ]]
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  ts_fmt = "cmip"
  years = "0001:0020:5",

  [[ land_monthly ]]
  extra_vars = "landfrac"
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  ts_fmt = "cmip"
  vars = "FSH,RH2M,LAISHA,LAISUN"
  years = "0001:0020:5",

  [[ atm_monthly_glb ]]
  input_subdir = "archive/atm/hist"
  input_files = "eam.h0"
  frequency = "monthly"
  mapping_file = "glb"
  years = "0001:0020:10",

  [[ lnd_monthly_glb ]]
  input_subdir = "archive/lnd/hist"
  input_files = "elm.h0"
  frequency = "monthly"
  mapping_file = "glb"
  vars = "FSH,RH2M,LAISHA,LAISUN"
  years = "0001:0020:10",

Error from /lcrc/group/e3sm/ac.forsyth2/zppy_p1_output/v3.LR.piControl/post/scripts/ts_atm_monthly_0001-0005-0005.o505229:

2024-04-19 19:12:28,039_039:INFO:cmorize:pr: creating CMOR variable with CMOR axis objects.
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0rc1_chrysalis/lib/python3.10/site-packages/e3sm_to_cmip/__main__.py", line 931, in _run_parallel
    out = res.result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0rc1_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0rc1_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
^M 50%|█████     | 1/2 [00:00<00:00,  5.38it/s]  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0rc1_chrysalis/lib/python3.10/site-packages/e3sm_to_cmip/__main__.py", line 931, in _run_parallel
    out = res.result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0rc1_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0rc1_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
^M100%|██████████| 2/2 [00:00<00:00, 10.74it/s]
forsyth2 commented 4 months ago

@chengzhuzhang was able to fix the error by setting mapping_file =map_ne30pg2_to_cmip6_180x360_aave.20200201.nc for ts atmosphere task. Important: e3sm_to_cmip needs regular lat-lon data as input (for atm, land variables)

forsyth2 commented 4 months ago

I created https://github.com/E3SM-Project/e3sm_to_cmip/issues/257