Closed forsyth2 closed 7 months ago
Thanks for the clear description. I've crossed out the commits that I know wouldn't affect e3sm_diags between v2.9.0 and v2.10.0.
Can you provide me the standalone command for the e3sm_diags task? I will try stepping through the code on v2.10.0 up to where it breaks.
Can you provide me the standalone command for the e3sm_diags task?
Hmm I'm trying to figure out exactly how to condense that down. There's a lot of auto-generation (and NCO dependencies) that come first.
The relevant parts of the cfg
(excluding any climo
/ts
dependencies) would be:
So, that ends up generating a bash
file, like this:
So, the command is really srun -n 1 python -u e3sm.py
, but there's a whole auto-generated e3sm.py
.
Ah, ok, I think this would be the e3sm.py
to try out. You still might need to run NCO to pre-process the data (i.e., what the ts
task dependency does) though.
The other tests for zppy
are fine. (Except for the bundles run, which also runs into the ILAMB error. That test doesn't check the environment_commands
setting, so it wouldn't catch the other error).
Re: the ILAMB error, I ran with environment_commands
set to use the E3SM Unified 1.9.1 version of ILAMB, and it worked fine. So, something changed on ILAMB between versions 2.6 and 2.7.
(Note that my note on https://github.com/E3SM-Project/zppy/issues/523#issuecomment-1847672126 is to see if ILAMB 2.7 expands the zppy output and/or allows a simpler cfg. I can't really check that until I have 2.7 working in the first place).
1b
I then changed
tests/integration/utils.py
to use the version of E3SM Diags that was used in the other E3SM Diags jobs for this run. That is,"diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.2rc2_chrysalis.sh"
, which is the sameenvironment_commands
all the other jobs used. Looking at acme-climate.atlassian.net/wiki/spaces/DOC/pages/129732419/Packages+in+the+E3SM+Unified+conda+environment#e3sm-unified-1.9.2, that looks like that would be E3SM Diags v2.10.0 (E3SM-Project/e3sm_diags@0b7f9c7).
Okay I figured out the root cause of the e3sm_diags
results_dir
issue.
This commit E3SM-Project/e3sm_diags@48426aa
(#755) changes if type(p) == cls_type
to if isinstance(p, cls_type)
. Apparently, both of these conditionals are not the same (source) and can return different boolean values. I changed if type(p) == cls_type
because it is considered bad practice (Flake8 E721).
As a result, in v2.10.0, the results_dir
config is not being copied from the first parameter (param
) to the second parameter (qbo_param
). This causes results_dir
to be blank which cascades to the FileNotFoundError: [Errno 2] No such file or directory: ''
and FileNotFoundError: [Errno 2] No such file or directory: '/prov/e3sm_diags_run.log'
.
The fix is to change if isinstance(p, cls_type)
to if type(p) is cls_type
. I will open a separate PR for this and get a new e3sm_diags RC release out.
This brings my idea again about testing these tools together, outside of E3SM Unified. We should consider more frequent releases and periodic testing before E3SM Unified releases. It would really cut down potential bugs appearing with E3SM Unified releases at the last second. It's not a good idea to try to rush out new package releases for "emergency" E3SM Unified releases, especially if the packages have a lot of changes.
Also, the way e3sm_diags
copies parameter attributes around from different sources (parameter objects, parsers, cfg, core_parameter, etc.) is really convoluted and seems unnecessarily complex. I wish it wasn't so fragile and hard to work with.
1a
tests/integration/utils.py
had"diags_environment_commands": "source /home/ac.forsyth2/miniconda3/etc/profile.d/conda.sh; conda activate e3sm_diags_20231221"
, meaning it ran using a conda dev environment built off the latestmain
of E3SM Diags (E3SM-Project/e3sm_diags@9e14ff8)===== RUN E3SM DIAGS ===== 2023-12-21 14:25:12,848 [ERROR]: run.py(run_diags:90) >> Error traceback: Traceback (most recent call last): File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_d\ iags/run.py", line 88, in run_diags params_results = main(params) File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_d\ iags/e3sm_diags_driver.py", line 363, in main os.makedirs(parameters[0].results_dir, 0o755) File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/os.py", line 225, in\ makedirs mkdir(name, mode) FileNotFoundError: [Errno 2] No such file or directory: '' Traceback (most recent call last): File "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.histori\ cal_0201/post/scripts/tmp.447022.wCBq/e3sm.py", line 53, in <module> runner.run_diags(params) File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_d\ iags/run.py", line 92, in run_diags move_log_to_prov_dir(params_results[0].results_dir) UnboundLocalError: local variable 'params_results' referenced before assignment [WARNING] yaksa: 10 leaked handle pool objects srun: error: chr-0496: task 0: Exited with exit code 1
Since this case technically tests an unreleased version of E3SM Diags, I suppose this is fine to ignore for now.
Just an FYI that this run fails in v2.9.0 too, but the log file saves which makes it seems like it was working. UPDATE: Actually I didn't use NCO so it might be failing when I use the test e3sm.py
script as a result.
On main
, it fails but the log file does not save (UnboundLocalError: local variable 'params_results' referenced before assignment
)
v2.9.0 output -- log file saves
2023-12-22 17:10:18,864 [INFO]: run.py(_add_parent_attrs_to_children:152) >> ['diff_title', 'short_test_name', 'num_workers', 'results_dir']
2023-12-22 17:10:25,355 [INFO]: e3sm_diags_driver.py(_save_env_yml:59) >> Saved environment yml file to: model_vs_obs_1850-1851/prov/environment.yml
2023-12-22 17:10:25,356 [INFO]: e3sm_diags_driver.py(_save_parameter_files:70) >> Saved command used to: model_vs_obs_1850-1851/prov/cmd_used.txt
2023-12-22 17:10:25,358 [INFO]: e3sm_diags_driver.py(_save_python_script:134) >> Saved Python script to: model_vs_obs_1850-1851/prov/ipykernel_launcher.py
2023-12-22 17:10:26,954 [ERROR]: e3sm_diags_driver.py(run_diag:296) >> Error in e3sm_diags.driver.qbo_driver
Traceback (most recent call last):
File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/e3sm_diags_driver.py", line 293, in run_diag
single_result = module.run_diag(parameter)
File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/driver/qbo_driver.py", line 173, in run_diag
test_var = test_data.get_timeseries_variable(variable)
File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/driver/utils/dataset.py", line 98, in get_timeseries_variable
variables = self._get_timeseries_var(data_path, *args, **kwargs)
File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/driver/utils/dataset.py", line 469, in _get_timeseries_var
vars_to_func_dict = self._get_first_valid_vars_timeseries(
File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/driver/utils/dataset.py", line 560, in _get_first_valid_vars_timeseries
raise RuntimeError(msg)
RuntimeError: Neither does U nor the variables in [('ua',), ('U',)] have valid files in ts.
2023-12-22 17:10:26,959 [WARNING]: e3sm_diags_driver.py(main:426) >> There was not a single valid diagnostics run, no viewer created.
2023-12-22 17:10:26,960 [ERROR]: run.py(run_diags:37) >> Error traceback:
Traceback (most recent call last):
File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/run.py", line 35, in run_diags
main(final_params)
File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/e3sm_diags_driver.py", line 445, in main
if parameters_results[0].fail_on_incomplete and (
IndexError: list index out of range
2023-12-22 17:10:26,965 [INFO]: logger.py(move_log_to_prov_dir:106) >> Log file saved in model_vs_obs_1850-1851/prov/e3sm_diags_run.log
As a result, in v2.10.0, the
results_dir
config is not being copied from the first parameter (param
) to the second parameter (qbo_param
).
Wow, that is a very insidious bug. So, it's just returning p
right away because qbo_param
is-a param
? But we really want it to return p
when we get to the exact type?
I will open a separate PR for this and get a new e3sm_diags RC release out.
Awesome, thanks!!
testing these tools together, outside of E3SM Unified.
I absolutely agree. I should update the testing process as follows:
environment_commands
for the entire E3SM Diags section to use a dev environment created off the latest E3SM Diags main
branch. Then, zppy
will always test the latest diags. [#539]It's not a good idea to try to rush out new package releases for "emergency" E3SM Unified releases, especially if the packages have a lot of changes.
That's true. https://nvie.com/posts/a-successful-git-branching-model/ suggests merging patches into the user-facing releases AND the latest development branch. That is, only the bug fixes should be getting merged into user-facing code between the non-patch releases; we shouldn't be doing whole new releases of packages according to this particular workflow ideal.
the way e3sm_diags copies parameter attributes around from different sources (parameter objects, parsers, cfg, core_parameter, etc.) is really convoluted and seems unnecessarily complex.
I agree it's very convoluted. I haven't studied it enough to know if there's a simpler way to accomplish the same thing. zppy
has similar issues of parameter convolution (i.e., parameters being defined in different hierarchical sections of the cfg
and some parameters being introduced internally via the <task>.py
and the <task>.bash
templates. The .settings
file at least shows the final values for each parameter a job uses).
I want to make sure I'm clear on how param_results
is used. It sounds similar to param.fail_on_incomplete
.
The latter, when set to True
, will fail E3SM Diags if any set didn't complete. Otherwise, it will succeed as long as something makes it into the viewer.
It sounds like the former does something similar, but is not a parameter passed in by a user. It will have a value if all sets completed and otherwise will not. Is that right?
Wow, that is a very insidious bug. So, it's just returning
p
right away becauseqbo_param
is-aparam
? But we really want it to returnp
when we get to the exact type?
Yeah we only want to return p
if it is exact same type and not a sub-class/sub-type.
https://github.com/conda-forge/e3sm_diags-feedstock/commit/79944cdcbe340cfe4af74cb7c44d82dc35d0f16d
I want to make sure I'm clear on how
param_results
is used. It sounds similar toparam.fail_on_incomplete
.The latter, when set to
True
, will fail E3SM Diags if any set didn't complete. Otherwise, it will succeed as long as something makes it into the viewer.It sounds like the former does something similar, but is not a parameter passed in by a user. It will have a value if all sets completed and otherwise will not. Is that right?
params_results
is different. It's just a variable containing params
but AFTER successful diagnostic runs. If diagnostic runs fail, params_results
is never set which causes UnboundLocalError: local variable 'params_results' referenced before assignment
. This causes e3sm_diags
to crash. param.fail_on_incomplete
sounds like it will stop e3sm_diags
on the first instance of a failure.
Notice in the code below that params_results
has no default value and does not get assigned if the try:
statement failed.
params = self.get_run_parameters(parameters, use_cfg)
if params is None or len(params) == 0:
raise RuntimeError(
"No parameters we able to be extracted. Please "
"check the parameters you defined."
)
try:
params_results = main(params)
except Exception:
logger.exception("Error traceback:", exc_info=True)
move_log_to_prov_dir(params_results[0].results_dir)
return params_results
I updated this logic in https://github.com/E3SM-Project/e3sm_diags/pull/770/files so that params_results
has a default value (None
) and move_log_to_prov_dir()
uses params
(not params_results
because it can be None
with failed runs).
e3sm_diags v2.10.1rc1 is now released with these fixes: https://github.com/conda-forge/e3sm_diags-feedstock/commit/79944cdcbe340cfe4af74cb7c44d82dc35d0f16d
Great, thanks @tomvothecoder!
Re: Error 2, I made https://github.com/rubisco-sfa/ILAMB/issues/85.
@forsyth2 and @chengzhuzhang, it seems like we probably need to run ilamb-fetch
on Chrysalis (or Anvil) in the appropriate directory, probably:
/lcrc/group/e3sm/diagnostics/ilamb_data
Re: Error 2, the issue does appear to be from not running ilamb-fetch
. See https://github.com/E3SM-Project/zppy/discussions/541
Request criteria
Issue description
Testing
zppy
on Chrysalis, using E3SM Unified 1.9.2rc2, I run into the following errors on thecomplete_run
run. The errors appear to be similar on Perlmutter. Please note that there is not a newzppy
release for E3SM Unified 1.9.2. That is, these errors are occuring on azppy
version that was previously tested (for E3SM Unified 1.9.1).Error 1
e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_*
:This is the job that makes sure the
environment_commands
parameter is working properly.1a
tests/integration/utils.py
had"diags_environment_commands": "source /home/ac.forsyth2/miniconda3/etc/profile.d/conda.sh; conda activate e3sm_diags_20231221"
, meaning it ran using a conda dev environment built off the latestmain
of E3SM Diags (https://github.com/E3SM-Project/e3sm_diags/commit/9e14ff81a658208631cda54516b4af67ae7541f3)Since this case technically tests an unreleased version of E3SM Diags, I suppose this is fine to ignore for now.
1b
I then changed
tests/integration/utils.py
to use the version of E3SM Diags that was used in the other E3SM Diags jobs for this run. That is,"diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.2rc2_chrysalis.sh"
, which is the sameenvironment_commands
all the other jobs used. Looking at https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/129732419/Packages+in+the+E3SM+Unified+conda+environment#e3sm-unified-1.9.2, that looks like that would be E3SM Diags v2.10.0 (https://github.com/E3SM-Project/e3sm_diags/commit/0b7f9c7f7581ceaed260c3baa5104f8e448ffd3f).Since this case tests the E3SM Diags version that is included in the upcoming Unified release, we should address this error.
1c
I then changed
tests/integration/utils.py
to use the version of E3SM Diags that was used in the other E3SM Diags jobs for this run. That is,"diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh"
, which uses the latest official release of E3SM Unified. Looking at https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/129732419/Packages+in+the+E3SM+Unified+conda+environment#e3sm-unified-1.9.1, that looks like that would be E3SM Diags v2.9.0 (https://github.com/E3SM-Project/e3sm_diags/commit/a2d00eb114cf505a4ecf4dd429bb8710778cc087).This works fine. That is expected since
zppy
had previously tested this version of E3SM Diags when we did the release for E3SM Unified 1.9.1.Potential sources of the bugs
1c -> 1b bug:
1b bug -> 1a bug:
params_results
is not set (only set if successful)Error 2
ilamb_*
:I'm not sure if this is a bug with ILAMB itself or if there is simply a missing dataset (e.g., a dataset was deleted or the new version of ILAMB requires one that I'm not pointing to).