CDAT Migration: Refactor annual_cycle_zonal_mean set

chengzhuzhang commented 5 months ago

Description

Refactor annual_cycle_zonal_mean with xarray/xcdat Driver is pretty short and has unique _create_annual_cycle function

Closes #669

Checklist

[ ] My code follows the style guidelines of this project
[ ] I have performed a self-review of my own code
[ ] My changes generate no new warnings
[ ] Any dependent changes have been merged and published in downstream modules

If applicable:

[ ] New and existing unit tests pass with my changes (locally and CI/CD build)
[ ] I have added tests that prove my fix is effective or that my feature works
[ ] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[ ] I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

chengzhuzhang commented 4 months ago

Basic driver and plotting scripts are working. Through only with MULTIPROCESSING=FALSE, if switching it on, I hit errors as follows. It appears from ds = xc.open_mfdataset(**args) which is newly added to ready multi-months data to concatenate into annual cycle time series.

Traceback (most recent call last):
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_comm.py", line 493, in start_client
    s.connect((host, port))
TimeoutError: timed out
2024-03-22 15:14:20,314 [ERROR]: core_parameter.py(_run_diag:341) >> Error in e3sm_diags.driver.annual_cycle_zonal_mean_driver
Traceback (most recent call last):
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/parameter/core_parameter.py", line 338, in _run_diag
    single_result = module.run_diag(self)
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 68, in run_diag
    ds_test = test_ds.get_climo_dataset(var_key, "ANNUALCYCLE")
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/driver/utils/dataset_xr.py", line 365, in get_climo_dataset
    ds = self._get_climo_dataset(season)
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/driver/utils/dataset_xr.py", line 393, in _get_climo_dataset
    ds = self._open_annual_cycle_climo_dataset(filepath)
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/driver/utils/dataset_xr.py", line 425, in _open_annual_cycle_climo_dataset
    ds = xc.open_mfdataset(**args)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xcdat/dataset.py", line 277, in open_mfdataset
    ds = xr.open_mfdataset(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/backends/api.py", line 1053, in open_mfdataset
    combined = combine_by_coords(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 958, in combine_by_coords
    concatenated_grouped_by_data_vars = tuple(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 959, in <genexpr>
    _combine_single_variable_hypercube(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 630, in _combine_single_variable_hypercube
    concatenated = _combine_nd(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 232, in _combine_nd
    combined_ids = _combine_all_along_first_dim(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 267, in _combine_all_along_first_dim
    new_combined_ids[new_id] = _combine_1d(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 290, in _combine_1d
    combined = concat(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/concat.py", line 252, in concat
    return _dataset_concat(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/concat.py", line 526, in _dataset_concat
    merged_vars, merged_indexes = merge_collected(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/merge.py", line 290, in merge_collected
    merged_vars[name] = unique_variable(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/merge.py", line 137, in unique_variable
    out = out.compute()
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/variable.py", line 547, in compute
    return new.load(**kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/variable.py", line 520, in load
    loaded_data, *_ = chunkmanager.compute(self._data, **kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/daskmanager.py", line 70, in compute
    return compute(*data, **kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/dask/base.py", line 628, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/multiprocessing/context.py", line 281, in _Popen
    return Popen(process_obj)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydev_bundle/pydev_monkey.py", line 956, in new_fork
    _on_forked_process(setup_tracing=apply_arg_patch and not is_subprocess_fork)
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydev_bundle/pydev_monkey.py", line 232, in _on_forked_process
    pydevd.settrace_forked(setup_tracing=setup_tracing)
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3134, in settrace_forked
    settrace(
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2821, in settrace
    _locked_settrace(
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2902, in _locked_settrace
    py_db.connect(host, port)  # Note: connect can raise error.
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 1421, in connect
    s = start_client(host, port)
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_comm.py", line 493, in start_client
    s.connect((host, port))
TimeoutError: timed out
80.67s - Could not connect to 127.0.0.1: 49425

chengzhuzhang commented 4 months ago

Current results with one variable: https://portal.nersc.gov/cfs/e3sm/cdat-migration-fy24/669-annual_cycle_zonal_mean/viewer/

Other TODO items:

refine axis config for plot
fix viewer
Verify all variable runs

tomvothecoder commented 4 months ago

Basic driver and plotting scripts are working. Through only with MULTIPROCESSING=FALSE, if switching it on, I hit errors as follows. It appears from ds = xc.open_mfdataset(**args) which is newly added to ready multi-months data to concatenate into annual cycle time series.

Traceback (most recent call last):
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_comm.py", line 493, in start_client
    s.connect((host, port))
TimeoutError: timed out
2024-03-22 15:14:20,314 [ERROR]: core_parameter.py(_run_diag:341) >> Error in e3sm_diags.driver.annual_cycle_zonal_mean_driver
Traceback (most recent call last):
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/parameter/core_parameter.py", line 338, in _run_diag
    single_result = module.run_diag(self)
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 68, in run_diag
    ds_test = test_ds.get_climo_dataset(var_key, "ANNUALCYCLE")
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/driver/utils/dataset_xr.py", line 365, in get_climo_dataset
    ds = self._get_climo_dataset(season)
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/driver/utils/dataset_xr.py", line 393, in _get_climo_dataset
    ds = self._open_annual_cycle_climo_dataset(filepath)
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/driver/utils/dataset_xr.py", line 425, in _open_annual_cycle_climo_dataset
    ds = xc.open_mfdataset(**args)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xcdat/dataset.py", line 277, in open_mfdataset
    ds = xr.open_mfdataset(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/backends/api.py", line 1053, in open_mfdataset
    combined = combine_by_coords(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 958, in combine_by_coords
    concatenated_grouped_by_data_vars = tuple(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 959, in <genexpr>
    _combine_single_variable_hypercube(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 630, in _combine_single_variable_hypercube
    concatenated = _combine_nd(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 232, in _combine_nd
    combined_ids = _combine_all_along_first_dim(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 267, in _combine_all_along_first_dim
    new_combined_ids[new_id] = _combine_1d(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 290, in _combine_1d
    combined = concat(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/concat.py", line 252, in concat
    return _dataset_concat(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/concat.py", line 526, in _dataset_concat
    merged_vars, merged_indexes = merge_collected(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/merge.py", line 290, in merge_collected
    merged_vars[name] = unique_variable(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/merge.py", line 137, in unique_variable
    out = out.compute()
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/variable.py", line 547, in compute
    return new.load(**kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/variable.py", line 520, in load
    loaded_data, *_ = chunkmanager.compute(self._data, **kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/daskmanager.py", line 70, in compute
    return compute(*data, **kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/dask/base.py", line 628, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/multiprocessing/context.py", line 281, in _Popen
    return Popen(process_obj)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydev_bundle/pydev_monkey.py", line 956, in new_fork
    _on_forked_process(setup_tracing=apply_arg_patch and not is_subprocess_fork)
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydev_bundle/pydev_monkey.py", line 232, in _on_forked_process
    pydevd.settrace_forked(setup_tracing=setup_tracing)
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3134, in settrace_forked
    settrace(
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2821, in settrace
    _locked_settrace(
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2902, in _locked_settrace
    py_db.connect(host, port)  # Note: connect can raise error.
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 1421, in connect
    s = start_client(host, port)
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_comm.py", line 493, in start_client
    s.connect((host, port))
TimeoutError: timed out
80.67s - Could not connect to 127.0.0.1: 49425

This issue seems to be related to these:

Performance
- Related Xarray issue: https://github.com/pydata/xarray/issues/1385#issuecomment-1958761334
- Related xCDAT issue: https://github.com/xCDAT/xcdat/issues/641
- Workaround: use data_var="minimal", "coords="minimal" and compat=override
Conflicts with multiprocessing scheduler using context of fork when calling to_netcdf()
- https://github.com/pydata/xarray/issues/3781
- https://github.com/E3SM-Project/e3sm_diags/blob/f6c4fdfeabb6765c330b93bee8d18063efb1f9a6/e3sm_diags/e3sm_diags_driver.py#L304-L305
- Workaround: If using open_mfdataset(), call .load(scheduler="sync")

I'm currently debugging and will push fixes.

tomvothecoder commented 1 month ago

@chengzhuzhang you can pick this set back up. I did not make any progress since our last meeting on 4/15/24 (notes). Specifically, there is still a problem related to:

multiprocessing = True threw timeout error, fixed by loading multi-file dataset into memory (conflicts with Dask multiprocessing scheduler)

chengzhuzhang commented 1 month ago

viewer is fixed in 322

I can confirm with mutiprocessing on, it still ran into error:

2024-07-10 11:27:10,547 [ERROR]: run.py(run_diags:91) >> Error traceback:
Traceback (most recent call last):
File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/run.py", line 89, in run_diags
params_results = main(params)
File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/e3sm_diags_driver.py", line 371, in main
parameters_results = _run_with_dask(parameters)
File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/e3sm_diags_driver.py", line 316, in _run_with_dask
results = bag.map(CoreParameter._run_diag).compute(num_workers=num_workers)
File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/dask/base.py", line 342, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/dask/base.py", line 628, in compute
results = schedule(dsk, keys, **kwargs)
File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

A full run with all variables running in series also stopped midway

Errors also occur with 3 variables which are data specific:


2024-07-10 12:29:55,272 [INFO]: annual_cycle_zonal_mean_driver.py(run_diag:56) >> Variable: SCO
2024-07-10 12:30:46,299 [INFO]: annual_cycle_zonal_mean_driver.py(_run_diags_annual_cycle:124) >> Selected region: global
2024-07-10 12:30:50,654 [ERROR]: core_parameter.py(_run_diag:341) >> Error in e3sm_diags.driver.annual_cycle_zonal_mean_driver
TypeError: float() argument must be a string or a real number, not 'tuple'

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/parameter/core_parameter.py", line 338, in _run_diag single_result = module.run_diag(self) File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 76, in run_diag _run_diags_annual_cycle( File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 142, in _run_diags_annual_cycle test_zonal_mean = test_zonal_mean.sel(lat=(-60, 60)) File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1617, in sel ds = self._to_temp_dataset().sel( File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/dataset.py", line 3074, in sel query_results = map_index_queries( File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/indexing.py", line 193, in map_index_queries results.append(index.sel(labels, **options)) File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/indexes.py", line 748, in sel label_array = normalize_label(label, dtype=self.coord_dtype) File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/indexes.py", line 545, in normalize_label value = np.asarray(value, dtype=dtype) ValueError: setting an array element with a sequence. 2024-07-10 12:30:50,730 [INFO]: annual_cycle_zonal_mean_driver.py(run_diag:56) >> Variable: TCO 2024-07-10 12:31:24,528 [INFO]: annual_cycle_zonal_mean_driver.py(_run_diags_annual_cycle:124) >> Selected region: global 2024-07-10 12:31:26,916 [ERROR]: core_parameter.py(_run_diag:341) >> Error in e3sm_diags.driver.annual_cycle_zonal_mean_driver TypeError: float() argument must be a string or a real number, not 'tuple'

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/parameter/core_parameter.py", line 338, in _run_diag single_result = module.run_diag(self) File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 76, in run_diag _run_diags_annual_cycle( File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 142, in _run_diags_annual_cycle test_zonal_mean = test_zonal_mean.sel(lat=(-60, 60)) File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1617, in sel ds = self._to_temp_dataset().sel( File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/dataset.py", line 3074, in sel query_results = map_index_queries( File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/indexing.py", line 193, in map_index_queries results.append(index.sel(labels, *options)) File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/indexes.py", line 748, in sel label_array = normalize_label(label, dtype=self.coord_dtype) File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/indexes.py", line 545, in normalize_label value = np.asarray(value, dtype=dtype) ValueError: setting an array element with a sequence. 2024-07-10 12:31:26,916 [INFO]: annual_cycle_zonal_mean_driver.py(run_diag:56) >> Variable: SST 2024-07-10 12:32:03,765 [INFO]: annual_cycle_zonal_mean_driver.py(_run_diags_annual_cycle:124) >> Selected region: global 2024-07-10 12:32:06,626 [INFO]: io.py(_write_to_netcdf:134) >> 'SST' test variable output saved in: /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean/annual_cycle_zonal_mean/SST_CL_HadISST/HadISST_CL-SST-ANNUALCYCLE-global_test.nc 2024-07-10 12:32:06,778 [INFO]: io.py(_write_to_netcdf:134) >> 'SST' ref variable output saved in: /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean/annual_cycle_zonal_mean/SST_CL_HadISST/HadISST_CL-SST-ANNUALCYCLE-global_ref.nc 2024-07-10 12:32:06,783 [INFO]: io.py(_write_to_netcdf:134) >> 'SST' diff variable output saved in: /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean/annual_cycle_zonal_mean/SST_CL_HadISST/HadISST_CL-SST-ANNUALCYCLE-global_diff.nc 2024-07-10 12:32:06,783 [INFO]: io.py(_save_data_metrics_and_plots:66) >> Metrics saved in /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean/annual_cycle_zonal_mean/SST_CL_HadISST/HadISST_CL-SST-ANNUALCYCLE-global.json 2024-07-10 12:32:07,551 [ERROR]: core_parameter.py(_run_diag:341) >> Error in e3sm_diags.driver.annual_cycle_zonal_mean_driver Traceback (most recent call last): File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/parameter/core_parameter.py", line 338, in _run_diag single_result = module.run_diag(self) File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 76, in run_diag _run_diags_annual_cycle( File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 167, in _run_diags_annual_cycle _save_data_metrics_and_plots( File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/utils/io.py", line 81, in _save_data_metrics_and_plots plot_func(args) File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/plot/annual_cycle_zonal_mean_plot.py", line 67, in plot _add_colormap( File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/plot/annual_cycle_zonal_mean_plot.py", line 112, in _add_colormap var = var.transpose("lat", "time") File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/dataarray.py", line 3022, in transpose dims = tuple(utils.infix_dims(dims, self.dims, missing_dims)) File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/utils.py", line 814, in infix_dims existing_dims = drop_missing_dims(dims_supplied, dims_all, missing_dims) File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/utils.py", line 906, in drop_missing_dims raise ValueError( ValueError: Dimensions {'lat'} do not exist. Expected one or more of ('time', 'latitude')

chengzhuzhang commented 1 month ago

When set multi-processing=True, error concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. remains. Without loading dataset i.e. ds.load(scheduler="sync"), it has a TimeoutError.

chengzhuzhang commented 1 month ago

More update: The TimeOut error came from driver/utils/regrid.py

   ds_a_regrid = ds_a_new.regridder.horizontal(
        var_key, output_grid, tool=tool, method=method
    )

chengzhuzhang commented 1 month ago

https://github.com/E3SM-Project/e3sm_diags/pull/798/commits/f0a80c92109cd695533a10cca9943060679c98e0 and https://github.com/E3SM-Project/e3sm_diags/pull/798/commits/15811b8a120afc8533572930294b1f252459362c together address the the error: multiprocessing = True threw timeout error or concurrent.futures.process.BrokenProcessPool. It is fixed by first reduce dataset size and then load multi-file dataset into memory (the later is because Python's multiprociessing conflicts with Dask multiprocessing scheduler)
Regression testing caught an error in main which is addressed by https://github.com/E3SM-Project/e3sm_diags/pull/822
The regression results mostly matched except for the AODVIS variable, which the development branch use test data as both test and ref plots.

tomvothecoder commented 1 month ago

@chengzhuzhang Can you check if the file diff is correct? I accidentally rebased the branch on main then had to fix it to rebase on cdat-migration-fy24.

chengzhuzhang commented 1 month ago

@tomvothecoder the file diff looks okay, not sure if it's an exact match. Would it be useful if I force push my local change to remote? I also fixed the last issue with the AODVIS variable.

chengzhuzhang commented 1 month ago

@tomvothecoder I forced pushed in the update for AODVIS fix, which suggested the operations you had earlier was fine. And this PR should be ready for review!

tomvothecoder commented 1 month ago

@tomvothecoder I forced pushed in the update for AODVIS fix, which suggested the operations you had earlier was fine. And this PR should be ready for review!

Your force push overwrote my initial rebase and reintroduced the merge conflicts, which is okay. I squashed all of the commits into a single commit then rebased on cdat-migration-fy24.

tomvothecoder commented 1 month ago

Next steps:

[x] Re-run lat_lon on this branch to ensure nothing breaks with code change to subset climo variables
[x] Make sure integration tests pass

tomvothecoder commented 1 month ago

I re-ran lat_lon and performed the regression test with commit 863dce3 (#798). The results are good like before and the CI/CD build is passing.

Did you update re-run the set and regression test with the ADOVIS fix? I'm still seeing diffs with AODVIS and also ALBEDO.

The only thing left is addressing my comment above.

chengzhuzhang commented 1 month ago

I re-ran lat_lon and performed the regression test with commit 863dce3 (#798). The results are good like before and the CI/CD build is passing.

Did you update re-run the set and regression test with the ADOVIS fix? I'm still seeing diffs with AODVIS and also ALBEDO.

The only thing left is addressing my comment above.

Thank you for testing again! This is great news that the lat-lon regression test still pass. I will address the rest of the issues including AODVIS and ALBEDO. I may have missed this with mixed ref test comparison.

chengzhuzhang commented 1 month ago

In addressing the diff from regression test. I found that for climatology files generated with NCO, When using open_mfdataset with decode_times = False, the time dimension can be reordered or with time skipped unexpectedly. example:

import xcdat as xc
# example data 1: return 2 months of data but expect 12 months.
filepath = '/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/MERRA2_Aerosols/MERRA2_Aerosols_[0-1][0-9]_*climo.nc'
# example data 2: returns 12 months of data, but time coordinates reordered. 
filepath = '/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/AOD_550/AOD_550_[0-1][0-9]_*climo.nc'
args = {
            "paths": filepath,
            "decode_times": False,
            "add_bounds": ["X", "Y"],
            "coords": "minimal",
            "compat": "override",
            "chunks": "auto",
        }

ds = xc.open_mfdataset(**args)
ds

@tomvothecoder Not sure if you have some experience with this. Otherwise, I'm considering to explicitly concatenate data in time to ensure each climatology months in order (pseudocode below), though not entirely sure if the operations are lazy...

datasets = []
for path in paths:
    ds = xc.open_dataset(path)
    datasets.append(ds)
combined = xr.concat(datasets, dim='time')

chengzhuzhang commented 1 month ago

Including following code should work, though we need to think about how best to replace the open_mfdataset() call within dataset_xr.py

import glob
import xarray as xr
paths = sorted(glob.glob(args["paths"]))
ds_annual_cycle = []
for path in paths:
    print(path)
    ds_mon = xc.open_dataset(path, decode_times = False)
    ds_annual_cycle.append(ds_mon)
ds = xr.concat(ds_annual_cycle, dim='time')

tomvothecoder commented 1 month ago

In addressing the diff from regression test. I found that for climatology files generated with NCO, When using open_mfdataset with decode_times = False, the time dimension can be reordered or with time skipped unexpectedly. example:

I will try debugging to figure out the root cause of this issue.

tomvothecoder commented 1 month ago

Issue 1 - Return 2 months of data but expect 12 months

Including following code should work, though we need to think about how best to replace the open_mfdataset() call within dataset_xr.py
import glob
import xarray as xr
paths = sorted(glob.glob(args["paths"]))
ds_annual_cycle = []
for path in paths:
    print(path)
    ds_mon = xc.open_dataset(path, decode_times = False)
    ds_annual_cycle.append(ds_mon)
ds = xr.concat(ds_annual_cycle, dim='time')

Your implementation above addresses a model-specific data quality issue for that multi-file dataset. Your workaround is a possible option, but I think we might be able to implement a cleaner solution to handle this rare edge case.

Root cause: Dataset quality issue related to units being different across different, but the raw time values overlap. This causes time coordinate to collapse. When Xarray/xCDAT joins multiple datasets, it expects the datasets to have the same time units and correct time values relative to the units.

from glob import glob

import xarray as xr
import xcdat as xc

args = {
    "decode_times": False,
    "add_bounds": ["X", "Y"],
    "coords": "minimal",
    "compat": "override",
    "chunks": "auto",
}

filepath = '/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/MERRA2_Aerosols/MERRA2_Aerosols_[0-1][0-9]_*climo.nc'
paths = sorted(glob(filepath))

# filepath 1: '/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/MERRA2_Aerosols/MERRA2_Aerosols_01_198001_202101_climo.nc'
ds_fp1 = xc.open_mfdataset(paths[0], **args)
# filepath 2: '/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/MERRA2_Aerosols/MERRA2_Aerosols_02_198002_202102_climo.nc'
ds_fp2 = xc.open_mfdataset(paths[1], **args)

# Check the units -- different
# ----------------------------
# 'minutes since 1980-01-01 00:30:00'
ds_fp1.time.attrs["units"]
# 'minutes since 1980-02-01 00:30:00'
ds_fp2.time.attrs["units"]

# Check the time values -- same
# ----------------------------
# 10782720
ds_fp1.time.values[0]
# 10782720
ds_fp2.time.values[0]

Workaround/Fix: TBD, still working on this. We've come across this issue before while developing xCDAT. I need to find what the workaround was. Setting decode_times=True works, but downstream operations expect decode_times=False.
- In CDAT, the _create_annual_cycle() function replaces each dataset's time coordinate with a single integer representing the month. This works around the issue where datasets have different time units with overlapping time coordinates:
- https://github.com/E3SM-Project/e3sm_diags/blob/ff43931adac3dc0cc0471106760508ea28d3b8fe/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py#L133-L168

Issue 2 - Returns 12 months of data, but time coordinates reordered.

The time coordinates are re-ordered to be ascending, which I believe is the expected behavior. The units are all the same ("days since 2000-03-01") and the raw time coordinates are relative to the units.

# %%
from glob import glob

import xarray as xr
import xcdat as xc

args = {
    "decode_times": False,
    "add_bounds": ["X", "Y"],
    "coords": "minimal",
    "compat": "override",
    "chunks": "auto",
}

# %%
filepath = "/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/AOD_550/AOD_550_[0-1][0-9]_*climo.nc"
filepaths = glob(filepath)
ds = xc.open_mfdataset(filepaths, **args)

# array([ 15.5101,  46.03  ,  76.5498, 107.57  , 137.5895, 168.6097,
#       198.6292, 229.6494, 259.6689, 290.6891, 321.7142, 351.2334])
ds.time.values

# %%
ds1 = xc.open_mfdataset(filepaths[0], **args)
ds2 = xc.open_mfdataset(filepaths[1], **args)

# 'days since 2000-03-01'
ds1.time.units
# 'days since 2000-03-01'
ds2.time.units

chengzhuzhang commented 1 month ago

@tomvothecoder Thank you for troubleshooting. I was testing both datasets for the same variable, but the example 2 dataset should be retired (I replaced this dataset in lat_lon, but missed this instance in annual_cycle_zonal_mean). As you pointed out that the created datasets is correct, the first time step gives March mean. We could have a fix to align time (that should fix the plot which have x axis/ticks start from January). Since this dataset is retired, I think we should just focus on example 1 for now. (I should remember to update the main branch with new data in .cfg)

tomvothecoder commented 1 month ago

@tomvothecoder Thank you for troubleshooting. I was testing both datasets for the same variable, but the example 2 dataset should be retired (I replaced this dataset in lat_lon, but missed this instance in annual_cycle_zonal_mean). As you pointed out that the created datasets is correct, the first time step gives March mean. We could have a fix to align time (that should fix the plot which have x axis/ticks start from January). Since this dataset is retired, I think we should just focus on example 1 for now. (I should remember to update the main branch with new data in .cfg)

I just pushed a fix to issue 1 in this commit: 159cdf5 (#798).

It involves setting decode_times=True to properly concatenate time coordinates. I found that no downstream operations are affected with this change except the annual_cycle_zonal_mean plotter which uses the time coordinates for plotting. I had to update the plotter to extract the months to use as X axis values.

Also, I updated the comment above describing how CDAT replaces time coordinates with month integers in _create_annual_cycle() as a workaround to this issue.

chengzhuzhang commented 4 weeks ago

@tomvothecoder When testing with decode_times = False, I found that for example one, the decoded_time is just not right. For instance, for the January mean climatology file, the time was decoded as time (time) object 2000-07-02 00:30:00. I also found that time variant units is standard for ncclimo generated climatology files for model and obs data. Not sure why MERRA2_Aerosols stands out..

tomvothecoder commented 3 weeks ago

@tomvothecoder When testing with decode_times = False, I found that for example one, the decoded_time is just not right. For instance, for the January mean climatology file, the time was decoded as time (time) object 2000-07-02 00:30:00. I also found that time variant units is standard for ncclimo generated climatology files for model and obs data. Not sure why MERRA2_Aerosols stands out..

Did you mean decode_times=True? If so, I will take a closer look.

chengzhuzhang commented 3 weeks ago

Did you mean decode_times=True? If so, I will take a closer look.

Yes!
I think the climatology data can't be decoded correctly by cftime somehow.

tomvothecoder commented 3 weeks ago

@tomvothecoder When testing with decode_times = False, I found that for example one, the decoded_time is just not right. For instance, for the January mean climatology file, the time was decoded as time (time) object 2000-07-02 00:30:00. I also found that time variant units is standard for ncclimo generated climatology files for model and obs data. Not sure why MERRA2_Aerosols stands out..

I verified that cftime is decoding the time coordinates correctly. The issue is that the raw time coordinates are not correct relative to the "units" attribute (10782720, 'minutes since 1980-01-01 00:30:00'). The time axis is also missing the "calendar" attribute, with "standard" being subbed in as the default.

I don't think this was caught in the CDAT codebase because the _create_annual_cycle() function avoids this issue by opening each dataset individually, replacing the time coordinate with the month integer, then concatenating the datasets into a single dataset along the time axis.

Although I'm not a fan of a custom I/O function to handle data quality issues, we have to implement a function similar to _create_annual_cycle() as a workaround for this specific case.

`cftime` decoding -- `cftime.DatetimeGregorian(2000, 7, 2, 0, 30, 0, 0, has_year_zero=False)`

from glob import glob

import cftime
import xcdat as xc

args = {
    "add_bounds": ["X", "Y"],
    "coords": "minimal",
    "compat": "override",
    "chunks": "auto",
}

filepath = "/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/MERRA2_Aerosols/MERRA2_Aerosols_[0-1][0-9]_*climo.nc"
paths = sorted(glob(filepath))

# filepath 1: '/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/MERRA2_Aerosols/MERRA2_Aerosols_01_198001_202101_climo.nc'
ds_raw_time = xc.open_mfdataset(paths[0], **args, decode_times=False)

# 10782720
time_int = ds_raw_time.time.values.item()
# 'minutes since 1980-01-01 00:30:00'
units = ds_raw_time.time.units
# None, so "standard"
calendar = ds_raw_time.time.attrs.get("calendar", "standard")

# cftime.DatetimeGregorian(2000, 7, 2, 0, 30, 0, 0, has_year_zero=False)
cftime.num2date(time_int, units, calendar=calendar)

`datetime.datetime` decoding -- `datetime.datetime(2000, 7, 2, 0, 30)`

import datetime

first_step = datetime.datetime(1980, 1, 1, hour=0, minute=30)
time_delta = datetime.timedelta(minutes=10782720)

# datetime.datetime(2000, 7, 2, 0, 30)
print(first_step + time_delta)

tomvothecoder commented 3 weeks ago

Although I'm not a fan of a custom I/O function to handle data quality issues, we have to implement a function similar to _create_annual_cycle() as a workaround for this specific case.

Actually, the easier thing to do is to ignore the decoded time values since they aren't used and to assume the order is 1-12 (Jan to Dec) like what the CDAT code does. The main caveat is that the time coordinates must be in ascending order, which they are when opening the datasets in Xarray/xCDAT with decode_times=True.

The only change needed is to update time_months in the plotter to range(1, 13). https://github.com/E3SM-Project/e3sm_diags/blob/784404b58fe31438ed17383638c342ba3b6b79aa/e3sm_diags/plot/annual_cycle_zonal_mean_plot.py#L103-L108

chengzhuzhang commented 3 weeks ago

@tomvothecoder I was searching some code example that reads data using open_mfdatasets with specifying order of files: https://stackoverflow.com/questions/75241585/using-xarrays-open-mfdataset-to-open-a-series-of-nc-files

ds = xarray.open_mfdataset(
    [f'{i}.nc' for i in range(10)],
    concat_dim=[
        pd.Index(np.arange(10), name="new_dim"),
    ],
    combine="nested",

)

Though I think your solution actually works okay, given that decode_times=True actually had time coordinate in ascend order (even though the decoded month value doesn't match with the actually climatology month. Update time_months in the plotter to range(1, 13), again put back the correct month index. I will do another regression test to confirm.

chengzhuzhang commented 3 weeks ago

@tomvothecoder I'm retesting this set will all variables, and realize that the memory issue came back. Then I tested again with the commit which resolved the memory issue (https://github.com/E3SM-Project/e3sm_diags/commit/15811b8a120afc8533572930294b1f252459362c). No errors. Some changes between (f2c3568) and https://github.com/E3SM-Project/e3sm_diags/commit/15811b8a120afc8533572930294b1f252459362c brought back the issue. I doubted the decode_times is the cause though.

tomvothecoder commented 3 weeks ago

@tomvothecoder I'm retesting this set will all variables, and realize that the memory issue came back. Then I tested again with the commit which resolved the memory issue (15811b8). No errors. Some changes between (f2c3568) and 15811b8 brought back the issue. I doubted the decode_times is the cause though.

Besides the recent plotter update, decode_times=True is the only other change from commit 159cdf5 (#798). Maybe decoding times is introducing an overhead, although it should be lazy in xCDAT. Also if climatology files are being used, the number of time coordinates to decode should be minimal. More debugging needed here.

chengzhuzhang commented 3 weeks ago

To change back decode_times = False did not help. And sadly, some git history was emptied with a few force-pushs. I tried to revert to recent commits, the concurrent.futures.process.BrokenProcessPool: always occur. I kind of running out of debugging method.

chengzhuzhang commented 3 weeks ago

Not sure the best solution to continue troubleshooting, after ruling out the args change for open_mfdataset. But what I did is to swap the dataset_xr.py from commit 15811b8 into latest code. (I do need to edit slightly to make the code work, i.e. change CLIMO_FREQ to Climo_Freq). No memory issue. At least it narrows down the problematic file, and I suspect some changes made in other PRs being merged introduced memory problem. I'm stepping through the differs to see what might be the cause.

The file diff for dataset_xr.py is here https://www.diffchecker.com/mTw8AWif/

tomvothecoder commented 3 weeks ago

I was actually in the middle of debugging here with my comment. I resolved the multiprocessing issue, it was my fault :(

Issues I resolved in f9a9ea7 (#798)

Slow .load() performance and sometimes multiprocessing issue (concurrent.futures.process.BrokenProcessPool)
- Root cause: My mistake here and sorry for removing git history with rebasing. I accidentally committed incorrect logic for keep_bnds = [var for var in all_vars if "bnd" or "bounds" in var] which kept all variables in the dataset before .load().
- Solution: Update keep_bnds = [var for var in all_vars if "bnd" in var or "bounds" in var]
With decode_times=True, I get ValueError: 'months since' units only allowed for '360_day' calendar for the TCO and SCO reference variables when writing out to netCDF
- Root cause: The source dataset ('/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/OMI-MLS/OMI-MLS_01_200501_201701_climo.nc') has the units 'months since 2005-01-01 00:00:00' and is missing the "calendar" attribute ("standard" is used as a default). Once again, the CDAT code does not run into this issue because it replaces time coordinates with month integers.
- Solution: Added _encode_time_coords() to driver to encode time coordinates to month integers

tomvothecoder commented 3 weeks ago

I re-ran the regression test notebook with the latest commit. I am still getting the following diffs:

AODVIS

Comparing:
/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean-debug/annual_cycle_zonal_mean/AOD_550/AOD_550-AODVIS-ANNUALCYCLE-global_ref.nc 
 /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/annual_cycle_zonal_mean/AOD_550/AOD_550-AODVIS-Annual-Cycle_test.nc
AODVIS
var_key AODVIS

Not equal to tolerance rtol=1e-05, atol=0

Mismatched elements: 1808 / 2160 (83.7%)
Max absolute difference: 0.12250582
Max relative difference: 91.14554689
 x: array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],...
 y: array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],...

ALBEDO -- It just looks like `np.inf` is being used in xCDAT while `np.nan` is used with CDAT. I recall this happening in other regression tests. Replacing `np.inf` with `np.nan` resolves this issue and vice versa.

Comparing:
/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean-debug/annual_cycle_zonal_mean/CERES-EBAF-TOA-v4.1/ceres_ebaf_toa_v4.1-ALBEDO-ANNUALCYCLE-global_ref.nc 
 /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/annual_cycle_zonal_mean/CERES-EBAF-TOA-v4.1/ceres_ebaf_toa_v4.1-ALBEDO-Annual-Cycle_test.nc
ALBEDO
var_key ALBEDO

Not equal to tolerance rtol=1e-05, atol=0

x and y nan location mismatch:
 x: array([[0.69877 , 0.695266, 0.68627 , ...,      inf,      inf,      inf],
       [0.712032, 0.706896, 0.69354 , ...,      inf,      inf,      inf],
       [0.765447, 0.743142, 0.738787, ..., 0.752918, 0.751204, 0.833122],...
 y: array([[0.69877 , 0.695266, 0.68627 , ...,      nan,      nan,      nan],
       [0.712033, 0.706896, 0.69354 , ...,      nan,      nan,      nan],
       [0.765447, 0.743142, 0.738787, ..., 0.752918, 0.751204, 0.833123],...

chengzhuzhang commented 3 weeks ago

@tomvothecoder this is big relief! I skimed through the file several times and noticed the changed line keep_bnds = [var for var in all_vars if "bnd" or "bound" in var], but was not careful enough to catch the problem! No worries about the variable AODVIS. I will update the .cfg file to replace this obs source with two new data source.

tomvothecoder commented 3 weeks ago

I added a debug script for AODVIS that compares the max, min, sum, and mean. All of the values look close.

I think the max relative diff is large because the values are close to 0.

import numpy as np
import xcdat as xc

dev_path = "/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean-debug/annual_cycle_zonal_mean/AOD_550/AOD_550-AODVIS-ANNUALCYCLE-global_ref.nc"
main_path = "/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/annual_cycle_zonal_mean/AOD_550/AOD_550-AODVIS-Annual-Cycle_test.nc"

var_a = xc.open_dataset(dev_path)["AODVIS"]
var_b = xc.open_dataset(main_path)["AODVIS"]

"""
Floating point comparison

AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 1808 / 2160 (83.7%)
Max absolute difference: 0.12250582
Max relative difference: 91.14554689
 x: array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],...
 y: array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],...
"""
np.testing.assert_allclose(var_a, var_b)

# Get the max of all values
# -------------------------
# 0.28664299845695496
print(var_a.max().item())
# 0.2866430557436412
print(var_b.max().item())

# Get the min of all values
# -------------------------
# 0.0
print(var_a.min().item())
# 0.0
print(var_b.min().item())

# Get the sum of all values
# -------------------------
# 224.2569122314453
print(var_a.sum().item())
# 224.25691348856003
print(var_b.sum().item())

# Get the mean of all values
# -------------------------
# 0.10382264107465744
print(var_a.mean().item())
# 0.1038226451335926
print(var_b.mean().item())

# %%
# Get the max absolute diff
# -------------------------
# 0.12250582128763199
print((var_a - var_b).max().item())

chengzhuzhang commented 3 weeks ago

I think the max relative diff is large because the values are close to 0.

yeah, the values and metrics look all very close. Based on the plots i saw earlier, months were off. Anyway based on the comments from https://github.com/E3SM-Project/e3sm_diags/pull/624 I retired AODVIS from MACv1 in lat-lon, but missed the annual_cycle_zonal_mean set. I made the update https://github.com/E3SM-Project/e3sm_diags/pull/798/commits/7ba0900327640f8e9417d59aab05978922b38544.

chengzhuzhang commented 3 weeks ago

@tomvothecoder I think we can merge after CI/CD test is completed!

E3SM-Project / e3sm_diags