TonyB9000 opened this issue 7 months ago
All MPAS processing is generating these errors (Omon and SImon). A summary:
Compressed stack traces:
Common code-path prefix (applies to the stack frames below): "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local/lib/python3.12/site-packages/"
TEST: CMIP6.CMIP.E3SM-Project.E3SM-2-1.piControl.r1i1p1f1.SImon.siv.gr
2024-04-16 20:26:22,381_381:INFO:handle:Starting siv
e3sm_to_cmip/__main__.py", line 834, in _run_serial [ name = handler_method( ]
e3sm_to_cmip/cmor_handlers/mpas_vars/siv.py", line 65, in handle [ ds = mpas.add_time(ds, dsIn) ]
e3sm_to_cmip/mpas.py", line 161, in add_time [ xtimeEnd = ["".join(x.astype("U")).strip() for x in xtimeEnd.values] ]
xarray/core/dataarray.py", line 770, in values [ return self.variable.values ]
xarray/core/variable.py", line 508, in values [ return _as_array_or_item(self._data) ]
xarray/core/variable.py", line 306, in _as_array_or_item [ data = np.asarray(data) ]
dask/array/core.py", line 1700, in __array__ [ x = self.compute() ]
dask/base.py", line 375, in compute [ (result,) = compute(self, traverse=False, **kwargs) ]
dask/base.py", line 661, in compute [ results = schedule(dsk, keys, **kwargs) ]
xarray/core/indexing.py", line 484, in __array__ [ return np.asarray(self.get_duck_array(), dtype=dtype) ]
xarray/core/indexing.py", line 487, in get_duck_array [ return self.array.get_duck_array() ]
xarray/core/indexing.py", line 664, in get_duck_array [ return self.array.get_duck_array() ]
xarray/core/indexing.py", line 551, in get_duck_array [ array = self.array[self.key] ]
xarray/backends/netCDF4_.py", line 101, in __getitem__ [ return indexing.explicit_indexing_adapter( ]
xarray/core/indexing.py", line 858, in explicit_indexing_adapter [ result = raw_indexing_method(raw_key.tuple) ]
xarray/backends/netCDF4_.py", line 114, in _getitem [ array = getitem(original_array, key) ]
(FOLLOWED BY)
File "src/netCDF4/_netCDF4.pyx", line 4948, in netCDF4._netCDF4.Variable.__getitem__
File "src/netCDF4/_netCDF4.pyx", line 4551, in netCDF4._netCDF4.Variable.shape.__get__
File "src/netCDF4/_netCDF4.pyx", line 4499, in netCDF4._netCDF4.Variable._getdims
File "src/netCDF4/_netCDF4.pyx", line 2034, in netCDF4._netCDF4._ensure_nc_success
NetCDF: Not a valid ID
[-] Error running handlers: siv
0%| | 0/1 [38:48<?, ?it/s]
TEST: CMIP6.CMIP.E3SM-Project.E3SM-2-1.piControl.r1i1p1f1.Omon.thetao.gr.log
(SAME stack trace as above, then …)
(FOLLOWED BY)
File "src/netCDF4/_netCDF4.pyx", line 4972, in netCDF4._netCDF4.Variable.__getitem__
File "src/netCDF4/_netCDF4.pyx", line 5930, in netCDF4._netCDF4.Variable._get
File "src/netCDF4/_netCDF4.pyx", line 2034, in netCDF4._netCDF4._ensure_nc_success
Resource temporarily unavailable
[-] Error running handlers: thetao
Error 11: Resource temporarily unavailable
Error 11: Resource temporarily unavailable
Error 11: Resource temporarily unavailable
Error 11: Resource temporarily unavailable
Error 11: Resource temporarily unavailable
0%| | 0/1 [00:05<?, ?it/s]
@TonyB9000 the Minimal Complete Verifiable Example (MVCE) no longer works: when I check, I only find eam data in /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/native_data. When creating this example, you may want to isolate the test data so others can test. But for this issue, I can test with some v2 data available on acme1.
Alright, I can reproduce this with 12 months of data using v1.11.2:
e3sm_to_cmip -s --realm SImon --var-list siv --map /p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc --input-path /p/user_pub/e3sm/zhang40/e2c_tony/e2c_test_data/v2.mpassi_input/ --output-path /p/user_pub/e3sm/zhang40/e2c_tony/tests --user-metadata /p/user_pub/e3sm/zhang40/e2c_tony/e2c_test_data/holodeck/input/historical_r1i1p1f1.json --tables-path /p/user_pub/e3sm/staging/resource/cmor/cmip6-cmor-tables/Tables
I suspect that the "Resource temporarily unavailable" error comes from the use of "lock=False" (https://github.com/E3SM-Project/e3sm_to_cmip/commit/0f91fb4abc839e2ec04fe8e770ad7308a41c6620); when testing v1.11.1, things work okay. With lock=True, large operations sometimes hang, while "lock=False" causes "Resource temporarily unavailable" for some variables.
@tomvothecoder I don't know if this is specific to acme1's file system, but should we consider reverting to 1.11.1 for the upcoming e3sm_unified deployment? I think let's decide after a more thorough investigation.
> I suspect that the "Resource temporarily unavailable" error is from the use of "lock=False" 0f91fb4. And when testing v1.11.1, things work okay. With lock=True, large operations sometimes hang, while "lock=False" causes "Resource temporarily unavailable" for some variables.
I think you might be right about lock=False causing "Resource temporarily unavailable", possibly due to resource contention with multiprocessing.
Related issues:
* [NETCDF - Resource temporarily unavailable dask/dask#2657](https://github.com/dask/dask/issues/2657)
* [netCDF4 Resource unavailable error pydata/xarray#4304](https://github.com/pydata/xarray/issues/4304)

> @tomvothecoder I don't know if this is specific to acme1's file system, but should we consider reverting to 1.11.1 for the upcoming e3sm_unified deployment? I think let's decide after a more thorough investigation.
I agree that more investigation is needed before deciding what to do next. v1.11.2 includes another fix for parallel=False to avoid "NetCDF: Not a Valid ID" with open_mfdataset(), which we need to keep too.
My understanding is that Tony is still using acme1 and datasm with e3sm_to_cmip. Is e3sm_to_cmip used by anybody else in E3SM Unified, besides zppy? If zppy tests pass, then we have the option of keeping v1.11.2 for now with the risk of this bug coming up. It sounds like Xylar is pressing to move forward with E3SM Unified v1.10.0, so I don't know how much time we have.
Minimal Complete Verifiable Example (MVCE)
Command line issued (directly at console): e3sm_to_cmip -v siv -u /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/metadata/historical_r4i1p1f1.json -t /p/user_pub/e3sm/staging/resource/cmor/cmip6-cmor-tables/Tables -o /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/product -i /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/native_data -s --realm SImon --map /p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc >> Local_log 2>&1 &
For future reference, you can use a Python script to invoke e3sm_to_cmip. This makes debugging much easier by enabling the use of the Python debugger. I included examples in the documentation here: https://e3sm-to-cmip.readthedocs.io/en/latest/development-guide.html
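For illustration only, a minimal driver might look like the sketch below. It assumes that main() in e3sm_to_cmip.__main__ accepts an argv-style list, so please confirm the exact entry point against the development guide; all paths here are placeholders.

```python
# Hypothetical debugging driver for e3sm_to_cmip. The main() entry point and
# all paths are assumptions; confirm against the development guide linked above.
from e3sm_to_cmip.__main__ import main

args = [
    "-v", "siv",
    "-u", "/path/to/metadata/historical_r4i1p1f1.json",   # placeholder path
    "-t", "/path/to/cmip6-cmor-tables/Tables",            # placeholder path
    "-o", "/path/to/product",                             # placeholder path
    "-i", "/path/to/native_data",                         # placeholder path
    "-s",
    "--realm", "SImon",
    "--map", "/path/to/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc",
]

if __name__ == "__main__":
    # Running this file under `python -m pdb driver.py` lets you step into the
    # handlers (e.g. cmor_handlers/mpas_vars/siv.py) where the failures occur.
    main(args)
```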
I would assume this issue occurs with multiprocessing (serial=False), but I'm not 100% sure. (I just noticed the log output says that it is serial.)
Thanks for your insight. Based on Tony's recent operation, I think the zppy application, which only involves processing atmosphere and land variables, is not impacted. So perhaps we can take the time to resolve this ocean- and sea-ice-related error on the side.
Sounds good. I will convert Tony's command into a Python script and try to do some debugging today.
@tomvothecoder @chengzhuzhang Just a side note - more than half of the v2_1 MPAS jobs (run with datasm back in February) completed successfully. So, I was going to examine the generation dates to see if the first appearance of failures corresponds to any changes I (or others) may have made to e2c.
I am interested in the Python driver you (Tom) mentioned, and will investigate.
@chengzhuzhang Jill wrote "When creating this example, you may want to isolate the test data so others can test."
Sorry about that. Yes, I neglected to isolate that data for retesting.
@tomvothecoder @chengzhuzhang Just for the record, we are running into the same errors on Chrysalis - so this is not limited to acme1.
KEY FACT: Circa February 10, both Omon and SImon jobs (v2_1 data) were completing successfully (via datasm postprocess). Soon thereafter, sporadic failures became the norm. I have maintained all of the runlogs for the 296 (v2_1) MPAS jobs, of which 26 are consistently failing. I will seek to characterize the failure dates and types, to see if I can isolate a date upon which to focus.
I would suggest reverting the change that sets lock=False. If I understand correctly, this change causes all MPAS variables to fail, whether on acme1 or Chrysalis.
Turns out, lock=False was set on both acme1 and chrysalis. Setting it to "True" on chrysalis led to a small change at the end of the stack trace:
File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/backends/netCDF4_.py", line 112, in _getitem
with self.datastore.lock:
0%| | 0/1 [00:00<?, ?it/s]
'bool' object does not support the context manager protocol
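That message makes sense given how the lock is used: xarray's netCDF4 backend enters it as a context manager (the `with self.datastore.lock:` line above), and a bare bool does not implement that protocol. A standalone illustration, not e3sm_to_cmip code:

```python
from threading import Lock

# A real lock object supports the context manager protocol...
lock = Lock()
with lock:
    pass  # critical section

# ...but a bare boolean does not, which reproduces the failure mode above
# (on Python 3.11+ the message is the one quoted in the log).
try:
    with True:
        pass
except TypeError as err:
    print(err)
```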
With "lock" parameter elided, the v2_1 1pctCO2 Omon.sosga completes successfully, as it had in February. But Omon.so fails. The stack-trace:
2024-04-29 15:13:07,374 [INFO]: so.py(handle:47) >> Starting so
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/e3sm_to_cmip/__main__.py", line 831, in _run_serial
name = handler_method(
^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/e3sm_to_cmip/cmor_handlers/mpas_vars/so.py", line 68, in handle
ds = mpas.remap(ds, 'mpasocean', mappingFileName)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 87, in remap
write_netcdf(ds, inFileName, unlimited="time")
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 366, in write_netcdf
ds.to_netcdf(fileName, encoding=encodingDict, unlimited_dims=unlimited)
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/core/dataset.py", line 2298, in to_netcdf
return to_netcdf( # type: ignore # mypy cannot resolve the overloads:(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/backends/api.py", line 1339, in to_netcdf
dump_to_store(
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/backends/api.py", line 1386, in dump_to_store
store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/backends/common.py", line 393, in store
variables, attributes = self.encode(variables, attributes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/backends/common.py", line 482, in encode
variables, attributes = cf_encoder(variables, attributes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/conventions.py", line 795, in cf_encoder
new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/conventions.py", line 196, in encode_cf_variable
var = coder.encode(var, name=name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/coding/variables.py", line 332, in encode
fill_value = pop_to(encoding, attrs, "_FillValue", name=name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/coding/variables.py", line 219, in pop_to
safe_setitem(dest, key, value, name=name)
File "/home/bartoletti1/mambaforge/envs/dsm_prod_inst_e2c_local2/lib/python3.12/site-packages/xarray/coding/variables.py", line 200, in safe_setitem
raise ValueError(
0%| | 0/1 [00:23<?, ?it/s]
failed to prevent overwriting existing key _FillValue in attrs on variable 'so'. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
[-] Error running handlers: so
I ran the same tests on Chrysalis (v2_1 data, 1pctCO2, Omon.sosga and Omon.so), and both failed - but for different reasons. sosga (which passed on acme1) failed with:
2024-04-30 10:43:23,692 [ERROR]: mpas.py(write_cmor:451) >> Error in cmor.write for sosga
Traceback (most recent call last):
File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/__main__.py", line 823, in _run_serial
name = handler_method(
^^^^^^^^^^^^^^^
TypeError: handle() takes 3 positional arguments but 5 were given
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 444, in write_cmor
cmor.write(
File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/cmor/pywrapper.py", line 834, in write
return _cmor.write(var_id, data, data_type, file_suffix, ntimes_passed,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_cmor.CMORError: Problem with 'cmor.write'. Please check the logfile (if defined).
2024-04-30 10:43:23,692 [ERROR]: mpas.py(write_cmor:451) >> Error in cmor.write for sosga
Traceback (most recent call last):
File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/__main__.py", line 823, in _run_serial
name = handler_method(
^^^^^^^^^^^^^^^
TypeError: handle() takes 3 positional arguments but 5 were given
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 444, in write_cmor
cmor.write(
File "/home/ac.bartoletti1/anaconda3/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/cmor/pywrapper.py", line 834, in write
return _cmor.write(var_id, data, data_type, file_suffix, ntimes_passed,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_cmor.CMORError: Problem with 'cmor.write'. Please check the logfile (if defined).
100%|██████████| 1/1 [00:19<00:00, 19.45s/it]
Converting sosga to float32
The test with "so" failed as well, but appears identical to the acme1 failure ("failed to prevent overwriting existing key _FillValue in attrs on variable 'so'").
Here is the cmor_logs logfile output for the failed sosga on chrysalis. FWIW - I harbor high enmity for this logfile format.
As I wrote in slack, I will ensure that the installed cmor tables are up-to-date on both acme1 and chrysalis, then rerun both of the v2_1 Omon.sosga and Omon.so tests on each. I will also make sure that the tables-path is reiterated in the main log list of parameters applied.
Hi @TonyB9000, for further troubleshooting, do you have intermediate files saved (i.e. before cmorization) for the variables that are having this _FillValue issue, for example for so from v2_1.LR.1pctCO2_0101.mpaso? Do you have an e3sm_to_cmip command line for reproducing this issue?
Also, please submit a PR that elides "lock=False"; I believe this fixes the concurrency issue but leaves the _FillValue issue.
Did the PR, and the merge (tests successful - whatever they are).
I'll run a test of the failed _FillValue issue - but e2c "hides" the interim regridded stuff in volatile "temp" directories. I'll see if they can be retained.
@chengzhuzhang I just ran:
dsm_generate_CMIP6_dsid_list_2.sh TEST to_test_cmip6_v2_1_1pctCO2_Omon_so
which creates a script that issues:
e3sm_to_cmip -v so -u /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/metadata/1pctCO2_r1i1p1f1.json -t /p/user_pub/e3sm/staging/resource/cmor/cmip6-cmor-tables/Tables -o /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/product -i /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data -s --realm Omon --map /p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc >> $the_sublog 2>&1
The content of the "sublog" (/p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/caselogs/20240515_195553_552629-CMIP6.CMIP.E3SM-Project.E3SM-2-1.1pctCO2.r1i1p1f1.Omon.so.gr.sublog) reads:
20240515_195605_598:INFO:e3sm_to_cmip.util:find_mpas_files:find_mpas_files: component = mpaso, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
2024-05-15 19:56:05,598 [INFO]: util.py(find_mpas_files:497) >> find_mpas_files: component = mpaso, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
2024-05-15 19:56:05,598 [INFO]: util.py(find_mpas_files:497) >> find_mpas_files: component = mpaso, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
20240515_195605_598:INFO:e3sm_to_cmip.util:find_mpas_files:results found: 12 items
2024-05-15 19:56:05,598 [INFO]: util.py(find_mpas_files:507) >> results found: 12 items
2024-05-15 19:56:05,598 [INFO]: util.py(find_mpas_files:507) >> results found: 12 items
20240515_195605_599:INFO:e3sm_to_cmip.util:find_mpas_files:find_mpas_files: component = mpas_mesh, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
2024-05-15 19:56:05,599 [INFO]: util.py(find_mpas_files:497) >> find_mpas_files: component = mpas_mesh, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
2024-05-15 19:56:05,599 [INFO]: util.py(find_mpas_files:497) >> find_mpas_files: component = mpas_mesh, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
20240515_195605_599:INFO:e3sm_to_cmip.util:find_mpas_files:component mpas_mesh found: v2_1.LR.1pctCO2_0101.mpaso.rst.0002-01-01_00000.nc
2024-05-15 19:56:05,599 [INFO]: util.py(find_mpas_files:547) >> component mpas_mesh found: v2_1.LR.1pctCO2_0101.mpaso.rst.0002-01-01_00000.nc
2024-05-15 19:56:05,599 [INFO]: util.py(find_mpas_files:547) >> component mpas_mesh found: v2_1.LR.1pctCO2_0101.mpaso.rst.0002-01-01_00000.nc
20240515_195605_600:INFO:e3sm_to_cmip.util:find_mpas_files:find_mpas_files: component = mpas_map, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
2024-05-15 19:56:05,600 [INFO]: util.py(find_mpas_files:497) >> find_mpas_files: component = mpas_map, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
2024-05-15 19:56:05,600 [INFO]: util.py(find_mpas_files:497) >> find_mpas_files: component = mpas_map, path = /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101/native_data
20240515_195605_615:INFO:e3sm_to_cmip.util:find_mpas_files:component mpas_map found: /p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc
2024-05-15 19:56:05,615 [INFO]: util.py(find_mpas_files:556) >> component mpas_map found: /p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc
2024-05-15 19:56:05,615 [INFO]: util.py(find_mpas_files:556) >> component mpas_map found: /p/user_pub/e3sm/staging/resource/maps/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc
2024-05-15 19:56:05,615 [INFO]: so.py(handle:48) >> Starting so
2024-05-15 19:56:05,615 [INFO]: so.py(handle:48) >> Starting so
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/__main__.py", line 836, in _run_serial
name = handler_method(
^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/cmor_handlers/mpas_vars/so.py", line 69, in handle
ds = mpas.remap(ds, 'mpasocean', mappingFileName)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 93, in remap
write_netcdf(ds, inFileName, unlimited="time")
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/e3sm_to_cmip/mpas.py", line 373, in write_netcdf
ds.to_netcdf(fileName, encoding=encodingDict, unlimited_dims=unlimited)
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/core/dataset.py", line 2298, in to_netcdf
return to_netcdf( # type: ignore # mypy cannot resolve the overloads:(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/backends/api.py", line 1339, in to_netcdf
dump_to_store(
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/backends/api.py", line 1386, in dump_to_store
store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/backends/common.py", line 393, in store
variables, attributes = self.encode(variables, attributes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/backends/common.py", line 482, in encode
variables, attributes = cf_encoder(variables, attributes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/conventions.py", line 795, in cf_encoder
new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/conventions.py", line 196, in encode_cf_variable
var = coder.encode(var, name=name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/coding/variables.py", line 332, in encode
fill_value = pop_to(encoding, attrs, "_FillValue", name=name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/coding/variables.py", line 219, in pop_to
safe_setitem(dest, key, value, name=name)
File "/home/bartoletti1/mambaforge/envs/dsm_prod_e2c_local/lib/python3.12/site-packages/xarray/coding/variables.py", line 200, in safe_setitem
raise ValueError(
0%| | 0/1 [00:19<?, ?it/s]
failed to prevent overwriting existing key _FillValue in attrs on variable 'so'. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
[-] Error running handlers: so
The most recent elements of my "TMPDIR" (/p/user_pub/e3sm/bartoletti1/tmp) are:
-rw-------. 1 bartoletti1 publishers 27493 May 9 05:45 tmpfhs53hel
-rw-------. 1 bartoletti1 publishers 0 May 15 12:56 tmpz89h3sws
-rw-------. 1 bartoletti1 publishers 239 May 15 12:56 tmpzgznny1v
and ncdump /p/user_pub/e3sm/bartoletti1/tmp/tmpzgznny1v gives:
netcdf tmpzgznny1v {
}
@chengzhuzhang Presently, the only variables exhibiting these errors are a subset of the Omon variables (so, sos, thetao, thkcello, tos, volcello), and ONLY for 1pctCO2 and abrupt-4xCO2.
The "job directory" created for this job (all 1pctCO2 vars), when listed chronologically, shows that only the last 3 were touched:
(see /p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.1pctCO2_0101)
drwxrwxr-x. 2 bartoletti1 publishers 4096 Apr 3 12:41 rgr_fixed_vars
drwxrwxr-x. 2 bartoletti1 publishers 4096 Apr 3 12:41 rgr_vert
drwxrwxr-x. 2 bartoletti1 publishers 4096 Apr 3 12:41 metadata
drwxrwxr-x. 3 bartoletti1 publishers 4096 Apr 3 16:57 zhold_product_20240403
-rw-rw-r--. 1 bartoletti1 publishers 24818 Apr 4 10:01 My_own_log
drwxrwxr-x. 3 bartoletti1 publishers 4096 Apr 4 17:10 zhold_product_20240404
-rw-rw-r--. 1 bartoletti1 publishers 0 Apr 5 11:03 try_log
drwxrwxr-x. 2 bartoletti1 publishers 4096 Apr 5 11:03 logs
-rw-rw-r--. 1 bartoletti1 publishers 151 Apr 9 14:13 findlog
drwxrwxr-x. 2 bartoletti1 publishers 4096 Apr 12 10:28 native_out
drwxrwxr-x. 2 bartoletti1 publishers 4096 Apr 12 10:28 rgr
drwxrwxr-x. 3 bartoletti1 publishers 4096 May 10 12:31 product
drwxrwxr-x. 2 bartoletti1 publishers 262144 May 15 12:55 native_data
drwxrwxr-x. 2 bartoletti1 publishers 262144 May 15 13:04 caselogs
drwxrwxr-x. 4 bartoletti1 publishers 4096 May 15 13:13 scripts
Thank you @TonyB9000! Could you open permissions for these files?
-rw-------. 1 bartoletti1 publishers 27493 May 9 05:45 tmpfhs53hel
-rw-------. 1 bartoletti1 publishers 0 May 15 12:56 tmpz89h3sws
-rw-------. 1 bartoletti1 publishers 239 May 15 12:56 tmpzgznny1v
Everything in /p/user_pub/e3sm/bartoletti1/tmp should be accessible now.
> @chengzhuzhang Presently, the only variables exhibiting these errors are a subset of the Omon variables (so, sos, thetao, thkcello, tos, volcello), and ONLY for 1pctCO2 and abrupt-4xCO2.
I'm trying to visualize native MPAS output through uxarray, but it doesn't seem to work out of the box following their docs. @xylar We ran into the _FillValue issue only with two experiments. I can't help wondering whether it is related to the recent coupled-group work on the model crashing in a warmer climate. Do you have a script I can use to do a quick view of MPAS data?
@xylar Never mind. I got the script working with uxarray. Checking native data now...
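For reference, a rough sketch of the kind of uxarray check I'm doing. The file names are placeholders (only the restart file name appears in the log above), so adjust them to the actual native data:

```python
import uxarray as ux

# Placeholder file names: an MPAS-Ocean mesh/restart file and one monthly
# history file from the 1pctCO2 case.
grid_path = "v2_1.LR.1pctCO2_0101.mpaso.rst.0002-01-01_00000.nc"
data_path = "v2_1.LR.1pctCO2_0101.mpaso.hist.am.timeSeriesStatsMonthly.0001-01-01.nc"

# uxarray pairs the unstructured MPAS mesh with the data defined on it.
uxds = ux.open_dataset(grid_path, data_path)

# Inspect one of the native variables e2c uses, including its attributes,
# to confirm whether _FillValue is present.
salinity = uxds["timeMonthly_avg_activeTracers_salinity"]
print(salinity.attrs)
```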
@TonyB9000 Definitely, we should consider breaking large 3D variables into year chunks... I see a 316 GB file with 6000 time steps, so the tmp files at this point aren't really helpful. Could you try testing with the variable tos (a 2D variable), running e3sm_to_cmip every 10 years (e.g. 1-10, 11-20, ...), to see when the crash happens?
@chengzhuzhang The last run (depicted above with "2024-05-15 19:56:05,615 [INFO]: so.py(handle:48) >> Starting so") was given only 1 year of data to run with. (That is the ONLY way I know of to limit the years and file sizes. e3sm_to_cmip has no command-line option to limit the years. You must create external code that calls e2c sequentially on segments of data. That is what the CWL workflows were doing...)
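For what it's worth, a sketch of how such external segmenting might look. Everything here is hypothetical: the paths, the monthly-file naming pattern, and the year range; the mesh/restart, region-file, and namefile symlinks that e2c also needs in the input directory are omitted for brevity:

```python
# Hypothetical sketch: stage one decade of native monthly files at a time into
# a scratch input directory (via symlinks) and call e3sm_to_cmip per decade.
import glob
import os
import subprocess

NATIVE = "/path/to/v2_1.LR.1pctCO2_0101/native_data"   # placeholder
SCRATCH = "/path/to/chunked_input"                      # placeholder
CMD = ["e3sm_to_cmip", "-s", "--realm", "Omon", "--var-list", "tos",
       "--map", "/path/to/map_EC30to60E2r2_to_cmip6_180x360_aave.20220301.nc",
       "-u", "/path/to/1pctCO2_r1i1p1f1.json",
       "-t", "/path/to/cmip6-cmor-tables/Tables"]

for start in range(1, 151, 10):  # decades 1-10, 11-20, ...
    chunk_dir = os.path.join(SCRATCH, f"years_{start:04d}_{start + 9:04d}")
    os.makedirs(chunk_dir, exist_ok=True)
    # NOTE: the mpaso restart/mesh file, region file, and namefile symlinks
    # also need to be present in chunk_dir (omitted here).
    for year in range(start, start + 10):
        # Assumed monthly history file pattern; adjust to the real naming.
        pattern = f"{NATIVE}/*.mpaso.hist.am.timeSeriesStatsMonthly.{year:04d}-*.nc"
        for path in glob.glob(pattern):
            link = os.path.join(chunk_dir, os.path.basename(path))
            if not os.path.exists(link):
                os.symlink(path, link)
    out_dir = os.path.join(SCRATCH, f"product_{start:04d}")
    subprocess.run(CMD + ["-i", chunk_dir, "-o", out_dir], check=False)
```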
@chengzhuzhang I am doing another run, hopefully with more log results.
@TonyB9000 I think I found the problem...
@xylar it looks like I need to bug you after all... I think the hiccup in e3sm_to_cmip Tony has been experiencing is caused by the additional _FillValue attribute, present ONLY in the 1pctCO2 and abrupt-4xCO2 v2.1 simulations. I'm listing the affected attributes below:
timeMonthly_avg_layerThickness:_FillValue = 9.96921e+36f
timeMonthly_avg_normalVelocity:_FillValue = 9.96921e+36f
timeMonthly_avg_activeTracers_temperature:_FillValue = 9.96921e+36f
timeMonthly_avg_activeTracers_salinity:_FillValue = 9.96921e+36f
I think these are the native variables used to derive the Omon variables (so, sos, thetao, thkcello, tos, volcello). So the question is: any clue why we suddenly have _FillValues in these 2 experiments? I think originally we just used the cell mask; with the presence of _FillValue, how should we modify e2c?
@chengzhuzhang @xylar Long Long Ago, in a Galaxy Far Far Away . . . we addressed a similar issue by editing the native data with ncatted:
(in /p/user_pub/e3sm/bartoletti1/Operations/8_Corrections/history/FIX_FillValue/remove_FillValue_attribute.sh)
for afile in `cat $filelist`; do
    ncatted -a _FillValue,,d,,, $target_path/$afile
    echo "cleaned $target_path/$afile"
done
But I don't know if this past solution applies (I think that was a "_FillValue=NaNf" issue)
@chengzhuzhang and @TonyB9000, we added fill values to as many variables as we could to increase CF compliance about a year and a half ago: https://github.com/E3SM-Project/E3SM/pull/5325. These fill values should be present in all E3SM v3 simulations, but I guess you haven't had to post-process any of those yet? I'm not sure why the 4xCO2 and 1pctCO2 are the only v2.1 simulations to show this behavior. I would expect this in all v2.1 and later simulations. Are these the first v2.1 simulations you have processed?
Anyway, I'm happy to look into fixing this. I suspect the fix is simply to drop the attribute if it is present, as the error message above suggests. I would do this in the handlers rather than with ncatted because that saves an extra step and some headache.
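Something along these lines, not the actual patch (the helper name is made up), just to illustrate the approach:

```python
import xarray as xr

def strip_fill_value_attrs(ds: xr.Dataset) -> xr.Dataset:
    """Hypothetical helper: drop _FillValue left in variable attrs by the model.

    xarray treats _FillValue as an encoding field; if the same key is also in
    .attrs, the CF encoder refuses to overwrite it, which is the error seen above.
    """
    for name in list(ds.variables):
        ds[name].attrs.pop("_FillValue", None)
        # If a fill value should still be written, set it via encoding instead:
        # ds[name].encoding["_FillValue"] = 9.96921e+36
    return ds
```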
@xylar Thank you for clarifying and for the quick fix. We have not started processing v3, and were able to finish the other v2.1 processing, including historical, piControl and amip. I'm still curious why ONLY 1pctCO2 and abrupt-4xCO2 have the _FillValue added.
I have no idea. I would expect all v2.1 runs with ocean and sea-ice output to have this issue. The feature was merged before the v2.1 tag was created.
What happened?
The e3sm_to_cmip command leads to a cascade of xarray errors, beginning with the cmor_handlers/mpas_vars/siv.py (ds.compute()) call.
What did you expect to happen? Are there any possible answers you came across?
n/a
Minimal Complete Verifiable Example (MVCE)
CLI command
Python script
Relevant log output
Anything else we need to know?
The input directory (/p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/native_data) contains symlinks to the actual native data, as well as to the region file EC30to60E2r2_mocBasinsAndTransects20210623.nc (/p/user_pub/e3sm/staging/resource/maps/EC30to60E2r2_mocBasinsAndTransects20210623.nc) and to the namefile mpassi_in (/p/user_pub/e3sm/warehouse/E3SM/2_1/historical/LR/sea-ice/native/namefile/fixed/ens4/v0/mpassi_in), as is required for MPAS cmorizing.
Environment
populated config files : /home/bartoletti1/mambaforge/.condarc
conda version : 24.1.2
conda-build version : not installed
python version : 3.10.6.final.0
solver : libmamba (default)
virtual packages : __archspec=1=broadwell, __conda=24.1.2=0, __glibc=2.17=0, __linux=3.10.0=0, __unix=0=0
base environment : /home/bartoletti1/mambaforge (writable)
conda av data dir : /home/bartoletti1/mambaforge/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/conda-forge/linux-64, https://conda.anaconda.org/conda-forge/noarch
package cache : /home/bartoletti1/mambaforge/pkgs, /home/bartoletti1/.conda/pkgs
envs directories : /home/bartoletti1/mambaforge/envs, /home/bartoletti1/.conda/envs
platform : linux-64
user-agent : conda/24.1.2 requests/2.31.0 CPython/3.10.6 Linux/3.10.0-1160.108.1.el7.x86_64 rhel/7.9 glibc/2.17 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.7
UID:GID : 61843:4061
netrc file : None
offline mode : False