Things to try next (a sketch of both is shown after this list):
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"
-> https://github.com/pydata/xarray/issues/2376#issuecomment-419309486
lock=False for remap_seaice_sgs()
-> https://github.com/E3SM-Project/e3sm_to_cmip/blob/0f91fb4abc839e2ec04fe8e770ad7308a41c6620/e3sm_to_cmip/mpas.py#L44-L74
@tomvothecoder
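A minimal sketch of how both workarounds could be applied, assuming the reads go through xarray's open_mfdataset (as remap_seaice_sgs() in mpas.py appears to do); the filenames below are placeholders:

```python
import os

# Disable HDF5 file locking *before* any HDF5-backed library (netCDF4, h5py)
# is imported, otherwise the setting may not take effect.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import xarray as xr

# Pass lock=False so xarray does not serialize access behind a per-file lock.
# In e3sm_to_cmip this would mean adding lock=False to the open_mfdataset
# call used by remap_seaice_sgs(); "mpassi.hist.*.nc" and "out.nc" are
# placeholder names for illustration only.
ds = xr.open_mfdataset("mpassi.hist.*.nc", lock=False)
ds.to_netcdf("out.nc")
```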
Given a file containing one or more CMIP6 dataset_ids, the script:
/home/bartoletti1/gitrepo/datasm/datasm/scripts/dsm_generate_CMIP6_dsid_list_2.sh
will take "WORK" and that file as its positional parameters and, for each dataset_id, produce
tmp/<case_id>/scripts/<dataset_id>-Generate_CMIP6.sh
which is the "minimum script" needed to conduct end-to-end ([NCO] + e3sm_to_cmip) processing.
NOTE: The parent script (dsm_generate_CMIP6_dsid_list_2.sh) requires that you have
export DSM_GETPATH=/p/user_pub/e3sm/staging/Relocation/.dsm_get_root_path.sh
in your .bashrc, so that the various datasm tools that gather the dataset_id-specific parameters (data location, proper mapfile, etc.) can be found. The resulting subordinate script, however, contains fully-expressed command lines (with ALL paths fully qualified) and can be run from anywhere. The results are placed into various subdirectories under tmp/<case_id>.

ADDENDUM: If you substitute "TEST" for "WORK" as the first parameter, only 1 year of data will be processed, and the resulting output files will not be moved into the warehouse (they will remain under tmp/<case_id>).
Near the top of the parent script, you can set "dryrun=1", in which case the generated subordinate scripts are only produced (left ready to run), not executed.
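For concreteness, a sketch of the invocation described above, driven from Python; the dataset_id list filename is a placeholder, and the argument order is assumed from the description rather than verified against the script:

```python
import os
import subprocess

PARENT = "/home/bartoletti1/gitrepo/datasm/datasm/scripts/dsm_generate_CMIP6_dsid_list_2.sh"
DSID_LIST = "my_cmip6_dsid_list.txt"  # placeholder: one CMIP6 dataset_id per line

# The parent script needs DSM_GETPATH in its environment (normally exported
# from ~/.bashrc, as noted above).
env = dict(os.environ,
           DSM_GETPATH="/p/user_pub/e3sm/staging/Relocation/.dsm_get_root_path.sh")

# "WORK" = full end-to-end processing; substitute "TEST" to process only
# 1 year of data and leave the output under tmp/<case_id>.
subprocess.run(["bash", PARENT, "WORK", DSID_LIST], env=env, check=True)
```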
What happened?
Occasional, indefinite hang on "to_netcdf". No errors or exceptions are raised.
What did you expect to happen? Are there any possible answers you came across?
Similar behaviors have been noted in:
https://github.com/pydata/xarray/issues/4710
“Most of the time, this command works just fine. But in 30% of the cases, this would just... stop and stall. One or more of the workers would simply stop working without coming back or erroring.”
and then:
If you run this once, it's typically fine. But run it over and over again in a loop, and it'll eventually hang on mfd.to_netcdf. However if I set lock=False then it runs fine every time.
This seems tied to an ongoing discussion over whether HDF5 is or is not thread-safe and, correspondingly, whether locking is or is not necessary.
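A sketch of the looping pattern described in that comment, with placeholder filenames (not the exact reproducer from pydata/xarray#4710):

```python
import xarray as xr

# Repeatedly open and re-write a multi-file dataset. Per the comment quoted
# above, a single pass is usually fine, but looping eventually hangs inside
# to_netcdf(); passing lock=False reportedly avoids the hang.
for i in range(50):
    mfd = xr.open_mfdataset("input_*.nc")   # placeholder input pattern
    mfd.to_netcdf(f"output_{i}.nc")         # the hang is reported here
    mfd.close()
```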
Minimal Complete Verifiable Example (MVCE)
Relevant log output
Anything else we need to know?
The salient history of local discussion, oldest to newest:
Mar 13, 1:07 PM
Environment
 populated config files : /home/bartoletti1/mambaforge/.condarc
          conda version : 24.1.2
    conda-build version : not installed
         python version : 3.10.6.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=broadwell
                          __conda=24.1.2=0
                          __glibc=2.17=0
                          __linux=3.10.0=0
                          __unix=0=0
       base environment : /home/bartoletti1/mambaforge  (writable)
      conda av data dir : /home/bartoletti1/mambaforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /home/bartoletti1/mambaforge/pkgs
                          /home/bartoletti1/.conda/pkgs
       envs directories : /home/bartoletti1/mambaforge/envs
                          /home/bartoletti1/.conda/envs
               platform : linux-64
             user-agent : conda/24.1.2 requests/2.31.0 CPython/3.10.6 Linux/3.10.0-1160.108.1.el7.x86_64 rhel/7.9 glibc/2.17 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.7
                UID:GID : 61843:4061
             netrc file : None
           offline mode : False