NCAS-CMS / cf-python

A CF-compliant Earth Science data analysis library
http://ncas-cms.github.io/cf-python
MIT License

CFA Virtualisation using CMIP6 example data: Unable to aggregate #793

Open dwest77a opened 2 months ago

dwest77a commented 2 months ago

Example CMIP6 data (JASMIN)

files = [
    '/badc/cmip6/data/CMIP6/ScenarioMIP/CNRM-CERFACS/CNRM-ESM2-1/ssp119/r1i1p1f2/3hr/huss/gr/v20190328/huss_3hr_CNRM-ESM2-1_ssp119_r1i1p1f2_gr_201501010300-203501010000.nc',
    '/badc/cmip6/data/CMIP6/ScenarioMIP/CNRM-CERFACS/CNRM-ESM2-1/ssp119/r1i1p1f2/3hr/huss/gr/v20190328/huss_3hr_CNRM-ESM2-1_ssp119_r1i1p1f2_gr_203501010300-205501010000.nc',
    '/badc/cmip6/data/CMIP6/ScenarioMIP/CNRM-CERFACS/CNRM-ESM2-1/ssp119/r1i1p1f2/3hr/huss/gr/v20190328/huss_3hr_CNRM-ESM2-1_ssp119_r1i1p1f2_gr_205501010300-207501010000.nc',
    '/badc/cmip6/data/CMIP6/ScenarioMIP/CNRM-CERFACS/CNRM-ESM2-1/ssp119/r1i1p1f2/3hr/huss/gr/v20190328/huss_3hr_CNRM-ESM2-1_ssp119_r1i1p1f2_gr_207501010300-209501010000.nc',
    '/badc/cmip6/data/CMIP6/ScenarioMIP/CNRM-CERFACS/CNRM-ESM2-1/ssp119/r1i1p1f2/3hr/huss/gr/v20190328/huss_3hr_CNRM-ESM2-1_ssp119_r1i1p1f2_gr_209501010300-210101010000.nc'
]

Attempted to aggregate the first two example files (successful)

import cf

f = cf.read(files)
g = cf.aggregate(f[:2])

A normal cf.write works properly here, creating a single combined netCDF file from the two inputs, but writing with cfa=True fails in one of two ways, depending on whether I take the whole of both Fields (116880 time steps):

RuntimeError: NetCDF: HDF error

or a subselection of the last 10 time steps from file 1 and the first 10 from file 2,

g = cf.aggregate([f[0][-10:], f[1][:10]])

which instead gives:

File "/home/users/dwest77/Documents/cfa_python_dw/cf_dw/cf_python/cf/read_write/netcdf/netcdfwrite.py", line 106, in _write_as_cfa
    raise ValueError(
ValueError: Can't write <CF Field: specific_humidity(time(20), latitude(128), longitude(256)) 1> as a CFA-netCDF aggregation variable. Consider setting cfa={'strict': False}
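
For reference, the write calls that produce these errors are of this form, where g is the aggregated field list from above (the output filenames are just illustrative):

# Plain netCDF write of the aggregation works fine:
cf.write(g, 'huss_combined.nc')

# Writing the same aggregation as a CFA-netCDF file is what fails:
cf.write(g, 'huss_combined_cfa.nc', cfa=True)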

Versions: cf-python 3.16.2 (latest), cfdm 1.11.1.0 (latest)

davidhassell commented 2 months ago

Hi Dan, short answer (because I'm going home!), try:

>>> f = cf.read(files, chunks=None)
>>> cf.write(f, 'cfa.nc', cfa=True)

I did this on your data on JASMIN and it worked OK.

Long answer and explanations to follow ...

dwest77a commented 2 months ago

I tried this exactly as you've stated, but I still get the same netCDF RuntimeError. FYI I'm using netCDF4==1.7.1.post1. I can add my whole conda package list here if needed. I'm off as well now!

davidhassell commented 2 months ago

Interesting. I was using netCDF4==1.6.5 when it worked fine, but I got a seg fault with 1.7.1.post1.

>>> cf.environment(paths=False)
Platform: Linux-3.10.0-1160.114.2.el7.x86_64-x86_64-with-glibc2.17
HDF5 library: 1.12.2
netcdf library: 4.9.3-development
udunits2 library: ~/miniconda3/lib/libudunits2.so.0
esmpy/ESMF: not available
Python: 3.12.2
dask: 2024.7.0
netCDF4: 1.6.5
psutil: 5.9.8
packaging: 23.1
numpy: 1.26.4
scipy: 1.12.0
matplotlib: not available
cftime: 1.6.3
cfunits: 3.3.7
cfplot: not available
cfdm: 1.11.1.0
cf: 3.16.2
>>>
davidhassell commented 2 months ago

netCDF4==1.7.0 works for me, too, but I notice that 1.7.0 and 1.7.1 have both been yanked (https://pypi.org/project/netCDF4/#history) for some reason. Could this be related to https://github.com/Unidata/netcdf4-python/issues/1343?

dwest77a commented 2 months ago

A couple of questions about the above: is the HDF5 library installed as part of h5py, or does it require a separate non-Python library to be installed? Otherwise I'll just pin the h5py and netCDF4 versions in my environment and make a note of it. It looks like the versions fall out of sync simply through a lack of coordination.
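
In case it's useful for comparing environments, I guess a quick way to see which C-library versions each Python binding was built against is something like this (just a sketch using the version attributes that netCDF4 and h5py expose):

import h5py
import netCDF4

# Versions of the Python bindings themselves
print('netCDF4 module:', netCDF4.__version__)
print('h5py module:   ', h5py.__version__)

# Versions of the C libraries the bindings were compiled against
print('netCDF-C (via netCDF4):', netCDF4.__netcdf4libversion__)
print('HDF5 (via netCDF4):    ', netCDF4.__hdf5libversion__)
print('HDF5 (via h5py):       ', h5py.version.hdf5_version)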

dwest77a commented 2 months ago

I've downgraded netCDF4 to 1.6.5 and also adjusted my scipy and numpy versions to match yours. It looked like I was making progress, because a file of about 6 MB appeared, but after 4-5 minutes the process exited with the same error as before (Can't write aggregated variable ...) and the file disappeared.

Note: immediately rerunning this process took only 10 seconds to reach the same error, so I think those 4-5 minutes were spent fetching the data (if that's even supposed to happen here?).

davidhassell commented 2 months ago

Hi Dan, I just defer to netCDF4 to install the correct and consistent netCDF-C and HDF5 libraries, and that has, for many years, just worked ...

Strange about your results - the write took ~1 minute for me. Are you using the C libraries installed by the python packages?

dwest77a commented 2 months ago

I haven't done any extra steps to install alternative C libraries, so I would assume yes, although I wouldn't know how to check.
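
Would something along these lines be a sensible check? Just a sketch (Linux only) that runs ldd on the compiled netCDF4 extension to see which libnetcdf/libhdf5 shared objects it actually links against:

import subprocess
from netCDF4 import _netCDF4

# Path of the compiled netCDF4 extension module
print(_netCDF4.__file__)

# Shared libraries it links against (look for the libnetcdf and libhdf5 entries)
print(subprocess.run(['ldd', _netCDF4.__file__], capture_output=True, text=True).stdout)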

dwest77a commented 2 months ago

My current environment setup, for reference:

asciitree==0.3.3
binpacking==1.5.2
ceda-elasticsearch-client==0.0.1
certifi==2024.7.4
cftime==1.6.4
cfunits==3.3.7
click==8.1.7
cloudpickle==3.0.0
dask==2024.7.0
elastic-transport==8.13.1
elasticsearch==8.14.0
fasteners==0.19
h5py==3.11.0
kerchunk==0.2.5
locket==1.0.0
mypy-extensions==1.0.0
netcdf-flattener==1.2.0
netCDF4==1.6.5
numcodecs==0.12.1
numpy==1.26.4
pandas==2.2.2
partd==1.4.2
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
rechunker==0.5.2
scipy==1.12.0
tabulate==0.9.0
toolz==0.12.1
tzdata==2024.1
ujson==5.10.0
zarr==2.18.2

-e git+ssh://git@github.com/NCAS-CMS/cf-python.git@ca69ad166109e1eba4d4fb816af41b8058fcaa10#egg=cf_python
-e git+ssh://git@github.com/NCAS-CMS/cfdm.git@4106b448adf87ccef7c5285ac8624daf60f9956b#egg=cfdm
-e git+ssh://git@github.com/fsspec/filesystem_spec.git@262f664574e091228251b467ac92b2a6c327034b#egg=fsspec
-e git+ssh://git@github.com/cedadev/padocc.git@72e8e3538bd8ffe335c900a4f718e998a8ec9a7a#egg=pipeline
-e git+ssh://git@github.com/dwest77a/xarray.git@bef04067dd87f9f0c1a3ae7840299e0bbdd595a8#egg=xarray