Thomas-Moore-Creative / NCI-ACCESS-S2-ARD

progress towards analysis ready data (ARD) for the ACCESS-S2 collection at NCI
GNU General Public License v3.0

Loading Reanalysis data // Find solution to failing xr.open_mfdataset #1

Closed Thomas-Moore-Creative closed 3 years ago

Thomas-Moore-Creative commented 3 years ago

Attempting to build tools to load ACCESS-S2 datasets on NCI at /g/data/ux62/access-s2/reanalysis/ocean/

Loading years 1981-2009 works fine:

import os
import re

import xarray as xr

# Collect the yearly files for 1981-2009
file_list = []
regexp = re.compile(r'mo_td_(198[1-9]|199[0-9]|200[0-9])\.nc')
ROOT_DIR = '/g/data/ux62/access-s2/reanalysis/ocean/td/'
for root, dirs, files in os.walk(ROOT_DIR):
    for file in files:
        if regexp.search(file):
            file_list.append(os.path.join(root, file))

file_list.sort()
file_list
ds_1981_2009 = xr.open_mfdataset(file_list, parallel=True)
ds_1981_2009

And loading 2010-2018 works fine:

# Start from an empty list so only the 2010-2018 files are collected
file_list = []
regexp = re.compile(r'mo_td_(201[0-8])\.nc')
ROOT_DIR = '/g/data/ux62/access-s2/reanalysis/ocean/td/'
for root, dirs, files in os.walk(ROOT_DIR):
    for file in files:
        if regexp.search(file):
            file_list.append(os.path.join(root, file))

file_list.sort()
file_list
ds_2010_2018 = xr.open_mfdataset(file_list, parallel=True)
ds_2010_2018

But loading across the full 1981-2018 range in one call (or merging the two datasets above) results in killed workers and failures.

Thomas-Moore-Creative commented 3 years ago

I think I found the issue, and I’m 99% sure a pre-processing step will make it possible to lazily load an entire ACCESS-S2 reanalysis variable using Python.

It seems that for most (possibly all) variables, the 2015 NetCDF file in the NCI collection is missing some variables and associated dimensions that the files for other years contain. For example, compare the ncdump headers of the following files, looking for the ncorners dimension and the lat_bounds and lon_bounds variables:

ncdump -h /g/data/ux62/access-s2/reanalysis/ocean/u/mo_u_1981.nc
ncdump -h /g/data/ux62/access-s2/reanalysis/ocean/u/mo_u_2014.nc
ncdump -h /g/data/ux62/access-s2/reanalysis/ocean/u/mo_u_2015.nc
ncdump -h /g/data/ux62/access-s2/reanalysis/ocean/u/mo_u_2016.nc
ncdump -h /g/data/ux62/access-s2/reanalysis/ocean/u/mo_u_2018.nc

Only 2015 appears different: missing the bounds variables and extra dimensions.
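
The same comparison can be scripted rather than read off ncdump output by eye. The following is only a sketch: `read_structure` and `find_outliers` are hypothetical helper names, and it assumes that a file's set of dimension and variable names is enough to spot the odd one out.

```python
from collections import Counter


def read_structure(path):
    """Return the dimension and variable names found in one NetCDF file."""
    import xarray as xr  # imported lazily so find_outliers has no hard dependency

    with xr.open_dataset(path, decode_times=False) as ds:
        return frozenset(ds.sizes) | frozenset(ds.variables)


def find_outliers(structures):
    """Report files whose names differ from the most common set.

    `structures` maps a path to a frozenset of names. The result maps each
    differing path to a (missing, extra) pair of sorted name lists.
    """
    reference, _ = Counter(structures.values()).most_common(1)[0]
    return {
        path: (sorted(reference - names), sorted(names - reference))
        for path, names in structures.items()
        if names != reference
    }
```

To use it, build the mapping with something like `{p: read_structure(p) for p in paths}` over the five files listed above and pass it to `find_outliers`. If the ncdump observation holds, only the 2015 file should be reported.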

For tools like xarray this inconsistency in NetCDF file structure can confuse loading operations, blowing up memory and killing workers.

Thomas-Moore-Creative commented 3 years ago

This approach provides a workaround until BOM fixes the file structure inconsistencies: https://gist.github.com/Thomas-Moore-Creative/ee5af1b6f3db9d0df0b3c3e5b7f02a7d
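
For readers who want the general shape of a pre-processing step, here is a minimal sketch. It is not necessarily what the gist does. It assumes the only structural difference is the lat_bounds and lon_bounds variables noted above, and `drop_bounds` and `open_consistent` are hypothetical names.

```python
# Names as reported in the ncdump comparison; adjust if the files differ.
BOUNDS_VARS = ["lat_bounds", "lon_bounds"]


def drop_bounds(ds):
    """Drop the bounds variables so every file presents the same structure.

    errors="ignore" lets files that already lack them (2015 here) pass
    through unchanged.
    """
    return ds.drop_vars(BOUNDS_VARS, errors="ignore")


def open_consistent(file_list):
    """Lazily open all files, applying drop_bounds to each one first."""
    import xarray as xr  # imported here so drop_bounds can be tried on its own

    return xr.open_mfdataset(file_list, preprocess=drop_bounds, parallel=True)
```

Once the bounds variables are gone, the ncorners dimension should disappear as well, provided no other variable uses it. The trade-off is that the bounds information is discarded for every year, not only 2015.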