MITgcm / xmitgcm

Read MITgcm mds binary files into xarray
http://xmitgcm.readthedocs.io
MIT License

NaNs in data set for large number of files. #303

Closed: mjlosch closed this issue 2 years ago

mjlosch commented 2 years ago

I am trying to read 310 files with 6 3D llc90 fields each (i.e. [50,13,90,90] grid points per 3D field) with

ds = open_mdsdataset('../mydirectory',prefix=['diags3D'],delta_t=deltat,ref_date=refdate,geometry='llc')

This works fine until I increase the number of diags3D files beyond 186 (not sure why this particular number). When I do that, the variables with ds.isel(time=range(186)) contain only NaN, but the variables with ds.isel(time=range(186,310)) are OK. E.g. dst0.iter[185].values gives NaN, but dst0.iter[186].values gives the correct "iteration" number; the same holds for the 3D fields. ds.time is correct for all records.

Any idea what's going on?

mjlosch commented 2 years ago

User error. What a surprise! The first 186 diags3D.* files contained 6 fields, the remaining 124 files contained 7 fields. In my defence, the variables that contained NaN were present in all files, but the extra field seems to have upset xarray.

Is there a way to make open_mdsdataset (actually xarray.combine_by_coords) ignore the extra variable (not available in all files) in such a case? I tried the flag data_vars='minimal', but that didn't help. My workaround for now is to post-process all 186 files and add a dummy field with zeros before using open_mdsdataset, but that's really not optimal.
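
For reference, a minimal sketch of such a post-processing step (an assumption-laden illustration: it presumes the diagnostics were written with the default big-endian 32-bit precision, and the matching .meta files would still have to be updated separately) could look like this:

import glob
import numpy as np

# one 3D llc90 field has 50*13*90*90 points
n3d = 50 * 13 * 90 * 90
dtype = ">f4"  # adjust if the run used 64-bit output

for fname in sorted(glob.glob("../mydirectory/diags3D.*.data")):
    data = np.fromfile(fname, dtype=dtype)
    if data.size == 6 * n3d:
        # append one zero-filled dummy field to the files with only 6 fields
        data = np.concatenate([data, np.zeros(n3d, dtype=dtype)])
        data.tofile(fname)
        # remember to also update nrecords/fldList in the corresponding
        # .meta file so that open_mdsdataset sees 7 fields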

rabernat commented 2 years ago

Hi Martin and sorry for the slow reply here!

I tried the flag data_vars='minimal', but that didn't help.

That would have been my best guess. 😞 Too bad it didn't work.

My recommendation for a quick workaround would be to open the files as two different datasets (specifying iters manually in open_mdsdataset), harmonize them however you prefer, and then manually concatenate them with xarray.
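
Roughly something like this (just a sketch; the directory, iteration lists, and time parameters are placeholders taken from your example above, not tested):

import xarray as xr
from xmitgcm import open_mdsdataset

# open the two groups of files separately by giving the iteration numbers
# explicitly (iters_6fields / iters_7fields are placeholder lists)
ds_a = open_mdsdataset('../mydirectory', prefix=['diags3D'], iters=iters_6fields,
                       delta_t=deltat, ref_date=refdate, geometry='llc')
ds_b = open_mdsdataset('../mydirectory', prefix=['diags3D'], iters=iters_7fields,
                       delta_t=deltat, ref_date=refdate, geometry='llc')

# harmonize, e.g. by keeping only the variables common to both datasets,
# then concatenate along time
common = sorted(set(ds_a.data_vars) & set(ds_b.data_vars))
ds = xr.concat([ds_a[common], ds_b[common]], dim='time')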

mjlosch commented 2 years ago

Thanks Ryan, I think it should be possible to fix this (so that only the extra field contains NaN and not everything). I tried to combine just two datasets with different fields (read with open_mdsdataset) via xarray.combine_by_coords, and there it works as expected (somehow: the extra field is NaN where it was not available, but all other fields are OK). When I read these two directly with open_mdsdataset, I can reproduce the problem. I am not sure where this happens.

mjlosch commented 2 years ago

I have the impression that this works: ds = xr.combine_by_coords(datasets, compat='no_conflicts', coords='minimal', combine_attrs='drop_conflicts') instead of https://github.com/MITgcm/xmitgcm/blob/d06d9851356568910ece576b2757e2b36bbd670f/xmitgcm/mds_store.py#L265

but it appears to be much slower; for 3 files (each with six 3D llc90 fields) I get these timings:

compat='override': 0.25 sec
compat='no_conflicts': 33.05 sec
compat='no_conflicts', data_vars='minimal': 20.02 sec

That's probably not acceptable.

mjlosch commented 2 years ago

Just for the record: when I try the above combination (compat='no_conflicts', coords='minimal') on my entire data set, I get this error after a long wait:

Traceback (most recent call last):
  File "/home/ollie/mlosch/MITgcm/xmitgcm/xmitgcm/mds_store.py", line 202, in open_mdsdataset
    iternum = int(iters)
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ollie/mlosch/MITgcm/MITgcm/idemix_test/pt.py", line 13, in <module>
    ds = xmitgcm.open_mdsdataset(tdir,prefix=['diags3D'], #iters=[3243360,3260880,3278400],
  File "/home/ollie/mlosch/MITgcm/xmitgcm/xmitgcm/mds_store.py", line 267, in open_mdsdataset
    ds = xr.combine_by_coords(datasets, compat='no_conflicts', data_vars = 'minimal', coords='minimal', combine_attrs='drop_conflicts')
  File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/combine.py", line 986, in combine_by_coords
    return merge(
  File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/merge.py", line 905, in merge
    merge_result = merge_core(
  File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/merge.py", line 640, in merge_core
    variables, out_indexes = merge_collected(
  File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/merge.py", line 242, in merge_collected
    merged_vars[name] = unique_variable(name, variables, compat)
  File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/merge.py", line 144, in unique_variable
    out = out.compute()
  File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/variable.py", line 475, in compute
    return new.load(**kwargs)
  File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/variable.py", line 451, in load
    self._data = as_compatible_data(self._data.compute(**kwargs))
  File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/dask/base.py", line 288, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/dask/base.py", line 572, in compute
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/dask/base.py", line 572, in <listcomp>
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/dask/array/core.py", line 1164, in finalize
    return concatenate3(results)
  File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/dask/array/core.py", line 5003, in concatenate3
    result = np.empty(shape=shape, dtype=dtype(deepfirst(arrays)))
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 6.08 GiB for an array with shape (310, 50, 13, 90, 90) and data type float32

so the 'no_conflicts' option is too expensive, both in time and in memory.

mjlosch commented 2 years ago

Ryan, here's a minimal example that illustrates what happens:

import xarray as xr
import numpy as np

print("xr-version: "+xr.__version__)

nx=1
x = np.arange(nx)

ds0 = xr.Dataset(
    data_vars={"foo": (["time", "x"], np.random.rand(1,nx))},
    coords={"time": [0], "x": x},
    )

ds1 = xr.Dataset(
    data_vars={"foo": (["time", "x"], np.random.rand(1,nx)),
               "ffu": (["time", "x"], np.random.rand(1,nx))},
    coords={"time": [1], "x": x},
    )

datasets = [ds0,ds1]
[print("foo(%i)=%f"%(i,datasets[i].foo.values)) for i in range(2)]
[print("ffu(%i)=%f"%(i,datasets[i].ffu.values)) for i in range(1,2)]

print("compat='override':")
print(xr.combine_by_coords(datasets, compat='override'))
print("compat='no_conflicts':")
print(xr.combine_by_coords(datasets, compat='no_conflicts'))

When I run this, I get:

xr-version: 2022.3.0
foo(0)=0.735614
foo(1)=0.662675
ffu(1)=0.793732
compat='override':
<xarray.Dataset>
Dimensions:  (time: 2, x: 1)
Coordinates:
  * time     (time) int64 0 1
  * x        (x) int64 0
Data variables:
    foo      (time, x) float64 nan 0.6627
    ffu      (time, x) float64 nan 0.7937
compat='no_conflicts':
<xarray.Dataset>
Dimensions:  (time: 2, x: 1)
Coordinates:
  * time     (time) int64 0 1
  * x        (x) int64 0
Data variables:
    foo      (time, x) float64 0.7356 0.6627
    ffu      (time, x) float64 nan 0.7937

So with "no_conflicts" it gives an "acceptable" result, but with "override" the data of the dataset without the additional variable ffu is all NaN (independent of the order of the datasets in time). Is that the expected behaviour of "override"?

mjlosch commented 2 years ago

Hi again, sorry for not even letting you reply to my previous post. I have a suggestion that fixes the problem (for me): after the loop that builds the list of datasets https://github.com/MITgcm/xmitgcm/blob/d06d9851356568910ece576b2757e2b36bbd670f/xmitgcm/mds_store.py#L239-L240 I add:

                # drop all data variables not common to all datasets;
                # this should be done by xr.combine_by_coords, but with
                # the "override" option it does not work properly
                this_set = set(datasets[0].data_vars)
                for ds in datasets:
                    this_set = set(ds.data_vars) & this_set
                for k, ds in enumerate(datasets):
                    vars_to_drop = set(ds.data_vars) ^ this_set
                    if len(vars_to_drop) > 0:
                        print("dropping variables: ",vars_to_drop)
                        datasets[k] = ds.drop_vars(vars_to_drop)
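
Applied to the minimal ds0/ds1 example above, the same idea would look like this (just a sketch):

# keep only the variables present in every dataset before combining;
# with the extra "ffu" dropped, compat='override' no longer turns the
# data of ds0 into NaN
common = sorted(set(ds0.data_vars) & set(ds1.data_vars))
trimmed = [ds[common] for ds in (ds0, ds1)]
print(xr.combine_by_coords(trimmed, compat='override'))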

If you think that this is a good idea, I can create a PR.

timothyas commented 2 years ago

Closed via #304