I am trying to read 310 files with 6 3D llc90 fields each (i.e. [50,13,90,90] grid points per 3D field) with open_mdsdataset. This works fine until I increase the number of diags3D files beyond 186 (not sure why this number). When I do that, the variables in ds.isel(time=range(186)) contain only nan, but the variables in ds.isel(time=range(186,310)) are OK. E.g. dst0.iter[185].values gives nan, but dst0.iter[186].values gives the correct "iteration" number; the same holds for the 3D fields. ds.time is correct for all records. Any idea what's going on?
User error. What a surprise! The first 186 files diags3D.* contained 6 fields, the remaining 124 files contained 7 fields. In my defence, the variables that contained NaN were present in all files; it is the extra field that seems to have upset xarray.
Is there a way to make open_mdsdataset (actually xarray.combine_by_coords) ignore the extra variable (not available in all files) in such a case?
I tried the flag data_vars='minimal', but that didn't help.
My solution to this is now to post-process all 186 files to add a dummy field with zeros before I can use open_mdsdataset, but that's really not optimal.
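For the record, the post-processing step looks roughly like this (a sketch only: it assumes the default big-endian float32 MDS format, my llc90 dimensions, and that the 6-field files sort first; the matching .meta files also need their field count and field list updated, which is not shown):

# hypothetical sketch: append one all-zero 3D field to each of the short
# diags3D files; assumes default MDS format (big-endian float32) and
# llc90 dimensions
import glob
import numpy as np

nz, nfaces, ny, nx = 50, 13, 90, 90
dummy = np.zeros(nz * nfaces * ny * nx, dtype='>f4')

# the first 186 files (in iteration order) are the 6-field ones
for fname in sorted(glob.glob('diags3D.*.data'))[:186]:
    with open(fname, 'ab') as f:
        dummy.tofile(f)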
Hi Martin and sorry for the slow reply here!
I tried the flag data_vars='minimal', but that didn't help.
That would have been my best guess. 😞 Too bad it didn't work.
My recommendation for a quick workaround would be to open the files as two different datasets (specifying iters manually in open_mdsdataset), harmonize them however you prefer, and then manually concatenate them with xarray.
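Roughly like this (an untested sketch; data_dir, iters_6, and iters_7 are placeholders for your run directory and the two groups of iteration numbers):

import xarray as xr
from xmitgcm import open_mdsdataset

# read the 6-field and the 7-field files as two separate datasets
ds6 = open_mdsdataset(data_dir, prefix=['diags3D'], iters=iters_6)
ds7 = open_mdsdataset(data_dir, prefix=['diags3D'], iters=iters_7)

# harmonize: keep only the variables present in both, then concatenate
common = sorted(set(ds6.data_vars) & set(ds7.data_vars))
ds = xr.concat([ds6[common], ds7[common]], dim='time')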
Thanks Ryan, I think it should be possible to fix this (so that only the extra fields have NaN and not everything). I tried to combine just two datasets with different fields (read with open_mdsdataset) with xarray.combine_by_coords, and there it works as expected (somehow: the extra field is NaN where it was not available, but all other fields are OK). When I read these two directly with open_mdsdataset, I can reproduce the problem. I am not sure where this happens.
I have the impression that this works:
ds = xr.combine_by_coords(datasets, compat='no_conflicts', coords='minimal', combine_attrs='drop_conflicts')
instead of
https://github.com/MITgcm/xmitgcm/blob/d06d9851356568910ece576b2757e2b36bbd670f/xmitgcm/mds_store.py#L265
but it appears to be much slower; for 3 files (of my 6 x llc90 3D fields) I get these timings:
compat='override': 0.25 sec
compat='no_conflicts': 33.05 sec
compat='no_conflicts', data_vars='minimal': 20.02 sec
That's probably not acceptable.
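(For completeness, a sketch of how such a comparison can be timed, assuming datasets is the list of per-file datasets that open_mdsdataset builds internally:)

import time
import xarray as xr

# compare the three compat settings on the same list of datasets
for kwargs in [dict(compat='override'),
               dict(compat='no_conflicts'),
               dict(compat='no_conflicts', data_vars='minimal')]:
    t0 = time.perf_counter()
    xr.combine_by_coords(datasets, coords='minimal',
                         combine_attrs='drop_conflicts', **kwargs)
    print(kwargs, '%.2f sec' % (time.perf_counter() - t0))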
Just for the record: when I try the above combination (compat='no_conflicts', coords='minimal') on my entire data set, I get this error after some waiting time:
Traceback (most recent call last):
File "/home/ollie/mlosch/MITgcm/xmitgcm/xmitgcm/mds_store.py", line 202, in open_mdsdataset
iternum = int(iters)
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ollie/mlosch/MITgcm/MITgcm/idemix_test/pt.py", line 13, in <module>
ds = xmitgcm.open_mdsdataset(tdir,prefix=['diags3D'], #iters=[3243360,3260880,3278400],
File "/home/ollie/mlosch/MITgcm/xmitgcm/xmitgcm/mds_store.py", line 267, in open_mdsdataset
ds = xr.combine_by_coords(datasets, compat='no_conflicts', data_vars = 'minimal', coords='minimal', combine_attrs='drop_conflicts')
File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/combine.py", line 986, in combine_by_coords
return merge(
File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/merge.py", line 905, in merge
merge_result = merge_core(
File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/merge.py", line 640, in merge_core
variables, out_indexes = merge_collected(
File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/merge.py", line 242, in merge_collected
merged_vars[name] = unique_variable(name, variables, compat)
File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/merge.py", line 144, in unique_variable
out = out.compute()
File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/variable.py", line 475, in compute
return new.load(**kwargs)
File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/xarray/core/variable.py", line 451, in load
self._data = as_compatible_data(self._data.compute(**kwargs))
File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/dask/base.py", line 288, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/dask/base.py", line 572, in compute
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/dask/base.py", line 572, in <listcomp>
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/dask/array/core.py", line 1164, in finalize
return concatenate3(results)
File "/home/ollie/mlosch/miniconda3/envs/mitgcm/lib/python3.10/site-packages/dask/array/core.py", line 5003, in concatenate3
result = np.empty(shape=shape, dtype=dtype(deepfirst(arrays)))
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 6.08 GiB for an array with shape (310, 50, 13, 90, 90) and data type float32
So the 'no_conflicts' option is too expensive time- and memory-wise.
Ryan, here's a minimal example that illustrates what happens:
import xarray as xr
import numpy as np

print("xr-version: " + xr.__version__)

nx = 1
x = np.arange(nx)
# ds0 has only "foo"; ds1 has the extra variable "ffu"
ds0 = xr.Dataset(
    data_vars={"foo": (["time", "x"], np.random.rand(1, nx))},
    coords={"time": [0], "x": x},
)
ds1 = xr.Dataset(
    data_vars={"foo": (["time", "x"], np.random.rand(1, nx)),
               "ffu": (["time", "x"], np.random.rand(1, nx))},
    coords={"time": [1], "x": x},
)
datasets = [ds0, ds1]
[print("foo(%i)=%f" % (i, datasets[i].foo.values)) for i in range(2)]
[print("ffu(%i)=%f" % (i, datasets[i].ffu.values)) for i in range(1, 2)]
print("compat='override':")
print(xr.combine_by_coords(datasets, compat='override'))
print("compat='no_conflicts':")
print(xr.combine_by_coords(datasets, compat='no_conflicts'))
When I run this, I get:
xr-version: 2022.3.0
foo(0)=0.735614
foo(1)=0.662675
ffu(1)=0.793732
compat='override':
<xarray.Dataset>
Dimensions: (time: 2, x: 1)
Coordinates:
* time (time) int64 0 1
* x (x) int64 0
Data variables:
foo (time, x) float64 nan 0.6627
ffu (time, x) float64 nan 0.7937
compat='no_conflicts':
<xarray.Dataset>
Dimensions: (time: 2, x: 1)
Coordinates:
* time (time) int64 0 1
* x (x) int64 0
Data variables:
foo (time, x) float64 0.7356 0.6627
ffu (time, x) float64 nan 0.7937
So with "no_conflicts" it gives an "acceptable" result, but with "override" the data for the dataset without the additional variable ffu is all NaN (this is independent of the order of the datasets in time). Is that the expected behaviour of "override"?
Hi again, sorry for not even letting you reply to my previous post. I have a suggestion that fixes the problem (for me): after the recursive loop over the datasets https://github.com/MITgcm/xmitgcm/blob/d06d9851356568910ece576b2757e2b36bbd670f/xmitgcm/mds_store.py#L239-L240 I add:
# drop all data variables not common to all datasets;
# this should be done by xr.combine_by_coords, but
# with the "override" option it does not work properly
this_set = set(datasets[0].data_vars)
for ds in datasets:
    this_set = set(ds.data_vars) & this_set
for k, ds in enumerate(datasets):
    vars_to_drop = set(ds.data_vars) ^ this_set
    if len(vars_to_drop) > 0:
        print("dropping variables: ", vars_to_drop)
        datasets[k] = ds.drop_vars(vars_to_drop)
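Applied to the minimal example from my previous post (just to illustrate the effect, reusing ds0 and ds1 from there):

# keep only the variables common to ds0 and ds1, then combine with "override"
common = sorted(set(ds0.data_vars) & set(ds1.data_vars))
ds = xr.combine_by_coords([ds0[common], ds1[common]], compat='override')
print(ds)  # foo is now valid in both time records; ffu is dropped entirely

The price is that the extra variable is dropped (with a printed warning) rather than NaN-padded, but that seems preferable to losing everything.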
If you think that this is a good idea, I can create a PR.
Closed via #304