ContinuumIO / xarray_filters

A Pipeline approach to chaining common xarray data structure conversions

from_features is broken if rows are dropped after to_features #27

Closed PeterDSteinberg closed 6 years ago

PeterDSteinberg commented 7 years ago

to_features calls .ravel on each DataArray to create a MultiIndex on the row dimension, typically called space. Then from_features can be called to reshape each column of the .features DataArray back into ND arrays. Currently this fails, but shouldn't, when rows are dropped from the .features DataArray.

To replicate, make a random MLDataset that has >1 DataArray and call .to_features(), then drop some rows, then try calling from_features. The problem is in the assumption about the shape of the DataArray(s) that need to be created.
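To make the shape assumption concrete, here is a minimal NumPy-only sketch. It does not call xarray_filters at all; the (2, 3) shape and the variable names are assumptions chosen for illustration. It only shows that a raveled column can no longer be reshaped to the original shape once rows are removed.

```python
import numpy as np

# Hypothetical original layer shape (an assumption for illustration).
original_shape = (2, 3)

# One raveled feature column, as a stand-in for a column of .features.
column = np.arange(np.prod(original_shape), dtype=float)

# Mark two rows as missing, then drop them.
column[:2] = np.nan
clean = column[~np.isnan(column)]

# Reshaping to the pre-drop shape now fails because the sizes disagree.
try:
    clean.reshape(original_shape)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(clean.size, error_message)
```

The reshape only succeeds when the number of remaining rows still equals the product of the original dimension lengths, which is the assumption that breaks here.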

gbrener commented 7 years ago

Looking into this now: when I remove a row from the .features DataArray, I get a ValueError "cannot reshape array of size N into shape (X, Y)" from here: https://github.com/ContinuumIO/xarray_filters/blob/master/xarray_filters/reshape.py#L243 It is related to the dims variable, which comes out of the multi_index.multi_index_to_coords function. Just wanted to confirm/elaborate on your issue, before investigating further.

PeterDSteinberg commented 7 years ago

Exactly, yes, @gbrener. It cannot reshape the column into (X, Y) because, I assume, (X, Y) is the original array's shape before the NaN rows are dropped. After the drop some points are missing, so a plain reshape no longer works. Maybe after rows are dropped we have to detect the unique coordinate tuples that remain, construct an empty array of the right size, and put NaNs where the .features array lacks rows.
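A rough sketch of that idea, in plain NumPy, might look like the following. The helper name rebuild_with_nans and its arguments are hypothetical and not part of xarray_filters, and the sketch assumes numeric coordinates that sort cleanly. It is one possible approach, not necessarily the fix that was eventually merged.

```python
import numpy as np


def rebuild_with_nans(index_tuples, values, dim_coords=None):
    """Scatter a 1-D column back onto an N-D grid, NaN where rows are missing.

    Hypothetical helper for illustration only.
    index_tuples: one coordinate tuple per remaining row.
    values: 1-D array with the same length as index_tuples.
    dim_coords: optional sorted coordinate arrays, one per dimension;
        if omitted, the unique coordinates that remain are used.
    """
    index = np.asarray(index_tuples)
    values = np.asarray(values, dtype=float)
    ndim = index.shape[1]
    if dim_coords is None:
        # np.unique returns sorted values, which searchsorted needs below.
        dim_coords = [np.unique(index[:, d]) for d in range(ndim)]
    else:
        dim_coords = [np.asarray(c) for c in dim_coords]
    shape = tuple(len(c) for c in dim_coords)
    out = np.full(shape, np.nan)
    # Translate each coordinate value into its integer position on the grid.
    positions = tuple(
        np.searchsorted(dim_coords[d], index[:, d]) for d in range(ndim)
    )
    out[positions] = values
    return out, dim_coords


# Build a small 2 x 3 layer and its raveled (row index, value) form.
x, y = np.arange(2), np.arange(3)
full = np.arange(6, dtype=float).reshape(2, 3)
xs, ys = np.meshgrid(x, y, indexing="ij")
index = np.column_stack([xs.ravel(), ys.ravel()])
values = full.ravel()

# Drop the first two rows, then rebuild with NaN in the gaps.
restored, coords = rebuild_with_nans(index[2:], values[2:])
print(restored)
```

Note that if every row for some coordinate value is dropped, the rebuilt array shrinks along that dimension unless the full coordinates are passed in through dim_coords, so the caller has to decide which behavior is wanted.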

PeterDSteinberg commented 7 years ago

Another way to replicate:

At the bottom of MLDataset-Reshape-Examples, do the following to create a flattened MLDataset, add some NaNs to it, and make a clean MLDataset:

flattened = dset.chain([(example_agg, dict(dim='a')),
                         (example_agg, dict(dim='b')),
                         layers_example_with_kw,
                         layers_example_named_args,
                        ]).to_features()

flattened.features.values[0:5, 0] = np.nan
clean = flattened.dropna('space')

Check that we dropped the 5 NaN rows from .features:

clean.features.shape, flattened.features.shape
((835, 1), (840, 1))

Now call from_features() and get a ValueError while it attempts to reshape into the wrong-size array template:

clean.from_features()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-640961d30be5> in <module>()
----> 1 clean.from_features()

~/Documents/earth/xarray_filters/xarray_filters/mldataset.py in from_features(self, features_layer)
     39             raise ValueError('features_layer ({}) not in self.data_vars'.format(features_layer))
     40         data_arr = self[features_layer]
---> 41         dset = from_features(data_arr)
     42         return dset
     43 

~/Documents/earth/xarray_filters/xarray_filters/reshape.py in from_features(arr, axis)
    240     dset = OrderedDict()
    241     for j in range(simple_np_arr.size):
--> 242         val = arr[:, j].values.reshape(shp)
    243         layer = simple_np_arr[j]
    244         dset[layer] = xr.DataArray(val, coords=coords, dims=dims)

ValueError: cannot reshape array of size 835 into shape (4,5,6,7)

gbrener commented 7 years ago

@PeterDSteinberg thanks for the code sample. I reproduced the issue, and think the problem may stem from how we're gleaning the coordinates/dimensions from the feature DataArray's MultiIndex (https://github.com/ContinuumIO/xarray_filters/blob/master/xarray_filters/multi_index.py#L56). Still working on what the solution should be.
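For illustration only, here is one way a MultiIndex can report dimensions that no longer match the row count. This is a pandas-level sketch and does not reproduce the multi_index_to_coords code in this repo; whether that function hits exactly this behavior is an assumption. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# A 2 x 3 grid flattened to a MultiIndex, similar in spirit to "space".
full_index = pd.MultiIndex.from_product(
    [np.arange(2), np.arange(3)], names=["x", "y"]
)
column = pd.Series(np.arange(6, dtype=float), index=full_index)

# Drop every row where x == 0.
column.iloc[:3] = np.nan
clean = column.dropna()

# The index levels still list the original coordinates after the drop,
# so a shape derived from them is stale relative to the 3 remaining rows.
stale_shape = tuple(len(level) for level in clean.index.levels)

# remove_unused_levels() trims the levels to the values still present.
trimmed = clean.index.remove_unused_levels()
trimmed_shape = tuple(len(level) for level in trimmed.levels)

# Reindexing against the full product index restores the original shape,
# with NaN wherever a row was dropped.
restored = clean.reindex(full_index).to_numpy().reshape(2, 3)

print(stale_shape, trimmed_shape)
print(restored)
```

Neither the stale nor the trimmed level lengths are guaranteed to multiply out to the number of remaining rows, which is why reindexing against the full set of coordinate tuples (or filling an empty template) is the safer route before reshaping.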

gbrener commented 6 years ago

@PeterDSteinberg, I posted some possible approaches/solutions in the company chat. Here is a condensed testcase that we can add to this repo once we decide on a fix:

import numpy as np
import xarray as xr
from xarray_filters import MLDataset

X = MLDataset({'pressure': xr.DataArray(np.random.uniform(0, 1, (2,3)),
                                        coords={'x': np.arange(2),
                                                'y': np.arange(3)},
                                        dims=['x', 'y']),
               'temperature': xr.DataArray(np.random.uniform(0, 1, (2,3)),
                                           coords={'x': np.arange(2),
                                                   'y': np.arange(3)},
                                           dims=['x', 'y'])})

print(X)
features = X.to_features()
print('X.to_features():', features)
data1 = features.from_features()
print('X.to_features().from_features():', data1)

assert np.array_equal(data1.coords.to_index().values, X.coords.to_index().values)
np.testing.assert_allclose(data1.to_array()[0], X.to_array()[0])

features['features'].values[:2, 0] = np.nan
print(features['features'])
features = features.dropna('space')
data2 = features.from_features()

gbrener commented 6 years ago

This is fixed as of PR #39 . Closing.