NREL / rex

REsource eXtraction Tool (rex)
https://nrel.github.io/rex
BSD 3-Clause "New" or "Revised" License

Port sup3r `Regridder` #176

Closed ppinchuk closed 3 months ago

ppinchuk commented 3 months ago

Ports the sup3r Regridder class.

The Regridder implementation should match the one in @bnb32's dh-refactor branch.

To be honest, IDK if the CachedRegridder is the right approach here, but I did want two main pieces of functionality back:

The idea is that the Regridder can be imported into sup3r as is, and then other use cases like HRRR can decide to use the cached implementation if needed.

If necessary, a cached regridder with a dask __call__ method can be hacked together (in a script or elsewhere):

```python
regridder = Regridder(source_meta, target_meta)
regridder._distances, regridder._indices = CachedRegridder.load_cache("./{array_name}.pkl")
```
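For anyone curious, the caching itself amounts to little more than pickling the tree-query results. A rough sketch of the idea (toy arrays and a temp-file path, not the actual CachedRegridder API):

```python
import pickle
import tempfile

import numpy as np

# Toy stand-ins for the regridder's ball-tree query results:
# distances and neighbor indices for 10 target sites, 4 neighbors each.
distances = np.random.rand(10, 4)
indices = np.random.randint(0, 100, size=(10, 4))

# Hypothetical cache file; the real class manages its own path.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
    pickle.dump({"distances": distances, "indices": indices}, f)
    path = f.name

# Loading the cache skips the (potentially expensive) tree query.
with open(path, "rb") as f:
    cache = pickle.load(f)

assert np.array_equal(cache["distances"], distances)
assert np.array_equal(cache["indices"], indices)
```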

P.S. I see that rex runs reV and reVX tests. Should it be running sup3r tests as well?

bnb32 commented 3 months ago

What was the problem with passing numpy arrays through the dask __call__ method? Would be nice to at least subclass Regridder for the cached implementation. Beyond that, it may even make sense to go back to something closer to the Regridder in sup3r/main with the caching functionality. Adding sup3r tests might make sense, although rex already has so many tests lol.

ppinchuk commented 3 months ago

The problem had to do with the da.concatenate(self.indices) call. I see you had to concatenate the indices and then reshape the array back in the dask implementation (i.e. data[da.concatenate(self.indices)].reshape(new_shape)), as opposed to just data[np.array(self.indices)]. Slicing like this (data[da.concatenate(self.indices)]) fails if data is a numpy array (at least in my testing).
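For a concrete picture, here is a toy sketch of the two indexing styles (made-up 3-target/2-neighbor indices, not the real Regridder internals). Note the flatten-then-reshape dance on the dask side:

```python
import numpy as np
import dask.array as da

# Toy stand-ins for Regridder._indices: for each of 3 target sites,
# the indices of its 2 nearest source sites (hypothetical values).
indices = [[0, 1], [1, 2], [2, 3]]
data = np.arange(8.0).reshape(4, 2)  # (n_source_sites, n_times)

# numpy: fancy indexing with the nested index list works directly.
vals_np = data[np.array(indices)]  # shape (3, 2, 2)

# dask: flatten the indices to 1-D first, then reshape the result back
# (the da.concatenate(...) + .reshape(...) pattern from the thread).
ddata = da.from_array(data, chunks=(2, 2))
flat = np.concatenate(indices)  # -> [0, 1, 1, 2, 2, 3]
vals_da = ddata[flat].reshape(3, 2, 2).compute()

assert np.array_equal(vals_np, vals_da)
```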

If you want to see for yourself, you can reproduce the error by removing the da.from_array call in test_regridding_basic.

Also I guess you would have to call .compute() on the output of the interpolation? This isn't too big of a deal but might catch people off guard if they pass in a numpy array and just expect to get one back.

ppinchuk commented 3 months ago

As for the subclassing, I think in that case it would make more sense to have everything in one implementation like it was before.

Otherwise I personally think composition is the way to go, since we don't really expect CachedRegridder to have to do any querying anyways.

Happy to discuss more about this though

bnb32 commented 3 months ago

Yeah, that's what I figured the problem was. We could just add a check like isinstance(data, np.ndarray) and modify accordingly? I think I would prefer that to a new call method. Agreed on the composition point. Nice tests too, btw

ppinchuk commented 3 months ago

And this is where the plot thickens!

So I thought of that option and even implemented it at one point but something weird happened. Basically the test_regridding_with_dask test started failing. Apparently isinstance(data, np.ndarray) was evaluating to True for the ws = res['windspeed_100m', ...] data, so it was taking the numpy __call__ implementation, and therefore failing on the .compute() call.

The weird part is that the dask implementation works just fine for this data (and I assume any other data read in from a h5 file?). I chalked this up to something tricky that h5py was doing when reading the data but TBH I didn't dig in further in the interest of time.

So to recap, if I include the isinstance(data, np.ndarray) branch in the __call__ method, it will catch both numpy arrays and data read from HDF5 files, and the latter seems to be the main use case in sup3r? Please correct me if I am wrong here. If not, though, it raises the question of whether we even need a dask implementation at that point?
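For what it's worth, this is consistent with plain h5py: reading a dataset slice returns an in-memory numpy array, so an isinstance check can't distinguish "freshly read from HDF5" from any other ndarray. A toy demonstration (hypothetical file and dataset names):

```python
import os
import tempfile

import h5py
import numpy as np

# Write a small hypothetical dataset to a temp HDF5 file.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f["windspeed_100m"] = np.random.rand(5, 3)

# Reading a slice materializes the data as a plain numpy array,
# so an isinstance(data, np.ndarray) branch would catch it.
with h5py.File(path, "r") as f:
    ws = f["windspeed_100m"][...]

assert isinstance(ws, np.ndarray)
```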

Really curious to hear your thoughts

ppinchuk commented 3 months ago

BTW, I am very inexperienced with dask - is a da.from_array call expensive? One alternative is to just cast all data inputs to a dask array before processing, but I was afraid that this might be a big price to pay for large numpy arrays (i.e. time series for a large spatial extent)

bnb32 commented 3 months ago

> And this is where the plot thickens!
>
> So I thought of that option and even implemented it at one point but something weird happened. Basically the test_regridding_with_dask test started failing. Apparently isinstance(data, np.ndarray) was evaluating to True for the ws = res['windspeed_100m', ...] data, so it was taking the numpy __call__ implementation, and therefore failing on the .compute() call.
>
> The weird part is that the dask implementation works just fine for this data (and I assume any other data read in from a h5 file?). I chalked this up to something tricky that h5py was doing when reading the data but TBH I didn't dig in further in the interest of time.
>
> So to recap, if I include the isinstance(data, np.ndarray) branch in the __call__ method, it will catch both numpy arrays and data read from HDF5 files, and the latter seems to be the main use case in sup3r? Please correct me if I am wrong here. If not, though, it raises the question of whether we even need a dask implementation at that point?
>
> Really curious to hear your thoughts

Ah yeah, that makes sense. I guess when you use the rex handlers res[dset] will load into memory, thus converting to a numpy array? You could just do a hasattr(arr, 'compute') check on return. I have a function in the refactor called compute_if_dask that I wrap arrays in, which just checks for a compute method (https://github.com/NREL/sup3r/blob/51802a145c481fa3788994b95f27d80af9057ca8/sup3r/preprocessing/utilities.py#L118).

On your second comment - I have Loader objects which cast everything to dask arrays with da.asarray(res.h5[dset]) (https://github.com/NREL/sup3r/blob/51802a145c481fa3788994b95f27d80af9057ca8/sup3r/preprocessing/loaders/h5.py#L83). Accessing with res.h5[dset] doesn't load anything into memory, and the asarray() call has not added any noticeable slowdown even on full WTK files. Maybe this sort of casting is the cleanest way to do this, as you mentioned. Just a single call method and not a burden for already loaded numpy objects, afaik.
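The duck-typed check boils down to a few lines. A toy version mirroring the idea behind the compute_if_dask utility linked above (sketch, not the sup3r implementation):

```python
import numpy as np
import dask.array as da


def compute_if_dask(arr):
    """Evaluate lazy arrays; pass concrete arrays through unchanged.

    Anything with a ``compute`` method (e.g. a dask array) gets
    evaluated to a numpy result; plain numpy arrays are returned as-is.
    """
    return arr.compute() if hasattr(arr, "compute") else arr


# Both call sites get a concrete numpy array back.
out_np = compute_if_dask(np.ones(3))
out_da = compute_if_dask(da.ones(3))
assert isinstance(out_np, np.ndarray)
assert isinstance(out_da, np.ndarray)
```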

ppinchuk commented 3 months ago

Yea, so I agree that casting with asarray() would be the cleanest solution by far, but I just tested it with some data, and for some reason the dask implementation is taking 11 minutes (and counting) to evaluate something that the numpy implementation handled in 4 seconds.

[screenshot: timing of the dask vs. numpy implementations]

So I am inclined to keep 2 implementations, unless we really want to try to figure out WTH is going on under the hood.

If I go with the hasattr(arr, 'compute'), would that cover the sup3r cases? I guess I just want to make sure I am not forcing sup3r to implicitly use the numpy implementation.

BTW, one more idea (and maybe the best one?) would be just to add an interpolate_numpy method and keep __call__ as purely dask. That way, if someone (like me) wants to use numpy, they can just be explicit about it.

Thoughts?

ppinchuk commented 3 months ago

Thanks @bnb32 for all the helpful feedback. I think the implementation is much cleaner now

bnb32 commented 3 months ago

@ppinchuk Found a new dask method that makes this even nicer! We can replace the dask piece with vals = data.vindex[self.indices]. No reshape or concatenate needed, and we could do hasattr(data, 'vindex') instead. Also we can do just data[self.indices] for the numpy condition.
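A small sketch of this vindex-based dispatch on toy data (hypothetical 3-target/2-neighbor indices): hasattr(data, 'vindex') routes dask arrays to pointwise fancy indexing, while numpy arrays take the nested index array directly, and both give the same result with no reshape.

```python
import numpy as np
import dask.array as da

# Toy stand-ins: for each of 3 target sites, its 2 nearest source sites.
indices = np.array([[0, 1], [1, 2], [2, 3]])
data = np.arange(8.0).reshape(4, 2)  # (n_source_sites, n_times)


def gather(data, indices):
    # dask arrays expose .vindex for pointwise fancy indexing;
    # plain numpy arrays handle the nested index array directly.
    if hasattr(data, "vindex"):
        return data.vindex[indices]
    return data[indices]


out_np = gather(data, indices)
out_da = gather(da.from_array(data, chunks=(2, 2)), indices).compute()

assert out_np.shape == (3, 2, 2)
assert np.array_equal(out_np, out_da)
```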

ppinchuk commented 3 months ago

> @ppinchuk Found a new dask method that makes this even nicer! We can replace the dask piece with vals = data.vindex[self.indices]. No reshape or concatenate needed, and we could do hasattr(data, 'vindex') instead. Also we can do just data[self.indices] for the numpy condition.

Ooooo I will add this!

ppinchuk commented 3 months ago

> This looks great man. Enjoyed the discussion on best implementation.
>
> Using `cached_property` here is a nice touch btw. Ran across this recently - one nice aesthetic advantage is avoiding the `if self._var is None: ...` pattern, right? Gotta work this into my code more.
>
> Slight mod - the arrays in `_InterpolationMixin` can be type `dask.array.core.Array` and `np.ndarray`

yea! I have really grown to like using `cached_property` instead of the `self._var = None` + property implementation of `var` pattern. Only major downside is that `cached_property` is Python 3.8+, but I think we have moved all of our software away from 2.7, so I can start using it more now :)

Will update the docstrings!
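For anyone who hasn't used it, a minimal illustration of the pattern being discussed (toy class, not the real Regridder): the decorator computes the value once on first access and stores it on the instance, replacing the `self._var = None` + property boilerplate.

```python
from functools import cached_property  # Python 3.8+


class Grid:
    """Toy example of cached_property (hypothetical class/attribute)."""

    def __init__(self, n):
        self.n = n

    @cached_property
    def weights(self):
        # Runs once on first access; the result is stored on the
        # instance, so later accesses return the same object.
        return [1.0 / self.n] * self.n


g = Grid(4)
assert g.weights == [0.25] * 4
assert g.weights is g.weights  # cached: same object every time
```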

bnb32 commented 3 months ago

> This looks great man. Enjoyed the discussion on best implementation. Using `cached_property` here is a nice touch btw. Ran across this recently - one nice aesthetic advantage is avoiding the `if self._var is None: ...` pattern, right? Gotta work this into my code more. Slight mod - the arrays in `_InterpolationMixin` can be type `dask.array.core.Array` and `np.ndarray`

> yea! I have really grown to like using `cached_property` instead of the `self._var = None` + property implementation of `var` pattern. Only major downside is that `cached_property` is Python 3.8+, but I think we have moved all of our software away from 2.7, so I can start using it more now :)
>
> Will update the docstrings!

Yeah I think 3.8+ is totally fine. Python itself doesn't even maintain it anymore!