Closed: ppinchuk closed this 3 months ago
What was the problem with passing numpy arrays through the dask `__call__` method? Would be nice to at least subclass `Regridder` for the cached implementation. Beyond that, it might even make sense to go back to something closer to the `Regridder` in sup3r/main with the caching functionality. Adding sup3r tests might make sense, although rex already has so many tests lol.
The problem had to do with the `da.concatenate(self.indices)` call. I see you had to update the code to concat the inds and then reshape the array back in the dask implementation (i.e. `data[da.concatenate(self.indices)].reshape(new_shape)` as opposed to just `data[np.array(self.indices)]`). Slicing like this (`data[da.concatenate(self.indices)]`) fails if `data` is a numpy array (at least in my testing).

If you want to see for yourself, you can reproduce the error by removing the `da.from_array` call in `test_regridding_basic`.
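For context, here is a toy snippet (not from the PR) illustrating the two indexing paths; the shapes and the two-neighbor `indices` list are made up:

```python
import dask.array as da
import numpy as np

# Made-up stand-ins for the regridder internals: ``indices`` mimics
# ``self.indices``, one neighbor-index array per target point
indices = [np.array([0, 1]), np.array([2, 3])]
data = np.random.rand(4, 10)  # e.g. (spatial, temporal)

# dask path: concatenate the per-point indices, fancy-index, reshape back
dask_vals = da.from_array(data)[da.concatenate(indices)].reshape(2, 2, 10)

# numpy path: a 2D integer index array does it in one shot, no reshape
np_vals = data[np.array(indices)]
assert np.allclose(dask_vals.compute(), np_vals)

# Mixing the two is what breaks: fancy-indexing the plain numpy ``data``
# with ``da.concatenate(indices)`` raises, per the testing described above
```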
Also, I guess you would have to call `.compute()` on the output of the interpolation? This isn't too big of a deal, but it might catch people off guard if they pass in a numpy array and just expect to get one back.

As for the subclassing, I think it would make more sense to have everything in one implementation like it was before in that case. Otherwise, I personally think composition is the way to go, since we don't really expect `CachedRegridder` to have to do any querying anyway.

Happy to discuss more about this though.
Yeah, that's what I figured the problem was. We could just add a check like `isinstance(data, np.ndarray)` and modify accordingly? I think I would prefer that to a new call method. Agreed on the composition point. Nice tests too, btw.
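Something like this minimal dispatch sketch (the two `_interp_*` helpers are hypothetical stand-ins for the existing code paths):

```python
import numpy as np

class Regridder:
    """Sketch of the suggested branch; the two ``_interp_*`` helpers
    stand in for the existing dask and numpy code paths."""

    def __call__(self, data):
        # route plain numpy inputs to the numpy implementation,
        # everything else to the dask one
        if isinstance(data, np.ndarray):
            return self._interp_numpy(data)
        return self._interp_dask(data)

    def _interp_numpy(self, data):
        raise NotImplementedError  # existing numpy path goes here

    def _interp_dask(self, data):
        raise NotImplementedError  # existing dask path goes here
```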
And this is where the plot thickens!

So I thought of that option and even implemented it at one point, but something weird happened. Basically, the `test_regridding_with_dask` test started failing. Apparently `isinstance(data, np.ndarray)` was evaluating to `True` for the `ws = res['windspeed_100m', ...]` data, so it was taking the numpy `__call__` implementation, and therefore failing on the `.compute()` call.

The weird part is that the dask implementation works just fine for this data (and, I assume, any other data read in from an h5 file?). I chalked this up to something tricky that `h5py` was doing when reading the data, but TBH I didn't dig in further in the interest of time.

So to recap: if I include the `isinstance(data, np.ndarray)` branch in the `__call__` method, it will catch both numpy arrays and data read from HDF5 files, the latter of which seems to be the main use case in sup3r? Please correct me if I am wrong here. If not, though, it begs the question of whether we even need a dask implementation at that point.

Really curious to hear your thoughts.
BTW, I am very inexperienced with dask - is a `da.from_array` call expensive? One alternative is to just cast all `data` inputs to a dask array before processing, but I was afraid that this might be a big price to pay for large numpy arrays (i.e. a time series over a large spatial extent).
Ah yeah, that makes sense. I guess when you use the rex handlers, `res[dset]` will load into memory, thus converting to a numpy array? You could just do a `hasattr(arr, 'compute')` check on return. I have a function in the refactor called `compute_if_dask` that I wrap arrays in; it just checks for a compute method (https://github.com/NREL/sup3r/blob/51802a145c481fa3788994b95f27d80af9057ca8/sup3r/preprocessing/utilities.py#L118).

On your second comment - I have `Loader` objects which cast everything to dask arrays with `da.asarray(res.h5[dset])` (https://github.com/NREL/sup3r/blob/51802a145c481fa3788994b95f27d80af9057ca8/sup3r/preprocessing/loaders/h5.py#L83). Accessing with `res.h5[dset]` doesn't load anything into memory, and the `asarray()` call has not added any noticeable slowdown, even on full WTK files. Maybe this sort of casting is the cleanest way to do this, as you mentioned. Just a single call method and not a burden for already-loaded numpy objects, afaik.
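For reference, the pattern described above amounts to something like this (a minimal sketch; see the linked sup3r source for the real function):

```python
def compute_if_dask(arr):
    """Return an in-memory array: call ``.compute()`` if ``arr`` is lazy
    (dask arrays expose a ``compute`` method), otherwise pass it through."""
    return arr.compute() if hasattr(arr, 'compute') else arr
```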
Yea, so I agree that casting with `asarray()` would be the cleanest solution by far, but I just tested it with some data, and for some reason the dask implementation is taking 11 minutes (and counting) to evaluate something that the numpy implementation handled in 4 seconds.

So I am inclined to keep 2 implementations, unless we really want to try to figure out WTH is going on under the hood.

If I go with the `hasattr(arr, 'compute')` check, would that cover the sup3r cases? I guess I just want to make sure I am not forcing sup3r to implicitly use the numpy implementation.

BTW, one more idea (and maybe the best one?) would be to just add an `interpolate_numpy` method and keep `__call__` as purely dask. That way, if someone (like me) wants to use numpy, they can just be explicit about it.

Thoughts?
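A toy version of that idea, with made-up names and a trivial mean in place of the real interpolation:

```python
import dask.array as da
import numpy as np

class TwoPathRegridder:
    """Sketch of the 'explicit numpy entry point' idea: ``__call__`` stays
    purely dask and ``interpolate_numpy`` is the opt-in for in-memory data.
    Names and the mean-over-neighbors are illustrative, not the rex API."""

    def __init__(self, indices):
        # one neighbor-index array per target point, as in ``self.indices``
        self.indices = indices

    def __call__(self, data):
        # dask path: concatenate indices, fancy-index, reshape back
        n_points, k = len(self.indices), len(self.indices[0])
        vals = data[da.concatenate(self.indices)]
        vals = vals.reshape((n_points, k) + data.shape[1:])
        return vals.mean(axis=1)  # stand-in for the real interpolation

    def interpolate_numpy(self, data):
        # explicit numpy path: 2D integer indexing, no reshape needed
        return data[np.array(self.indices)].mean(axis=1)

regridder = TwoPathRegridder([np.array([0, 1]), np.array([2, 3])])
data = np.random.rand(4, 8)
assert np.allclose(regridder(da.from_array(data)).compute(),
                   regridder.interpolate_numpy(data))
```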
Thanks @bnb32 for all the helpful feedback. I think the implementation is much cleaner now
@ppinchuk Found a new dask method that makes this even nicer! We can replace the dask piece with `vals = data.vindex[self.indices]`. No reshape or concatenate needed, and we could do `hasattr(data, 'vindex')` instead. Also, we can do just `data[self.indices]` for the numpy condition.
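A minimal sketch of that dispatch (toy shapes; `indices` mimics `self.indices` as a 2D integer array):

```python
import dask.array as da
import numpy as np

def get_neighbor_vals(data, indices):
    """Gather neighbor values for every target point; ``indices`` has
    shape (n_points, n_neighbors)."""
    if hasattr(data, 'vindex'):
        # dask arrays: vectorized fancy indexing, no concat/reshape needed
        return data.vindex[indices]
    return data[indices]  # numpy arrays support this fancy indexing directly

indices = np.array([[0, 1], [2, 3]])
data = np.random.rand(4, 10)
assert np.allclose(get_neighbor_vals(data, indices),
                   get_neighbor_vals(da.from_array(data), indices).compute())
```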
Ooooo I will add this!
This looks great man. Enjoyed the discussion on best implementation.

Using `cached_property` here is a nice touch btw. Ran across this recently - one nice aesthetic advantage is avoiding the `if self._var is None: ...` pattern, right? Gotta work this into my code more. Slight mod - the arrays in `_InterpolationMixin` can be type `dask.array.core.Array` and `np.ndarray`.

Yea! I have really grown to like using `cached_property` instead of the `self._var = None` + property implementation of `var` pattern. The only major downside is that `cached_property` is Python 3.8+, but I think we have moved all of our software away from 2.7, so I can start using it more now :)

Will update the docstrings!
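For anyone following along, the two patterns side by side (a contrived example):

```python
from functools import cached_property  # Python 3.8+

def expensive_query():
    # stand-in for e.g. a ball tree nearest-neighbor query
    return list(range(10))

class WithSentinel:
    """The old pattern: a ``_var = None`` sentinel plus a property."""

    def __init__(self):
        self._indices = None

    @property
    def indices(self):
        if self._indices is None:
            self._indices = expensive_query()
        return self._indices

class WithCachedProperty:
    """Same behavior: the first access computes and caches the value,
    with no sentinel attribute or ``is None`` check needed."""

    @cached_property
    def indices(self):
        return expensive_query()
```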
Yeah I think 3.8+ is totally fine. Python itself doesn't even maintain it anymore!
Ported sup3r `Regridder` class.

The `Regridder` implementation should match the one in @bnb32's dh-refactor branch.

To be honest, IDK if the `CachedRegridder` is the right approach here, but I did want two main pieces of functionality back:

The idea is that the `Regridder` can be imported into sup3r as-is, and then other use cases like HRRR can decide to use the cached implementation if needed.

If necessary, a cached regridder with a dask `__call__` method can be hacked together (in a script or elsewhere):
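The snippet that originally followed here didn't survive extraction; a minimal sketch of what such a hack might look like, assuming the cached regridder exposes its query results as `indices`/`distances` (all names and the inverse-distance weighting are illustrative only):

```python
import dask.array as da
import numpy as np

class DaskCachedRegridder:
    """Hypothetical wrapper: reuse a cached regridder's query results but
    do the interpolation lazily with dask. ``cached`` is assumed to expose
    ``indices`` and ``distances``; attribute names are illustrative."""

    def __init__(self, cached):
        self._cached = cached

    def __call__(self, data):
        # lazily gather neighbor values with vectorized dask indexing
        data = da.asarray(data)
        vals = data.vindex[np.asarray(self._cached.indices)]
        # inverse-distance weights as a stand-in for the real scheme
        weights = 1 / np.maximum(np.asarray(self._cached.distances), 1e-12)
        weights /= weights.sum(axis=1, keepdims=True)
        # result stays lazy; the caller decides when to .compute()
        return (vals * weights[..., None]).sum(axis=1)
```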
P.S. I see that rex runs `reV` and `reVX` tests. Should it be running `sup3r` tests as well?