catalystneuro / lazy_ops

Lazy transposing and slicing of h5py and Zarr Datasets
BSD 3-Clause "New" or "Revised" License

Can it be made a more transparent drop-in for ndarray? #21

Open cboulay opened 4 years ago

cboulay commented 4 years ago

I'm trying to see how far I can take my ~50 GB hdf5 datasets through my processing pipeline before explicitly creating an ndarray. My pipeline uses a framework (Neuropype) that puts the ndarray in a container along with some metadata and makes extensive use of ndarray functions returning views. I think I could get a lot further in this framework with my h5 dataset if a wrapper class like DatasetViewh5py reimplemented some of those ndarray functions that return views.

Are there any downsides to renaming lazy_transpose to transpose?

Do you foresee any problems with a lazy implementation of reshape?

I'm also considering a custom implementation of squeeze.

numpy users expect flatten() to return a copy so probably not that one.

What about min, max, argmin, argmax, any and all when an axis is provided? Even though all of the data will have to be loaded into memory eventually, it can be done sequentially row-by-row (or column-by-column) so maybe this will help avoid out-of-memory errors. I am fairly new to processing data cached-on-disk so I'm hoping others with more experience can tell me if this is a bad idea from the outset.
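A minimal sketch of the sequential approach described above, assuming only that the dataset exposes `.shape` and supports basic slicing (as h5py datasets and NumPy arrays do). The helper names `chunked_reduce` and `chunked_argmin` and the `chunk_rows` parameter are hypothetical and not part of lazy_ops; a NumPy array can stand in for the on-disk dataset when trying it out.

```python
import numpy as np

def chunked_reduce(dset, axis, chunk_rows=1024, op=np.minimum):
    """Reduce `dset` along `axis`, reading at most `chunk_rows` entries of that axis at a time.

    `op` is a binary NumPy ufunc: np.minimum, np.maximum,
    np.logical_or (for any) or np.logical_and (for all).
    """
    ndim = len(dset.shape)
    axis = axis % ndim
    result = None
    for start in range(0, dset.shape[axis], chunk_rows):
        index = [slice(None)] * ndim
        index[axis] = slice(start, start + chunk_rows)
        block = np.asarray(dset[tuple(index)])      # only this slab is in memory
        partial = op.reduce(block, axis=axis)
        result = partial if result is None else op(result, partial)
    return result

def chunked_argmin(dset, axis, chunk_rows=1024):
    """Argmin along `axis`, tracking the running minimum and its global index."""
    ndim = len(dset.shape)
    axis = axis % ndim
    best_val = best_idx = None
    for start in range(0, dset.shape[axis], chunk_rows):
        index = [slice(None)] * ndim
        index[axis] = slice(start, start + chunk_rows)
        block = np.asarray(dset[tuple(index)])
        local_idx = block.argmin(axis=axis) + start  # shift to global position
        local_val = block.min(axis=axis)
        if best_val is None:
            best_val, best_idx = local_val, local_idx
        else:
            better = local_val < best_val            # strict: earlier ties win
            best_val = np.where(better, local_val, best_val)
            best_idx = np.where(better, local_idx, best_idx)
    return best_idx
```

Each slab still spans the full extent of the other axes, so `chunk_rows` has to be chosen with the size of those axes in mind. NaN handling is not addressed in this sketch.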

d-sot commented 4 years ago

Hello, implementing squeeze is easy. Dropping dimensions already happens with int indexing. If we have a dataset we want to squeeze, we can just index it, e.g. dsetview.lazy_slice[:,0,:,:], or more generally dsetview.lazy_slice[tuple(0 if i==1 else slice(None) for i in dsetview.shape)]
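A minimal sketch of the squeeze-by-indexing idea, using a NumPy array as a stand-in for the dataset view. The helper name `squeeze_index` is hypothetical. With lazy_ops the same tuple would presumably be used to index `dsetview.lazy_slice`, but that is not exercised here.

```python
import numpy as np

def squeeze_index(shape):
    """Build an index that drops every length-1 axis and keeps the others whole."""
    return tuple(0 if n == 1 else slice(None) for n in shape)

arr = np.zeros((4, 1, 5, 1))        # stand-in for a dataset view
idx = squeeze_index(arr.shape)      # (slice(None), 0, slice(None), 0)
squeezed = arr[idx]                 # integer indices drop the length-1 axes
```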

It's possible to assign dsetview.transpose to dsetview.lazy_transpose for one's use case, since h5py datasets do not have a transpose method, but if there's a different underlying class it'd override its transpose method.

I'm not too sure about integrating a general lazy reshape. Perhaps when looking to reshape at chunk boundaries, in certain cases. PRs are welcome, but a general reshape working along with lazy transpose and slicing might get too complicated.

To implement min and the other reductions, the data has to be read, perhaps processed sequentially in chunks. Have you looked into dask? Also, for transposing data in place, fastremap may be of help if you are memory constrained.

cboulay commented 4 years ago

I spent much of the last week implementing lazy reshape. And yes, it was complicated. I ended up rewriting about 90% of the code. Though I got it working, and I tested quite a few combinations of transpose, reshape and slice, I'm sure there are some corner cases where it will fail.

After I've had more time to play with it I'll push my changes to my fork, but it's such a huge change that I doubt a PR is what you want. I'll post here again when I feel it's ready for other eyes and you can let me know how you feel.

I took a quick look at dask but it didn't seem to meet my use case. I should look again. Cheers!

bendichter commented 4 years ago

@cboulay .T works as a lazy transpose. Let us know if you come up with anything for reshape. It sounds like a tough problem. We would be interested in incorporating it if it doesn't dramatically increase the difficulty of supporting this package.

cboulay commented 4 years ago

I'm not quite ready to say it's suitable for a PR, but I'll post the main commit for reference in case I get otherwise distracted and someone wants these features without waiting for me to clean it up more: https://github.com/cboulay/lazy_ops/commit/1ef59e69a31453198710f2d5a561d6ce6ca20412

As I was working on it, I thought it would have been better to use strides as the object's state variable to manage transpose & reshape rather than the solution I came up with. I'm sure someone cleverer than myself could get it to work and it would be more elegant than my solution and probably much more flexible.
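To illustrate the strides idea, here is a rough sketch, not the code from the commit above. The class `StridedView` and its methods are hypothetical. It keeps only a shape and per-axis strides (counted in elements of a C-ordered source), so transpose becomes a permutation of that state, and reshape is handled only in the easy case where the view is still C-contiguous. `source_indices` exists purely to check the bookkeeping against NumPy; a real implementation would translate the state into reads from the file instead.

```python
import numpy as np
from math import prod

class StridedView:
    """Hypothetical lazy view: shape and strides (in elements) into a flat, C-ordered source."""

    def __init__(self, shape, strides=None):
        self.shape = tuple(shape)
        if strides is None:
            # C-contiguous strides: the last axis varies fastest
            strides, step = [], 1
            for n in reversed(self.shape):
                strides.append(step)
                step *= n
            strides = strides[::-1]
        self.strides = tuple(strides)

    def transpose(self, axes=None):
        """Permute axes without touching any data."""
        if axes is None:
            axes = range(len(self.shape) - 1, -1, -1)
        axes = tuple(axes)
        return StridedView([self.shape[a] for a in axes],
                           [self.strides[a] for a in axes])

    def reshape(self, new_shape):
        """Reshape only when the view is still C-contiguous (the simple case)."""
        new_shape = tuple(new_shape)
        if prod(new_shape) != prod(self.shape):
            raise ValueError("total size must not change")
        if self.strides != StridedView(self.shape).strides:
            raise ValueError("non-contiguous reshape is not handled in this sketch")
        return StridedView(new_shape)

    def source_indices(self):
        """Flat source index of every element in the view (for verification only)."""
        idx = np.zeros(self.shape, dtype=np.int64)
        for axis, (n, s) in enumerate(zip(self.shape, self.strides)):
            bshape = [1] * len(self.shape)
            bshape[axis] = n
            idx += (np.arange(n) * s).reshape(bshape)
        return idx
```

Slicing could be added by also tracking an offset and scaling the strides by the slice step. The hard part remains a reshape after a transpose, which this sketch deliberately rejects.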