Open cboulay opened 4 years ago
Hello, implementing squeeze is easy. Dropping dimensions already happens with int indexing. If we have a dataset we want to squeeze, we can index each length-1 axis with 0, as in dsetview.lazy_slice[:,0,:,:], e.g. dsetview.lazy_slice[tuple(0 if n==1 else slice(None) for n in dsetview.shape)]
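A minimal sketch of the index-building idea above, using a plain NumPy array as a stand-in for the dataset view. The helper name squeeze_index is hypothetical, and it is an assumption that the view accepts the same index tuple through lazy_slice[...].

```python
import numpy as np

def squeeze_index(shape):
    """Build an index tuple that drops every length-1 axis.

    Hypothetical helper: an integer 0 removes the axis,
    slice(None) keeps it unchanged.
    """
    return tuple(0 if n == 1 else slice(None) for n in shape)

# Stand-in for a dataset view; only .shape and indexing are used.
a = np.zeros((5, 1, 3, 1))
idx = squeeze_index(a.shape)
squeezed = a[idx]  # same shape as np.squeeze(a)
```

With a lazy view, the same tuple would presumably be passed as dsetview.lazy_slice[idx], so no data is read until the view is materialized.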
It's possible to alias dsetview.transpose to dsetview.lazy_transpose for one's use case. h5py datasets do not have a transpose method, so nothing is lost there, but if there's a different underlying class, the alias would override its transpose method.
I'm not too sure about integrating a general lazy reshape. Perhaps it could work in certain cases, such as reshaping at chunk boundaries. PRs are welcome, but a general reshape working along with lazy transpose and slicing might get too complicated.
To implement min and similar functions, the data has to be read, or perhaps processed sequentially in chunks. Have you looked into dask?
Also, for transposing data in place, fastremap may be of help if memory is constrained.
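As a rough illustration of the "sequentially processed in chunks" idea, here is a sketch of a running minimum along one axis. It assumes only that the object has a .shape and supports NumPy-style slicing (as h5py datasets do); chunked_min and its chunk parameter are hypothetical names, not part of lazy_ops.

```python
import numpy as np

def chunked_min(data, axis=0, chunk=1024):
    """Minimum along `axis`, reading at most `chunk` entries of that axis at a time."""
    n = data.shape[axis]
    result = None  # stays None if the axis is empty
    for start in range(0, n, chunk):
        index = [slice(None)] * len(data.shape)
        index[axis] = slice(start, min(start + chunk, n))
        # Only this block is materialized in memory.
        block = np.asarray(data[tuple(index)])
        partial = block.min(axis=axis)
        # Combine the block's minimum with the running minimum.
        result = partial if result is None else np.minimum(result, partial)
    return result

# In-memory stand-in for an on-disk dataset.
x = np.random.default_rng(0).normal(size=(50, 7))
col_min = chunked_min(x, axis=0, chunk=8)
```

Peak memory is roughly one block plus the result, so the chunk size can be tuned to the dataset's on-disk chunk layout.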
I spent much of the last week implementing lazy reshape. And yes, it was complicated. I ended up rewriting about 90% of the code. Though I got it working, and I tested quite a few combinations of transpose and reshape and slice, I'm sure there are some corner cases where it will fail.
After I've had more time to play with it I'll push my changes to my fork, but it's such a huge change that I doubt a PR is what you want. I'll post here again when I feel it's ready for other eyes and you can let me know how you feel.
I took a quick look at dask but it didn't seem to meet my use case. I should look again. Cheers!
@cboulay .T works as a lazy transpose. Let us know if you come up with anything for reshape. It sounds like a tough problem. We would be interested in incorporating it if it doesn't dramatically increase the difficulty of supporting this package.
I'm not quite ready to say it's suitable for a PR, but I'll post the main commit for reference in case I get otherwise distracted and someone wants these features without waiting for me to clean it up more: https://github.com/cboulay/lazy_ops/commit/1ef59e69a31453198710f2d5a561d6ce6ca20412
As I was working on it, I thought it would have been better to use strides as the object's state variable to manage transpose & reshape rather than the solution I came up with. I'm sure someone cleverer than myself could get it to work and it would be more elegant than my solution and probably much more flexible.
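For reference, this is how NumPy itself tracks transpose and reshape through shape and strides for in-memory arrays. It is only an illustration of the strides idea, not the code in the linked commit, and c_strides is a hypothetical helper.

```python
import numpy as np

def c_strides(shape, itemsize):
    """Byte strides of a C-contiguous array with the given shape."""
    strides = []
    step = itemsize
    for n in reversed(shape):
        strides.append(step)
        step *= n
    return tuple(reversed(strides))

a = np.arange(24, dtype=np.int64).reshape(2, 3, 4)
base = c_strides(a.shape, a.itemsize)

# Transpose: no data moves; shape and strides are permuted together.
perm = (2, 0, 1)
t = a.transpose(perm)
t_shape = tuple(a.shape[p] for p in perm)
t_strides = tuple(base[p] for p in perm)

# Reshape of contiguous data: only the strides are recomputed.
r = a.reshape(6, 4)
r_strides = c_strides((6, 4), a.itemsize)

# Reshape after this transpose cannot be described by a single set of
# strides, so NumPy falls back to a copy.
flat = t.reshape(-1)
copied = not np.shares_memory(flat, a)
```

The last case is the hard part for a lazy wrapper: once a reshape follows a transpose, shape and strides alone may no longer describe the mapping back to the stored data.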
I'm trying to see how far I can take my ~50 GB hdf5 datasets through my processing pipeline before explicitly creating an ndarray. My pipeline uses a framework (Neuropype) that puts the ndarray in a container along with some metadata and makes extensive use of ndarray functions returning views. I think I could get a lot further in this framework with my h5 dataset if a wrapper class like DatasetViewh5py reimplemented some of those ndarray functions that return views.

Are there any downsides to renaming lazy_transpose to transpose?
Do you foresee any problems with a lazy implementation of reshape?
I'm also considering a custom implementation of squeeze.
numpy users expect flatten() to return a copy so probably not that one.
What about min, max, argmin, argmax, any and all when an axis is provided? Even though all of the data will have to be loaded into memory eventually, it can be done sequentially row-by-row (or column-by-column) so maybe this will help avoid out-of-memory errors. I am fairly new to processing data cached-on-disk so I'm hoping others with more experience can tell me if this is a bad idea from the outset.
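Index-returning reductions such as argmin need a little more bookkeeping than min, because each block's local index has to be shifted by the block's offset. A sketch under the same assumptions as before (the object has .shape and supports NumPy-style slicing); chunked_argmin is a hypothetical name.

```python
import numpy as np

def chunked_argmin(data, axis=0, chunk=1024):
    """argmin along `axis`, reading at most `chunk` entries of that axis at a time."""
    n = data.shape[axis]
    best_val = None
    best_idx = None
    for start in range(0, n, chunk):
        index = [slice(None)] * len(data.shape)
        index[axis] = slice(start, min(start + chunk, n))
        block = np.asarray(data[tuple(index)])
        local_val = block.min(axis=axis)
        # Shift the block-local index into the full array's coordinates.
        local_idx = block.argmin(axis=axis) + start
        if best_val is None:
            best_val, best_idx = local_val, local_idx
        else:
            # Strict comparison keeps the earliest index on ties,
            # matching np.argmin.
            better = local_val < best_val
            best_idx = np.where(better, local_idx, best_idx)
            best_val = np.where(better, local_val, best_val)
    return best_idx

# In-memory stand-in for an on-disk dataset; integers make ties likely.
x = np.random.default_rng(0).integers(0, 100, size=(50, 7))
idx = chunked_argmin(x, axis=0, chunk=8)
```

any and all follow the simpler pattern of min and max, combining per-block results with np.logical_or and np.logical_and respectively.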