cosanlab / nltools

Python toolbox for analyzing imaging data
https://nltools.org
MIT License
120 stars 42 forks source link

Replace deepdish #416

Closed ejolly closed 8 months ago

ejolly commented 1 year ago

Ok @ljchang I took a crack at replacing deepdish for saving and loading .h5 Brain_Data and Adjacency objects, which addresses #415. Feel free to play around with this PR and if nothing too strange is happening we can marge it in!

ljchang commented 9 months ago

Seems to be working for me. let's go ahead and merge. (also not sure why my previous comments aren't showing up to do the same thing).

ejolly commented 8 months ago

Ok @ljchang I did some formal testing about trying to read in h5 files made with the current main version of nltools and this new proposed h5 fix, and it doesn't work 🙁 So if we merge this in then any h5 files created with previous versions of nltools will need old versions of nltools to be loaded, rendering them basically useless over time.

The reason seems to be that the way the deepdish does saving/loading is very different and more complicated from how h5py does which is pretty much like opening any other file: with H5File('brain.h5', 'r'):

As a result this new PR isn't backwards compatible. How do you want to proceed? A few thoughts:

ljchang commented 8 months ago

That's a bummer @ejolly . What specifically isn't working? I have tried loading older hdf5 files from the FNL dataset and they seem to be loading okay for me. Though I don't have any .X or .Y attributes stored in them.

ejolly commented 8 months ago

Yea it's primarily the X and Y attributes. We can't load those easily.

ljchang commented 8 months ago

ok, I think that is because they are using pytables to deal with the pandas stuff.

I have a couple of thoughts.

1) how hard does it look to try and emulate this aspect of deepdish? 2) how big of a deal is it if we don't support older hdf5 files that contain .X or .Y info? pickle has not done this (super annoying BTW). I think the pure numpy ones should be fine. We could just throw an error or warning if .X or .Y info is detected on the newer version that you will need to load this using an older version of nltools. I have no idea how often people even use this functionality. I personally never do because I'm usually just trying to make a faster version of a nifti file and those don't store this type of metadata anyway.

ejolly commented 8 months ago

@ljchang I was able to solve this now and verified it works across different nltools version! Brain_Data now accepts a legacy_h5 kwarg that when set to True will use pytables to mimic how deepdish was loading files. Otherwise it will use h5py which is the new default going forward.

The only thing we lose when loading h5 files from older versions of nltools is the column-type data for the X and Y attributes. Everything becomes at object type but I think that's probably acceptable and the best we can do.

I added a test for this and the last thing left to do when I get a chance is add support for legacy Adjacency hdf5 files.

ljchang commented 8 months ago

awesome! nice work!

ejolly commented 8 months ago

Legacy h5 support for Adjacency is now implemented. One thing to note is that new hdf5 files (for brain or adjacency) created with these updates are a bit larger than those created with deepdish because of the default compression (zlib) deepdish was using. h5py supports gzip natively which I have implemented as the default, but it might be worth looking into this helper library for further compression support: https://pypi.org/project/hdf5plugin/

Tests are only failing on python 3.7 which is fine because we're deprecating support for that version (and adding up to 3.11)