NeurodataWithoutBorders / matnwb

A Matlab interface for reading and writing NWB files
BSD 2-Clause "Simplified" License

data compression #50

Closed: bendichter closed this issue 4 years ago

bendichter commented 6 years ago

Is there any way to use matnwb to tell HDF5 to apply compression to a dataset?

Breaking it down, the following will be needed for implementation:

lawrence-mbf commented 6 years ago

Not at the moment.

bendichter commented 6 years ago

@ln-vidrio I'm working on using matnwb to convert MWorks data from the Movshon Lab and it's going well, but I could really use this feature.

lawrence-mbf commented 6 years ago

I understand the concern, but I'm unable to work on this right now. I can describe the gist of what needs to be done, but I can't guarantee completion by any particular deadline.

Reading: this is effectively done already, since MATLAB automatically decompresses the data when you read it.

Writing: the low-level part is simple: create a dataset creation property list and call H5P.set_chunk and H5P.set_deflate on it before creating the dataset (chunking is required for the deflate filter). Again, the open question is where those settings are stored.

User interface: this is the bigger issue. There needs to be a way to retain and save the compression settings for a given dataset, either as a wrapper around the raw data or as a flag within a DataStub.
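
For reference, here is a minimal standalone sketch of those low-level calls using MATLAB's HDF5 interface (this is not matnwb code; the file and dataset names are only illustrative):

% Sketch: write a gzip-compressed dataset with MATLAB's low-level HDF5 API.
data = rand(2048, 2048);
fid  = H5F.create('compressed_example.h5', 'H5F_ACC_TRUNC', 'H5P_DEFAULT', 'H5P_DEFAULT');
tid  = H5T.copy('H5T_NATIVE_DOUBLE');
sid  = H5S.create_simple(2, fliplr(size(data)), []);
dcpl = H5P.create('H5P_DATASET_CREATE');
H5P.set_chunk(dcpl, [1024 1024]);  % chunking is required for the deflate filter
H5P.set_deflate(dcpl, 5);          % gzip (deflate) at level 5
did  = H5D.create(fid, '/example_data', tid, sid, dcpl);
H5D.write(did, 'H5ML_DEFAULT', 'H5S_ALL', 'H5S_ALL', 'H5P_DEFAULT', data);
H5D.close(did); H5P.close(dcpl); H5S.close(sid); H5T.close(tid); H5F.close(fid);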

If all you want is smaller files, you can use this hack: in io.writeDataset, add a chunk_dims argument to the function signature and change the body as follows:

function writeDataset(fid, fullpath, type, data, chunk_dims)
if nargin < 5 || isempty(chunk_dims)
  % No chunking requested: use the default dataset creation property list.
  dcpl = 'H5P_DEFAULT';
else
  % Chunking is required for compression; enable gzip (deflate) at level 5.
  dcpl = H5P.create('H5P_DATASET_CREATE');
  H5P.set_chunk(dcpl, chunk_dims);
  H5P.set_deflate(dcpl, 5);
end
...
did = H5D.create(fid, fullpath, tid, sid, dcpl);
H5D.write(did, tid, sid, sid, 'H5P_DEFAULT', data);

Next, go to your class's export function and search for the property you wish to compress. For instance, if we wanted to compress data in TimeSeries:

...
elseif ~isempty(obj.data)
    % Pass chunk dimensions so io.writeDataset applies compression to this dataset.
    io.writeDataset(fid, [fullpath '/data'], class(obj.data), obj.data, [1024 1024]);
end
...

To do this with DataStubs, just add an optional field called chunk_dims and do the same check and H5P calls before H5D.create is called.
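
A rough sketch of that check is below (hypothetical: it assumes an optional chunk_dims property has been added to DataStub, and the surrounding export code is elided):

% Hypothetical sketch inside the DataStub export path; assumes obj.chunk_dims
% exists as an optional property and that fid, fullpath, tid, and sid are
% created as usual by the elided export code.
if isempty(obj.chunk_dims)
    dcpl = 'H5P_DEFAULT';
else
    dcpl = H5P.create('H5P_DATASET_CREATE');
    H5P.set_chunk(dcpl, obj.chunk_dims);
    H5P.set_deflate(dcpl, 5);
end
did = H5D.create(fid, fullpath, tid, sid, dcpl);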

oruebel commented 6 years ago

Just for reference, in PyNWB this is done with a data wrapper, H5DataIO: the user wraps the array they want to write in H5DataIO and can then set any HDF5-specific options (such as compression and chunking) on the wrapper.

https://github.com/NeurodataWithoutBorders/pynwb/blob/c1a616fd12c9e0e56fa847cbe607bc7511098ef1/src/pynwb/form/backends/hdf5/h5_utils.py#L196-L291

bendichter commented 6 years ago

@ln-vidrio OK, thanks for the info. I understand you have other things going on; let me know when you have time to work on this again.

@oruebel Thanks for the note. Making the two APIs similar would certainly ease the learning curve.