NeurodataWithoutBorders / matnwb

A Matlab interface for reading and writing NWB files
BSD 2-Clause "Simplified" License

Removing data from an already existing nwb file #510

Closed GoktugAlkan closed 1 year ago

GoktugAlkan commented 1 year ago

Hello,

Currently, I am trying to remove data from an existing nwb file. There is an nwb file stored on disk that I load with nwbRead. I want to remove the data in the field nwb.units. However, I couldn't find a method like the pop method that exists in pyNWB (explained here). Is there a way to resolve this issue?

Background of the issue: We created nwb files containing raw data and information on spike times/spike clusters/waveforms, stored in nwb.units as proposed in your tutorials. After creating these files, we re-tuned our spike sorting algorithm to get cleaner units. Hence, we need to change the information stored in nwb.units. That's why I am trying to delete the data in this field and repopulate it with the latest spike information. We want to avoid creating the file from scratch.

Many thanks in advance!

GoktugAlkan commented 1 year ago

Is there an update concerning this issue?

lawrence-mbf commented 1 year ago

Sorry for not getting back to you last week.

By default MatNWB should support rewriting attributes. For datasets, the rewrite must be the same shape as before.

lawrence-mbf commented 1 year ago

By the way, this is done by exporting an open NWB file to its original file location, so you don't have to create the NwbFile object from scratch.
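As a minimal sketch of the workflow described above (file name, attribute, and dataset names are illustrative, not from the original thread):

```matlab
% Read the existing file, modify it in memory, and export back to the
% same path. Attributes can be rewritten freely; datasets can only be
% rewritten with data of the same shape as before.
nwb = nwbRead('session.nwb');

% Example: overwrite an attribute-backed property.
nwb.general_session_id = 'session-42';

% Example: rewrite a dataset IN PLACE (new data must match the old shape):
% nwb.units.spike_times.data = newSpikeTimes;

% Exporting to the original location rewrites the file in place.
nwbExport(nwb, 'session.nwb');
```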

GoktugAlkan commented 1 year ago

@lawrence-mbf Thanks for the response. The problem is that the dataset we want to insert may differ significantly from the existing one. For example, we may have 1000 fewer spikes than before, so the shape of the dataset would not be preserved.

Therefore, instead of overwriting this field, my idea was to first delete all data inside nwb.units, then populate the field with our revised spike information, and finally store the file in the same location.

What would be the best way to realize this?

Thanks in advance!

lawrence-mbf commented 1 year ago

@GoktugAlkan For attributes you can always use H5A.delete for data you don't need.

For datasets there's currently no way that I've found to rewrite AND resize the data without using chunking. I wonder what pynwb is actually doing under the hood for the pop method because even low-level MATLAB calls are unable to delete dataset containers as far as I can tell.

The only other way I've found is by "unlinking" the data and repacking: https://www.mathworks.com/matlabcentral/answers/395920-how-can-i-delete-a-dataset-completely-from-a-group-in-a-hdf5-file

No clue how performant this actually is.
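The "unlinking" approach from the linked answer can be sketched with MATLAB's low-level HDF5 wrappers (file name and paths are illustrative):

```matlab
% Unlink the /units group from an NWB (HDF5) file. H5L.delete only removes
% the link to the object; the allocated space is NOT reclaimed until the
% file is repacked with h5repack.
fid = H5F.open('session.nwb', 'H5F_ACC_RDWR', 'H5P_DEFAULT');
H5L.delete(fid, '/units', 'H5P_DEFAULT');
H5F.close(fid);

% For an unwanted attribute on a group or dataset, H5A.delete works
% similarly, e.g. (attribute name hypothetical):
% gid = H5G.open(fid, '/units');
% H5A.delete(gid, 'some_attribute');
% H5G.close(gid);
```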

GoktugAlkan commented 1 year ago

@lawrence-mbf Thanks a lot. With the functions in the provided link I am able to delete the field nwb.units. I will follow up soon with my final conclusion.

bendichter commented 1 year ago

@GoktugAlkan keep in mind there is an oddity of HDF5 that deleted objects still take up space in the file. In the case of a units table, this may not be a major problem, but it can be quite wasteful in some circumstances. To solve this, you should use the h5repack command line utility: https://manpages.ubuntu.com/manpages/lunar/man1/h5repack.1.html
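Since h5repack is a command-line utility, one way to fold it into a MATLAB pipeline is a system() call (a sketch; assumes h5repack is on the PATH, file names are illustrative):

```matlab
% h5repack writes a compacted copy of the file, reclaiming the space left
% by unlinked objects. Replace the original only if the repack succeeded.
[status, output] = system('h5repack session.nwb session_repacked.nwb');
if status ~= 0
    error('h5repack failed: %s', output);
end
movefile('session_repacked.nwb', 'session.nwb');
```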

GoktugAlkan commented 1 year ago

@bendichter Thanks! Concerning this point, I guess that when the data I want to insert into the deleted field is bigger than the previous data (i.e. bigger than the space the deleted object takes up in the file), there should be no problem. Is this correct?

lawrence-mbf commented 1 year ago

@GoktugAlkan There is no such guarantee unless you use chunking and/or h5repack. Unlinking just removes references to the data but still keeps the allocated space around.

GoktugAlkan commented 1 year ago

@lawrence-mbf @bendichter Since I am preparing the pipelines for lab members who use MATLAB, it would be very convenient to write a MATLAB function that applies this repacking. However, it seems that this is not possible in this case. I will try to apply the repacking and keep you informed about the progress.

If possible, it would be nice to have a method like pop as in pyNWB that applies the resizing/repacking of the file.

bendichter commented 1 year ago

PyNWB has the same issue. If you want to remove a dataset and free the space, you either need to write everything to a new file or use h5repack.

GoktugAlkan commented 1 year ago

@bendichter @lawrence-mbf The repacking works. I tested this on an nwb file from which I deleted the acquisition field. After removing that field, the file still occupied the same amount of storage, but after running h5repack, the file size decreased significantly.

GoktugAlkan commented 1 year ago

@bendichter @lawrence-mbf I did a final test where I added large datasets to nwb files that had been repacked before. This also works. In addition, the repacked (and re-populated) nwb files can be read in pyNWB.

As I said before, it would be nice if you provided a function in MatNWB that handles the deletion of an nwb field and the subsequent repacking.

If you want we can close this issue. Thanks a lot!