catalystneuro / IBL-to-nwb

Conversion of IBL data to NWB format.
BSD 3-Clause "New" or "Revised" License

CN/IBL TODO: discuss raw vs. processed separation #74

Closed CodyCBakerPhD closed 2 days ago

CodyCBakerPhD commented 2 weeks ago

And should 'extra' (potentially correctable in future revisions) metadata be associated with raw data, or removed so that we can officially publish a persistent version of the 30+ TB that will not need to be changed or duplicated going forward?

CodyCBakerPhD commented 1 week ago

@grg2rsr This is one of the more important things to consider early on.

The short summation:

Assets on DANDI, once officially 'published' (to mint a DOI), become persistent and frozen. For NWB Dandisets, an 'asset' is a single NWB file. The current approach is to bundle ALL metadata and processed data alongside the bulk of the raw data (the electrical series from the multiple Neuropixels probes). This means that any time the metadata must be updated in a file containing raw data (perhaps corresponding to a recent 'revision' of the processed data), the entire file must be reuploaded and republished. This needlessly multiplies the amount of storage space taken up on the S3 bucket.

This is why we've been waiting for clearance from your team to publish the current Dandiset, which, though heavily used, is still in 'draft' because of its known issues.

What I propose is to simply write the bulk raw data (which will never change) once, to separate stand-alone files, that have minimal associated metadata

Then, any time a new data revision for the processed / histology / atlas / etc. is reconverted and reuploaded, you can simply republish those new files, which is much less data waste (and even kind of useful as a way to observe changes over versioned releases)
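To make the storage argument concrete, here is a rough back-of-the-envelope sketch. It assumes the worst case in which every bundled file is touched by each revision and therefore fully reuploaded. The 30 TB raw figure is the lower bound mentioned in this thread; the processed-data size and the function name are hypothetical placeholders, not measured values.

```python
def total_storage_tb(raw_tb: float, processed_tb: float, n_versions: int, split: bool) -> float:
    """Estimate total storage used after publishing `n_versions` versions.

    Bundled layout: each version re-stores raw + processed data.
    Split layout: raw data is stored once; only processed data is re-stored.
    """
    if n_versions < 1:
        raise ValueError("n_versions must be at least 1")
    if split:
        return raw_tb + processed_tb * n_versions
    return (raw_tb + processed_tb) * n_versions


RAW_TB = 30.0        # "30+ TB" from the thread, taken as a lower bound
PROCESSED_TB = 0.5   # hypothetical size of processed data + metadata

for n in (1, 2, 3):
    bundled = total_storage_tb(RAW_TB, PROCESSED_TB, n, split=False)
    separate = total_storage_tb(RAW_TB, PROCESSED_TB, n, split=True)
    print(f"{n} version(s): bundled = {bundled} TB, split = {separate} TB")
```

Under these assumed numbers, the bundled layout grows by the full raw size with every published version, while the split layout grows only by the size of the processed files.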

Please let me know what approach you prefer here ASAP so I can make adjustments in the next week or so

cc: @oliche @mayofaulkner @GaelleChapuis

grg2rsr commented 1 week ago

> What I propose is to simply write the bulk raw data (which will never change) once, to separate stand-alone files, that have minimal associated metadata

The discussion on our side has been in agreement with this. If streaming is a performant and viable option, the processed / histology / atlas data might even be lumped together in a single file, so that there is a {eid}-processed-only.nwb and a {eid}-raw-only.nwb.

The processed-only file will then contain all the fields that might change in future revisions and is comparatively lightweight, so that hosting multiple revisions of it might be an option.

The only problem with this was, if I understood from your side correctly, that the location of the probe insertion must be present in the NWB file that contains the ElectrodeGroup (which is the -raw-only.nwb). Or is this now fixed / covered by #73?
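The two-file naming suggested in this comment could be sketched as follows. This is only an illustration: the helper name `split_file_names` is hypothetical, the all-zero eid is a placeholder, and the validation step assumes that an eid is a UUID string.

```python
import uuid


def split_file_names(eid: str) -> dict[str, str]:
    """Return the two per-session file names for a given session eid."""
    # Assumption: eids are UUID strings; this raises ValueError otherwise.
    uuid.UUID(eid)
    return {
        "raw": f"{eid}-raw-only.nwb",
        "processed": f"{eid}-processed-only.nwb",
    }


# Placeholder eid, not a real session.
names = split_file_names("00000000-0000-0000-0000-000000000000")
print(names["raw"])
print(names["processed"])
```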

CodyCBakerPhD commented 1 week ago

> The only problem with this was, if I understood from your side correctly, that the location of the probe insertion must be present

We can leave the location as an empty string ("") if the only purpose of a raw-only file is to store the electrical series.

grg2rsr commented 4 days ago

Seems to me like the best way forward.

oliche commented 4 days ago

One of the reasons for splitting the files here is that, by nature, "raw" acquisition files are often bulkier and also much less likely to change than others. So here we would split the files into "datasets that don't change" versus "datasets that may change". The motivation is to save space and allow revisions of pre-processed inputs without incurring the full cost of a re-upload. This is what the discussion above seems to converge on, and I agree.

There are also user-centered reasons that make splits desirable: a user doesn't want to have to fetch the full raw data package if she only wants, say, the LFP band. In theory this can be addressed by a streaming strategy. We'd like to try out the user experience on one of our newly uploaded sessions, simulating different scenarios, to decide whether further splits are desirable (video / LFP / AP come to mind), depending on how it goes!

@CodyCBakerPhD your expertise would be helpful here to find the most appropriate way to access the data for a given scenario!