NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

"lazy" modular storage #907

Open NileGraddis opened 5 years ago

NileGraddis commented 5 years ago

Here is a use pattern that I'm interested in. I don't know whether it is currently supported by pynwb or whether there is a better way to accomplish the same goals, so I'll just lay it out here.

Setting: The Allen Institute serves data over a public http API. The AllenSDK contains code for making these queries and caching the results locally. For our physiology projects, we are working on serving these data as NWB 2.0 files.

The data access and caching behavior in the AllenSDK is generally lazy - users only have to download data that they ask for.

Problem: How should I implement lazy downloading for NWB 2.0-formatted data? Here are some example use cases:

  1. Each Neuropixels (ecephys) session comes with sorted spike data & metadata (<= 1 GB) and 6 probe-wise LFP datasets (each ~1.5 GB). I would like users to only have to download the LFP for a probe if they actually try to access it, since otherwise they have to download 9 GB of data they don't necessarily care about just to access the spikes. (A sketch of the access pattern I'm after follows this list.)
  2. We present common stimuli to the subjects of many sessions. It would be nice if these stimulus templates could be downloaded once (when they are actually asked for) and then referenced by each session that uses them.
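
For concreteness, here is roughly the access pattern I'm after. Everything named here (`LazySession`, `get_lfp`, the `download` callable) is a placeholder I made up for illustration, not existing AllenSDK or pynwb API:

```python
from pynwb import NWBHDF5IO


class LazySession:
    """Placeholder sketch of the desired behavior: the small session file is
    read up front, each ~1.5 GB probe LFP file only when first requested."""

    def __init__(self, session_path, lfp_paths, download):
        # IO handles are deliberately kept open so the returned objects stay usable
        self._io = NWBHDF5IO(session_path, mode="r")
        self.nwbfile = self._io.read()    # spikes, units table, metadata (<= 1 GB)
        self._lfp_paths = lfp_paths       # probe_id -> remote location of its LFP file
        self._download = download         # callable that fetches a file, returns local path
        self._lfp_ios = {}

    def get_lfp(self, probe_id):
        # the probe's LFP file is only downloaded the first time it is asked for
        if probe_id not in self._lfp_ios:
            local_path = self._download(self._lfp_paths[probe_id])
            self._lfp_ios[probe_id] = NWBHDF5IO(local_path, mode="r")
        return self._lfp_ios[probe_id].read()
```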

Potential solutions

  1. Make separable representations of these data and stitch them together in the AllenSDK. I don't like this approach because the resulting NWB data would not be correctly accessible from non-AllenSDK tools.
  2. Use dataset / group linking to split an NWBFile's contents across multiple HDF5 files and then lazily download them. Currently this doesn't work, since broken links cause the reader to fail entirely.
  3. Let broken links fail at access time, but allow reader code to wrap access in a pre-access callback (which might download the required file). This is closest to my ideal solution, but seems pretty divergent from current behavior. (An h5py-level sketch of 2 and 3 follows this list.)
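
At the HDF5 level, I picture 2 and 3 as something like the sketch below (h5py only; all file and dataset paths are made up, and `fetch` is a hypothetical download hook). For 3 to work, pynwb would have to tolerate the `KeyError` at read time and defer it to access time:

```python
import h5py

# option 2: the LFP data lives in a satellite file that may not exist locally
# yet, and the main file only carries an external link to it
with h5py.File("session.nwb", "a") as f:
    acq = f.require_group("acquisition")
    acq["probe_0_lfp_data"] = h5py.ExternalLink(
        "probe_0_lfp.nwb", "/acquisition/lfp/data"
    )


def read_with_fetch(f, path, fetch):
    # option 3 in spirit: if the link is dangling because the satellite file
    # is missing, call a user-supplied fetch() and retry the access
    try:
        return f[path]
    except KeyError:
        fetch("probe_0_lfp.nwb")  # e.g. download it next to session.nwb
        return f[path]
```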

@ajtritt @oruebel @bendichter Is this a terrible idea (or already implemented in a way I don't know about, or already on the roadmap)? How can I support incremental download?

bendichter commented 5 years ago

adding @rly to the convo as well. He's new to the dev team and has already been very helpful with external link issues.

I'd say these are both jobs for external links. I'd like to build infrastructure around them so they work for you in both of these use cases. Let's home in on precisely which failure modes are preventing you from using them in this way and see if we can resolve them for you.
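
For reference, the sort of pattern I have in mind today, roughly following the pynwb linking approach (paths and names below are made up, and I'm going from memory on the `link_data` flag):

```python
from datetime import datetime

from dateutil.tz import tzlocal
from pynwb import NWBFile, NWBHDF5IO, TimeSeries

# open the satellite file that actually holds the big LFP arrays; this handle
# must stay open until the session file has been written
lfp_io = NWBHDF5IO("probe_0_lfp.nwb", mode="r")
lfp_series = lfp_io.read().acquisition["lfp"]

session_file = NWBFile(
    session_description="ecephys session (sketch)",
    identifier="session_12345",
    session_start_time=datetime.now(tzlocal()),
)

# reuse the h5py datasets from the other file; assuming the source series is
# timestamp-based rather than rate-based
linked = TimeSeries(
    name="lfp",
    data=lfp_series.data,              # h5py.Dataset living in probe_0_lfp.nwb
    timestamps=lfp_series.timestamps,  # likewise
    unit=lfp_series.unit,
)
session_file.add_acquisition(linked)

# link_data=True should write external links instead of copying the arrays
with NWBHDF5IO("session.nwb", mode="w") as io:
    io.write(session_file, link_data=True)
```

The missing piece for your use case is then what happens when session.nwb is opened on a machine where probe_0_lfp.nwb hasn't been downloaded yet.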

Also I believe we have a more robust solution in the works that would interface directly with the AllenSDK, but that's further down the road.

oruebel commented 5 years ago

We should talk some more about this during the hackathon. I agree that 1 would be the least optimal option. 2 I think should be doable, since the main problem is to have a workaround for dealing with broken external links in PyNWB. 3 is something that we have on the roadmap: the idea is to support "foreign fields" (basically web-based external links). We have not started work on this yet, but we plan to do some planning for it during the hackathon. As such, I can't give you an exact timeline yet; I would hope for something in the 6-12 month timeframe, but we'll have to see.
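
To give a rough picture of the idea (purely hypothetical; none of this exists in PyNWB/HDMF yet and the final design may look quite different): the file would store a URL plus an object path instead of a local external link, and the API would resolve it on first access, e.g.:

```python
import os
import urllib.request

import h5py

# hypothetical "foreign field": a URL plus an object path inside that file
foreign = {
    "url": "https://example.org/data/probe_0_lfp.nwb",  # made-up URL
    "path": "/acquisition/lfp/data",
}


def resolve_foreign_field(field, cache_dir="nwb_cache"):
    """Download the referenced file on first access, then open the dataset."""
    local = os.path.join(cache_dir, os.path.basename(field["url"]))
    if not os.path.exists(local):
        os.makedirs(cache_dir, exist_ok=True)
        urllib.request.urlretrieve(field["url"], local)
    return h5py.File(local, "r")[field["path"]]
```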

I think it would make sense to look at fixing 2 during the hackathon to get you going for now and at the same time start planning the roadmap for 3.

NileGraddis commented 5 years ago

@bendichter @oruebel Thank you for the swift response. I like the idea of fixing 2 soonish (hackathon time sounds good) and working on 3 longer-term. I will post an example file and some code in this thread.

@rly Hi!

NileGraddis commented 5 years ago

working on 2 now.

NileGraddis commented 5 years ago

@oruebel @ajtritt @rly

I made some progress on this during the hackathon, but mainly in the direction of running into more problems :P Here is what I've tried (the overall goal is to store the big LFP data in satellite files):

  1. Set the data and timestamps of an ElectricalSeries as external links to arrays / timeseries in another file. Problem: a TimeSeries has some attributes which are actually stored on its data on disk but read onto the TimeSeries object at initial read time. Some ideas for progress in this direction are to try adding another layer of indirection to the linking (low confidence) or to see whether HDF5 external links can have their own attributes. Another option is to try to lazily read these attributes off of the TimeSeries data in general and make that data an NWB type itself. I think the latter is a good path in general, but it also implies a large set of changes to TimeSeries and related types. (This attempt is sketched at the HDF5 level just after this list.)
  2. Within the satellite files, link to the main file's electrode table. This falls over because the electrode table has a bunch of references to electrode groups which are not correctly dereferenced when reading the satellite file. (I'm also not totally sure how to indicate to a reader of the main file that the satellite files are present in this case - I suppose just make a link to the LFP object.)
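
For concreteness, attempt 1 at the HDF5 level looks roughly like this (all paths are illustrative):

```python
import h5py

# attempt 1: the ElectricalSeries' data and timestamps in the main file are
# external links into the satellite file that holds the actual arrays
with h5py.File("session.nwb", "a") as main:
    es = main.require_group("acquisition/probe_0_lfp")
    for name in ("data", "timestamps"):
        if name in es:  # replace any in-file dataset with a link
            del es[name]
        es[name] = h5py.ExternalLink("probe_0_lfp.nwb", f"/acquisition/lfp/{name}")

# the snag: attributes such as "unit" and "conversion" are stored on the data
# dataset itself and are read when the ElectricalSeries is constructed, so the
# link must already resolve (the satellite file must be local) at read time,
# and an HDF5 external link cannot carry its own attributes to work around that.
```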

Both of these require subclassing HDMFDataset so that reading the file does not immediately choke on construction failures (and so that the failure at access time provides useful information).
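
Something along these lines, maybe (hypothetical and untested; `LazyDataset` is a stand-in for whatever HDMFDataset subclass would actually be needed):

```python
import h5py


class LazyDataset:
    """Hypothetical stand-in: swallow the missing satellite file at construction
    time and raise something informative when the data is actually accessed."""

    def __init__(self, file_path, object_path, fetch_hint):
        self._file_path = file_path      # satellite file that may not be local yet
        self._object_path = object_path  # path of the dataset inside that file
        self._fetch_hint = fetch_hint    # e.g. a URL or a download instruction

    def _resolve(self):
        try:
            return h5py.File(self._file_path, "r")[self._object_path]
        except (OSError, KeyError) as err:
            raise RuntimeError(
                f"{self._object_path} is stored in {self._file_path}, which is not "
                f"available locally. Hint: {self._fetch_hint}"
            ) from err

    def __getitem__(self, item):
        return self._resolve()[item]
```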

Anyways, this seems like a pretty hard problem and one that I don't have the bandwidth to tackle alone (though I am of course happy to help out). Maybe you guys have some ideas? This issue is pretty important for our upcoming data release - we can't really have people downloading 10 GB files just to access the units table.