Closed by ajtritt 7 years ago
Currently, data streams are written sequentially, e.g., all data from one DataChunkIterator will be written before the next dataset gets written. This means that while multiple data streams are supported, the write happens in sequence, one stream after the other. This is fine in situations where the full stream is available (e.g., when converting data), but in situations where data is generated simultaneously online (e.g., when recording data during an experiment) this can be limiting. This ticket describes the need to be able to write from multiple data streams simultaneously. Technically this is challenging for several reasons:
See also Issue #14
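The contrast between the current sequential behavior and a simultaneous write can be sketched in plain Python, without any pynwb or h5py dependency. This is a hedged illustration only: the function names (`write_sequential`, `write_round_robin`) and the list-based `sink` are hypothetical stand-ins, not pynwb APIs; a real implementation would write chunks to HDF5 datasets.

```python
def write_sequential(streams, sink):
    """Current behavior: drain each stream fully before starting the next."""
    for name, stream in streams:
        for chunk in stream:
            sink.append((name, chunk))

def write_round_robin(streams, sink):
    """One possible interleaving: take one chunk per stream per pass,
    dropping streams as they are exhausted."""
    iters = {name: iter(s) for name, s in streams}
    while iters:
        for name in list(iters):
            try:
                sink.append((name, next(iters[name])))
            except StopIteration:
                del iters[name]

streams = [("ephys", [0, 1, 2]), ("behavior", ["a", "b"])]
seq, rr = [], []
write_sequential(streams, seq)
write_round_robin(streams, rr)
# seq holds all "ephys" chunks before any "behavior" chunk;
# rr alternates between the two streams until both are exhausted.
```

Note that round-robin scheduling only addresses the ordering of writes; it does not by itself solve the HDF5-level issues (fragmentation, chunk cache behavior) discussed below.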
Regarding issue 1 (fragmentation), I think this reflects that writing multiple datasets simultaneously is actually the wrong solution for the problem of storing data directly from acquisition systems into NWB stored in HDF5. It works against how HDF5 is designed to perform well.
A better approach might be to store all raw acquisition data streamed into a single large dataset, so that chunking etc. works naturally and efficiently. The data can then either (i) be moved once acquisition has finished into the NWB layout (which might give performance benefits for later processing, but is a potentially expensive one-off operation), or (ii) references can be used to link datasets in the NWB layout to the relevant hyperslabs in the raw acquisition dataset (a quick operation, but one that may require new features in the NWB schema, and may result in slower access to these individual datasets).
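The bookkeeping behind option (ii) can be sketched as a table mapping each logical dataset to a hyperslab (offset, length) within one monolithic dataset. This is a toy model under stated assumptions: the class name `MonolithicStore` and its methods are hypothetical, a Python list stands in for the single large HDF5 dataset, and in real HDF5 the slab table's role would be played by region references.

```python
class MonolithicStore:
    """Toy model of option (ii): one append-only buffer plus a table
    mapping logical dataset names to (offset, length) hyperslabs."""

    def __init__(self):
        self.buffer = []   # stand-in for the single large acquisition dataset
        self.slabs = {}    # name -> (offset, length), i.e., the reference table

    def append_stream(self, name, samples):
        """Append a stream's samples and record the hyperslab it occupies."""
        offset = len(self.buffer)
        self.buffer.extend(samples)
        self.slabs[name] = (offset, len(samples))

    def read(self, name):
        """Resolve a logical dataset by slicing its hyperslab back out."""
        offset, length = self.slabs[name]
        return self.buffer[offset:offset + length]

store = MonolithicStore()
store.append_stream("probe0", [1.0, 2.0, 3.0])
store.append_stream("probe1", [4.0, 5.0])
# store.read("probe1") -> [4.0, 5.0]
```

The extra indirection on every read is the "slower access" trade-off mentioned above, and the slab table is exactly the "separate data structure" concern raised in the reply below.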
@jonc125 thanks for the suggestion. Do you mean writing individual datasets to individual files? I'm just not sure how we would deal with writing datasets of different dimensions to a single dataset. A separate data structure would need to be maintained to track the layout of the monolithic dataset. I am also confused about what you mean by "naturally and efficiently". Does HDF5 chunk better when it only has a single dataset to deal with at a time?
I guess it does get more complicated when you have data coming from multiple instruments and hence (potentially) multiple sampling rates. I'd been envisioning that you could define an n-d array containing all the data with a common time reference, which would avoid the problem. Unfortunately I'm not an expert on what HDF5 does internally when writing datasets to disk. I'd expect that if you can define the full dataset size up-front, it'll ensure the data isn't fragmented within the file, but if you have an unlimited dimension which gets extended then fragmentation is likely to occur. Raw disk access will also be slower if you're writing to either multiple datasets or multiple files simultaneously, because the writes are very unlikely to target contiguous bits of disk, so the hardware/filesystem caching won't be as effective.
There will be approaches to this in software that currently stores data from instruments; it's probably worth reviewing these to see whether any are adaptable to NWB. But I maintain my inclination that NWB is best suited (and I gather designed) as an exchange format for data, not necessarily something to write directly when acquiring. I don't think it should try to do everything :)
Do we actually have any candidate acquisition systems that want to write directly to HDF5? IME, acquisition systems prefer to use their own format optimized to the throughput constraints of the acquisition system. I'm 100% with @jonc125 that we should treat NWB as an exchange format. This sounds like feature creep to me, and I fear that a lot of time would be spent developing something that never gets used.
In our experiments, we have between 2 and 8 computers acquiring neural and behavioral data simultaneously. Each computer has its own SSD drives that the acquisition software writes raw data to. After recording, a common signal is used to temporally align the different data streams. I can't envision how we would write all of this data simultaneously, let alone a general-purpose approach that would be baked into pynwb.
Thanks for the comments. I view this ticket as an item for discussion of a possible future enhancement, i.e., I don't think this is on the critical path for the first release. I mainly added a comment to the ticket yesterday because I was triaging tickets and thought the ticket needed a bit more clarification.
In some sense the question is whether this mode of operation (i.e., simultaneous write of multiple data streams to a single file) is something that we want to encourage or discourage. There are certain operations, e.g., modifying, expanding, and adding objects (e.g., datasets) to files, that HDF5 makes much easier than most other formats. However, with that power comes great responsibility, e.g., to manage the risks of corrupting files when we encounter errors, fragmentation of files when modifying objects, etc. One passive measure to manage those risks is to establish usage guidelines that avoid these kinds of operations.
With regard to the specific use case of acquisition systems, it seems that in practice, acquisition systems would likely want to route different data streams at least initially to different files anyway, if only to avoid possible errors, dependencies, and data corruption. Different files could then always be linked together via a "master" HDF5 file with external links to the other files (or merged). That being said, supporting streaming data directly to HDF5 appears to be something that is on the roadmap for HDF5 (e.g., via the Single Writer/Multiple Reader feature: https://support.hdfgroup.org/HDF5/docNewFeatures/SWMR/HDF5_SWMR_Users_Guide.pdf)
Long story short, right now this issue seems a bit abstract (i.e., this seems to be a feature that is technically possible but for which we do not currently have a concrete user need or use case that cannot be addressed in other ways). I think at least for now the appropriate strategy for this item may be to:
Yeah, I like the plan for 1 & 2.
I'd also add that were we to "solve" this before we have support for a non-HDF5 backend, we'd probably need to reinvent the solution anyway.
I'm closing this ticket for now. I've filed ticket #81 to address part 1 and we can reopen this ticket if/when 2 comes up.
Originally reported by: Andrew Tritt (Bitbucket: ajtritt, GitHub: ajtritt)
In the event that multiple datasets are written from data iterators, we will need to support writing from multiple data iterators simultaneously.