NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

Iterative data write #14

Closed: ajtritt closed this issue 1 year ago

ajtritt commented 7 years ago

Originally reported by: Oliver Ruebel (Bitbucket: oruebel, GitHub: oruebel)


Problem: Iterative Write

While working on the iterative write for the BouchardLab data I came across a couple of gotchas. This ticket is primarily to track things to look out for and to spawn a discussion on how we should deal with the problem of iterative/streaming write:

  1. Which datasets should support iterative write? Datasets that we want to allow to be written iteratively must accept Iterable-type input. It is likely that many classes currently specify only, e.g., ndarray and list as the accepted types for objects that one might want to write iteratively.

  2. Iterators and dimension checking. Strict checking of dimensions may not be possible in __init__ when we have an Iterable (rather than an array).

  3. Avoid loading all data from iterators. We should not call list(object) on an Iterable in the constructor, as that loads all of the data. E.g., ephys.FeatureExtraction.__init__ called self.fields['features'] = list(features). (See the short sketch after this list.)

  4. Iterators and read API. Once we are dealing with read APIs, allowing Iterables for datasets will likely get even more tricky. E.g., implementing array slicing when we are given an Iterable will be non-trivial (and doing so in every NWB API class will be problematic). Resolved: not an issue, because read occurs from NWB-HDF5 files only.

  5. Reusability. It would be nice to implement as much of the iterative and streaming write behavior as possible in an HDF5-agnostic fashion so that the mechanisms become reusable.

  6. Configurability. The iterative write is not very configurable right now. E.g., one may want a safer write option that flushes the data regularly to ensure we have a valid file even if a data stream breaks for some reason.

  7. Simultaneous Streaming Write. The current iterative write addresses the need where we don't want to load all data into memory in order to ingest it into a file (and in part the case where we only have a single data stream). However, it does not address the use case of streaming data from an instrument. One central challenge is that multiple dataset streams are not supported, i.e., the iterative write is currently limited to one dataset at a time, whereas data streaming will require simultaneous, iterative updates of multiple datasets; even in a single TimeSeries one might need to update both "data" and "timestamps" as data arrives. This means that, rather than iterating over one dataset until it is finished and then writing the next dataset, one will need to iterate over a collection of streams and update data as it arrives. NOTE: The datasets may also not all be part of the same group, e.g., when recording from multiple devices simultaneously one will want to create multiple TimeSeries in a streaming fashion.
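To make item 3 concrete, here is a small illustrative sketch (the generator, file name, and block sizes are made up; nothing here is pynwb API):

```python
# Hypothetical illustration of item 3.
def read_channels(path):
    """Pretend data source that yields one block of a recording at a time."""
    for _ in range(10):
        yield [0.0] * 1_000_000   # imagine each block is far larger

features = read_channels("recording.dat")

# What ephys.FeatureExtraction.__init__ effectively did: materialize every
# block in memory at once, defeating the purpose of iterative write.
all_in_memory = list(features)

# Preferable: keep the iterator and let the write backend consume it block by
# block, so only one block is resident in memory at a time.
features = read_channels("recording.dat")
for block in features:
    pass  # hand each block to the backend as it arrives
```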

Possible solutions: brainstorming

re 5) Rather than writing datasets in an iterative fashion directly when a user asks to write a file, one could initialize all datasets that need to be written iteratively; then, after all static data has been written and all datasets have been initialized, we could iterate over the datasets simultaneously. However, this behavior should be configurable: when streaming data from an instrument we have to handle simultaneous streams, but when we stream data from files (where the main purpose is to conserve memory) the current one-dataset-at-a-time behavior is preferable.

re 1,3) This is something we should pay attention to now in order to avoid having to do repeated updates.

re 1-4) Maybe we should restrict data iteration to allow only DataChunkIterators as iterable inputs. This would remove the ambiguity of having to make complex decisions based on type and allow us to deal with these issues in a more structured fashion. The overhead for users to wrap an iterator in a DataChunkIterator is fairly low, so it seems OK to make it a requirement. It also forces users to make an explicit decision when they want to write data iteratively.
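For reference, a minimal sketch of what this looks like from the user's side, assuming the DataChunkIterator import path of current hdmf (it has moved between packages over time) and a made-up generator standing in for a real data source:

```python
from hdmf.data_utils import DataChunkIterator  # import path is an assumption
from pynwb import TimeSeries

def iter_voltage():
    """Stand-in generator: yields one timepoint of 8 channels per iteration."""
    for i in range(1000):
        yield [float(i)] * 8

# Wrapping the plain generator makes the intent to write iteratively explicit;
# shape and dtype are inferred from the first buffered chunk.
data = DataChunkIterator(data=iter_voltage(), buffer_size=100)

ts = TimeSeries(name='stream', data=data, unit='V', rate=30000.0)
```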

re 4) Maybe we need a mechanism that ensures that Iterable objects are allowed only for write purposes. What would be the right mechanism for this? Or, if we only allow DataChunkIterator, we could possibly add mechanisms to load data into memory as needed.

re 6) One solution might be to put as much functionality as possible into DataChunkIterator. To iterate over multiple iterators for a true simultaneous streaming write, we could add a class for managing DataChunkIterators that would coordinate the write of the multiple streams, e.g., an IterativeWriteManager. In this way different backends could implement their own IterativeWriteManagers that reuse the functionality that is already there and just add their own custom write routines.
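A rough sketch of what such a manager could look like; IterativeWriteManager does not exist, this is purely an illustration of the round-robin coordination idea:

```python
class IterativeWriteManager:
    """Hypothetical coordinator for multiple chunked streams (not a real class)."""

    def __init__(self):
        self._streams = []  # list of (iterator, write_callback) pairs

    def register(self, iterator, write_callback):
        """write_callback(chunk) appends one chunk to its backend dataset."""
        self._streams.append((iter(iterator), write_callback))

    def run(self):
        """Round-robin over all registered streams until every one is exhausted."""
        active = list(self._streams)
        while active:
            still_active = []
            for it, write in active:
                try:
                    chunk = next(it)
                except StopIteration:
                    continue           # this stream is finished
                write(chunk)           # backend-specific write (e.g., HDF5 append)
                still_active.append((it, write))
            active = still_active
```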


ajtritt commented 7 years ago

Original comment by Andrew Tritt (Bitbucket: ajtritt, GitHub: ajtritt):


Re 2: Yes, if you are talking about dimension checking at the time the constructor is called, we cannot do it.

Re 5: I see what you are saying. As long as the write class/routine can take in a GroupBuilder object and a user provides a specification for some other format, the code is reusable in that regard. However, given that there is little need to support other backends, I would prefer not to go down the rabbit hole of inventing a solution without a defined problem.

Re 6: The groundwork is being laid for this feature. We should focus tomorrow's discussion on detailing the use cases.

Re 7: Yeah, that's what I'm referring to. The asyncio module provides the event-loop framework, but we can probably get away with just using async/await. If we rely only on async/await, simultaneous write calls shouldn't be a problem, though this approach is still constrained by the GIL :). The only way for us to get truly simultaneous writes would be to use HDF5 + MPI-IO, which imposes two development hurdles: 1) rank management with respect to different streams, since streams are only readable from the rank from which they were opened/created; and 2) HDF5 metadata operations, the most relevant of which for writing from iterators is resize; metadata operations must be done collectively, imposing synchronization overhead. Apart from that, HDF5 supports simultaneous write.
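A rough sketch of the async/await idea under discussion, with a hypothetical write_chunk callback standing in for the backend's HDF5 append and a lock to keep HDF5 calls from overlapping:

```python
import asyncio

async def stream_writer(source, write_chunk, lock):
    """Consume one async stream of chunks. `source` is an async iterator and
    `write_chunk(chunk)` is a hypothetical backend callback (e.g., an HDF5
    append); the lock keeps HDF5 calls from overlapping."""
    async for chunk in source:
        async with lock:
            write_chunk(chunk)

async def write_all(streams):
    """`streams` is a list of (source, write_chunk) pairs. One coroutine per
    stream; the event loop interleaves them, so multiple datasets grow
    concurrently within a single (GIL-bound) thread."""
    lock = asyncio.Lock()
    await asyncio.gather(*(stream_writer(src, wc, lock) for src, wc in streams))
```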

ajtritt commented 7 years ago

Original comment by Oliver Ruebel (Bitbucket: oruebel, GitHub: oruebel):


Re 1,3: Let's bring this up during our next meeting to make sure this is also on Dave's radar when he writes classes.

Re 2: For streaming data, dimension checking may not be possible at all, at least as long as we rely on reading the first chunk to determine dimensions. In the streaming case, the first chunk will likely not become available until later in the execution, during the actual data write.
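One way to sidestep this (a sketch, assuming the maxshape/dtype keyword arguments of the current DataChunkIterator and a placeholder generator) is to declare the shape up front so that only the streaming axis is left open:

```python
import numpy as np
from hdmf.data_utils import DataChunkIterator  # import path is an assumption

def acquisition_stream():
    """Stand-in for a live stream; in the real streaming case no sample is
    available yet when this object is constructed."""
    for _ in range(1000):
        yield np.zeros(8)  # one timepoint, 8 channels

# Supplying maxshape and dtype up front fixes every dimension except the
# unlimited streaming axis, so they need not be inferred from the first chunk.
data = DataChunkIterator(data=acquisition_stream(),
                         maxshape=(None, 8),
                         dtype=np.dtype('float64'))
```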

Re 4: You are correct. I removed item 4 from the issue.

Re 5: I think that as long as we can factor out the actual write into a function that a user can override (or set), we would have taken a good step towards 5.

Re 6: We should chat tomorrow about how multiple writes work on the 'read' branch.

Re 7: I assume you are referring to https://docs.python.org/3/library/asyncio.html. I was not aware of this new feature in Python 3, but it looks like it might be useful for this. One thing to keep in mind is that write calls to HDF5 will likely have to be atomic, i.e., tasks execute asynchronously but we can't have write calls happen at the same time.

We should chat tomorrow to determine what we want to do now vs. later down the road.

ajtritt commented 7 years ago

Original comment by Andrew Tritt (Bitbucket: ajtritt, GitHub: ajtritt):


Re 1,3: These are mistakes/bugs that resulted from hastily written code.

Re 2: Dimension checking should be possible for all but one dimension.

Re 4: Because we are reading data that exists on disk, in NWB format, we will not allow reading iterators. We do not allow a mix of read and write on an open file handle.

Re 5: I agree this would be nice, but I do not believe it is currently in scope.

Re 6: Current work on the branch 'read' is being carried out to allow multiple "write calls" on a single NWB file. With this feature, users should be able to flush existing data.
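For context, the user-facing pattern this enables looks roughly like the following, using today's pynwb append mode (file name and contents are placeholders; at the time of this thread the feature was still being developed on the 'read' branch):

```python
from datetime import datetime, timezone
import numpy as np
from pynwb import NWBFile, NWBHDF5IO, TimeSeries

nwbfile = NWBFile(session_description='streaming demo',
                  identifier='demo-001',
                  session_start_time=datetime.now(timezone.utc))

# First write call: static metadata and any already-initialized datasets.
with NWBHDF5IO('stream.nwb', mode='w') as io:
    io.write(nwbfile)

# Later write call: reopen in append mode, add newly arrived data, write again.
# Data written in the first call is already safely on disk.
with NWBHDF5IO('stream.nwb', mode='a') as io:
    nwbfile = io.read()
    nwbfile.add_acquisition(TimeSeries(name='late_block',
                                       data=np.zeros(10),
                                       unit='V',
                                       rate=1.0))
    io.write(nwbfile)
```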

Re 7: This is on the mental roadmap. I think we can leverage the async execution and coroutine features that are new as of Python 3 to do this.

ajtritt commented 7 years ago

Original comment by Oliver Ruebel (Bitbucket: oruebel, GitHub: oruebel):


Just to avoid confusion: you may have noticed that I created a branch "streamingwrite" (from "dev") in which I have reworked the iterative write. The primary purpose of the changes on that branch, however, is to allow one to flexibly iterate over arbitrary dimensions and even arbitrary chunks. Those changes make the iterative write more general and customizable, but they do not address the items in this ticket. I'll describe the changes I've made on the branch in more detail when I open the pull request (probably some time tomorrow).

bendichter commented 6 years ago

@ajtritt close?