NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

New backend : Exdir #629

Open lepmik opened 5 years ago

lepmik commented 5 years ago

Hi!

I'm considering adding an additional backend, exdir, which should be one-to-one compatible with HDF5, so hopefully it won't be too much work. I spoke briefly about it with the NWB presenter at SfN 2017.

One issue, or question that comes up is how you determine attributes vs datasets. For example, session_description, session_start_time and file_create_date are stored as datasets; however, (file) source is an attribute. In my opinion all of these are attributes.

So, when adding a new backend, is it possible to choose what should be stored as datasets and attributes, or is this "set in stone"?
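
To make the question concrete, here is a minimal sketch of a directory-based backend in which a lookup table decides whether a field is written as an attribute or as a dataset. The layout (one `attributes.json` per group, one directory per dataset), the JSON encoding, and the names `AS_ATTRIBUTE` and `write_field` are all hypothetical; this is not the exdir or PyNWB API, and the field values are made-up examples.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical mapping: fields the backend stores as attributes.
# Every other field is written as its own dataset.
AS_ATTRIBUTE = {"source"}


def write_field(group_dir: Path, name: str, value) -> str:
    """Store `value` under `group_dir` and report how it was stored.

    Returns "attribute" or "dataset".
    """
    group_dir.mkdir(parents=True, exist_ok=True)
    if name in AS_ATTRIBUTE:
        # All attributes of a group share one small metadata file.
        attrs_file = group_dir / "attributes.json"
        attrs = json.loads(attrs_file.read_text()) if attrs_file.exists() else {}
        attrs[name] = value
        attrs_file.write_text(json.dumps(attrs))
        return "attribute"
    # A dataset gets its own directory and data file.
    dataset_dir = group_dir / name
    dataset_dir.mkdir(exist_ok=True)
    (dataset_dir / "data.json").write_text(json.dumps(value))
    return "dataset"


root = Path(tempfile.mkdtemp()) / "file.nwb"
kinds = {
    name: write_field(root, name, value)
    for name, value in {
        "source": "example rig",
        "session_description": "example session",
        "session_start_time": "2018-01-01T00:00:00",
    }.items()
}
```

Moving a field between the two storage forms only means changing `AS_ATTRIBUTE` here; whether that table belongs to the backend or to the schema is the open question.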

ajtritt commented 5 years ago

I don’t think they’re set in stone. Changing them to attributes wouldn’t affect PyNWB much, and I don’t think it would change the user-facing API.

I think, in general, we would be open to changing these if it would make an alternative backend easier to implement.

@oruebel, @bendichter what are your thoughts?

t-b commented 5 years ago

@ajtritt @lepmik The dataset vs attribute discussion needs to take some intricacies of HDF5 into account: you cannot compress attributes, you can only read them as a whole, and you cannot change them an unlimited number of times. See also https://github.com/NeurodataWithoutBorders/nwb-schema/issues/45#issue-253473032 for an earlier discussion on that topic.

oruebel commented 5 years ago

@lepmik does exdir support links, object references, and region references? I played around with implementing a ZARR-based backend at some point and while dropping in ZARR for h5py worked fine, dealing properly with links and datasets of object and region references requires effort since they are not natively supported by ZARR.

One issue, or question that comes up is how you determine attributes vs datasets.

From a backend perspective this is determined by the schema, i.e., this is a schema issue rather than being the issue of the backend.

See also https://github.com/NeurodataWithoutBorders/nwb-schema/issues/48 which discusses the proposed changes (and exceptions to the list you mentioned)

lepmik commented 5 years ago

@oruebel exdir does not support links; this is mainly to avoid cross-platform issues, e.g. symlinks on Windows vs Linux. However, object and region references should be trivial to implement as we are using .npy as dataset backend.

Are links absolutely necessary?

Sounds like we can edit the schema then to fit exdir more properly in terms of attributes vs datasets.

oruebel commented 5 years ago

Sounds like we can edit the schema then to fit exdir more properly

Changing the schema means changing the NWB format, and as such it affects everyone, not just exdir.

Are links absolutely necessary?

Ultimately a complete backend must implement all primitives of the specification language in some form, i.e., groups, attributes, datasets, links, all the various data types (including region/object references). Omitting a primitive means you cannot properly map the whole format.

However, object and region references should be trivial to implement as we are using .npy as dataset backend.

I'm not sure this is trivial, but ultimately an object reference is essentially a link stored as an element of a dataset. That is, if you can do object references, you should be able to implement links.
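
A minimal in-memory sketch of that relationship: a link is an object whose only content is a target, and an object reference is the same kind of target stored as an element of a dataset. The `store` dictionary, the paths, and the helpers `resolve` and `dereference` are hypothetical, and using path strings as the reference value is a simplifying assumption of this sketch, not how HDF5 stores references.

```python
# Toy hierarchy: every object lives at an absolute path.
store = {
    "/acquisition/ts1": {"kind": "dataset", "data": [1.0, 2.0, 3.0]},
    "/acquisition/ts2": {"kind": "dataset", "data": [4.0, 5.0]},
}

# A link is an object in the hierarchy that only holds a target path.
store["/processing/ts1_link"] = {"kind": "link", "target": "/acquisition/ts1"}

# An object reference is a target path stored as an element of a dataset.
store["/processing/refs"] = {
    "kind": "dataset",
    "data": ["/acquisition/ts1", "/acquisition/ts2"],
}


def resolve(path: str) -> dict:
    """Follow links until a non-link object is reached."""
    obj = store[path]
    while obj["kind"] == "link":
        obj = store[obj["target"]]
    return obj


def dereference(path: str, index: int) -> dict:
    """Resolve element `index` of a dataset of object references."""
    return resolve(store[path]["data"][index])
```

Both `resolve("/processing/ts1_link")` and `dereference("/processing/refs", 0)` end up at the same object, which is why a backend that can do one has most of what it needs for the other.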

ajtritt commented 5 years ago

Are links absolutely necessary?

Something like links is absolutely necessary. Links are used to keep data normalized without implicit relationships. You could treat references and links the same, given that they are similar in spirit. As @oruebel said, references are treated as data, whereas a link is treated as an object. References have the additional benefit of being "unbreakable". I can't think of instances where we rely on that distinction, but it might be something to keep in mind when adding this feature to exdir. We use references where it's more convenient to store the relationship as data, such as a column of a table or as an attribute.

One thing to keep in mind if you do go the route of treating links as references is how this will impact reading exdir data into the intermediate data translation layer. I suggest you read the overview of the NWB architecture for all the specifics. Briefly, any backend must read data into the Builder subtype that corresponds to the Spec subtype that the data instantiates. For example, if something is specified to be a link (i.e. is a LinkSpec), then to properly translate that into the user API (i.e. Container objects), it must be read in by the backend as a LinkBuilder.
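
As a rough illustration of that rule, the sketch below picks the builder type from the spec type rather than from how the backend happened to store the value. The classes are simplified stand-ins that only borrow the names used above; they are not the PyNWB classes, and `read_as_builder` is a hypothetical helper.

```python
from dataclasses import dataclass


# Stand-in classes: same names as in the thread, NOT the PyNWB implementations.
@dataclass
class DatasetSpec:
    name: str


@dataclass
class LinkSpec:
    name: str


@dataclass
class DatasetBuilder:
    name: str
    data: object


@dataclass
class LinkBuilder:
    name: str
    target: str


def read_as_builder(spec, raw):
    """Choose the builder from the spec type, not from the storage form of `raw`."""
    if isinstance(spec, LinkSpec):
        # Even if the backend stored the link as a reference-like value,
        # it has to surface as a LinkBuilder.
        return LinkBuilder(name=spec.name, target=raw)
    if isinstance(spec, DatasetSpec):
        return DatasetBuilder(name=spec.name, data=raw)
    raise TypeError(f"unsupported spec type: {type(spec).__name__}")
```

For example, `read_as_builder(LinkSpec("timestamps"), "/acquisition/ts1/timestamps")` yields a `LinkBuilder` even though the raw value is just a path string.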

t-b commented 5 years ago

@lepmik Is there a document which describes the mapping between the HDF5 types and the exdir types? Or alternatively a list of what is not supported in exdir?

lepmik commented 5 years ago

@t-b It's pretty much one to one, however there might be some slight differences in Dataset as can be seen here. You can find an overview of what's "missing" in terms of functionality with regards to h5py in the issues section. Our goal is to support everything from h5py, or at least as much as possible (feel free to contribute). You can also find more information in the paper.

bendichter commented 5 years ago

From the paper:

HDF5 has support for linking of objects, which is currently not part of the Exdir specification and will be added in the future.

So it seems the bridges to build are the ones already mentioned: links, object references, and region references

Another difference that may come up:

Finally, the reference implementation currently does not support parallel read/write operations on single objects. A future plugin is planned to provide such support.

HDF5 can do parallel operations on an individual dataset but not across datasets, so we've gone out of our way to design data structures accordingly, with data that could belong to multiple datasets concatenated into a single dataset (e.g. UnitTimes). Exdir has the opposite constraint: it can parallelize across datasets but not within a dataset. This isn't a critical problem but may be something to think about as we formulate an NWB primitive -> exdir mapping.

ajtritt commented 5 years ago

@bendichter I'm fairly certain HDF5 can do parallel operations across multiple datasets--you just don't get the performance benefits of collective operations. The restriction here is that metadata operations (i.e. creation of datasets) get done collectively. I would think that exdir would need to impose a similar constraint, since a metadata operation is a filesystem operation.

Also, to clarify when we say "parallel"--HDF5 parallelism is only enabled at the process level, not at the thread level. HDF5 can be made thread safe, but you lose concurrency.

As @bendichter said, I'm not sure these things are relevant in the context of storage primitives, but they should be considered if trying to develop a comparable backend replacement.

jakirkham commented 5 years ago

Had noted some discussion about using different backends here and noted Zarr came up, which we use and work on. As well as Exdir, which seems similar in some ways to Zarr. At the risk of bringing up a tangential discussion, am interested to discuss with Exdir about opportunities for our two communities to collaborate. Have raised issue ( https://github.com/zarr-developers/zarr/issues/334 ) for this purpose so as not to hijack this thread. 😄

bendichter commented 5 years ago

I just want to update the exdir team on some developments that came out just before the NWB 2.0 release. Previously, we noted that there were 3 outstanding types of data structures that we use and would need to be somehow implemented by an alternative backend: links, object references, and region references. I want to let you know that we recently moved away from using region references. Previously, they were used in 2 places: VectorIndex and DynamicTableRegion. Both of these data relationships are now instead accomplished with an integer dataset storing region indices with an attribute that contains a link to the target dataset. We do not anticipate using region references anymore.
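
A minimal sketch of the index-based pattern described above, using plain Python lists. It assumes that each index entry is the exclusive end offset of one row in the target dataset; that convention, the target path, the sample values, and the helper `get_row` are assumptions made for illustration.

```python
# Flat target dataset holding the values of all rows back to back,
# e.g. the spike times of several units concatenated.
target = [0.1, 0.5, 0.9, 0.2, 0.3, 0.4, 0.7]

# Integer index dataset. Assumed convention: entry i is the exclusive end
# offset of row i in `target`; row i starts where row i - 1 ended.
index = {
    "data": [3, 4, 7],
    "attributes": {"target": "/units/spike_times"},  # hypothetical path
}


def get_row(i: int) -> list:
    """Return row `i` of the ragged array described by `index`."""
    start = index["data"][i - 1] if i > 0 else 0
    stop = index["data"][i]
    return target[start:stop]


rows = [get_row(i) for i in range(len(index["data"]))]
```

Nothing here needs a region reference: a backend only has to store integers plus one pointer to the target dataset.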

The remaining data relationships that need to be developed are links and object references (datasets that are an array of links). It seems like links could be implemented with filesystem soft-links as noted here https://github.com/NeurodataWithoutBorders/pynwb/issues/300 so the big remaining challenge is implementing object references.
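
A minimal sketch of the soft-link idea using only the Python standard library. The directory layout and file names are hypothetical, not the exdir format, and creating symlinks may require extra privileges on Windows, which is the cross-platform concern raised earlier in the thread.

```python
import os
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# A dataset is a directory with a data file in it (hypothetical layout).
target = root / "acquisition" / "ts1"
target.mkdir(parents=True)
(target / "data.txt").write_text("1.0 2.0 3.0")

# The link is a relative symlink, so the tree stays valid if `root` is moved.
link = root / "processing" / "ts1_link"
link.parent.mkdir()
os.symlink(os.path.relpath(target, link.parent), link)

# Reading through the link reaches the target dataset's data.
linked_data = (link / "data.txt").read_text()
```

A dataset of object references could then be an array of such relative paths, at the cost of references that break when a target is renamed.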

oruebel commented 5 years ago

Just to avoid possible confusion, it is correct that the NWB:N 2.0 core schema does not use region references any more. However, region references are still supported, i.e., users can still use them in extensions. In short, to support NWB:N 2.0, region references are not critical (only links and object references) but for full support it will still be useful to support region references as well. So at least for a first go at integrating exdir, focusing on supporting links and datasets of object references will be an excellent start. Hopefully this will help to simplify the problem for integrating exdir.

bendichter commented 5 years ago

Yes, thanks for the clarification. Region references have been removed from the core schema but have not been officially removed as a supported type, so they could be used by extensions (though I am not aware of any extensions that use them).