CINPLA / exdir

Directory structure standard for experimental pipelines.
http://exdir.rtfd.io
MIT License
72 stars 13 forks source link

support for json as a way to store attributes and metadata #111

Open mivanit opened 3 years ago

mivanit commented 3 years ago

yaml is a great format, and human readable, but is difficult to write a parser for, it would be very useful to be able to emit the attributes and metadata as json instead.

My need for this is because I am switching from HDF5 to exdir in a project of mine that uses both C++ and python, primarily because HDF5 is rather difficult to build in C++. The only up-to-date YAML library is unfortunately also rather complicated to build. In contrast, there are minimal header-only libraries for reading both .json and .npy files for C++. I am sure there are a number of other languages it might be easier to make exdir libraries for if json was an option.

Since exdir does not (to my knowledge) directly use the features differentiating YAML from JSON (comments, anchors/aliases, tags) it seems fairly easy to just replace the appropriate methods with those from the json module in the standard library.

I've created a quick and dirty implementation here: https://github.com/knc-neural-calculus/exdir/tree/json-metadata-attrs

Currently, the utils.serialize.MODE variable must be modified and then utils.serialize.refresh() called for json to be used, which is a very messy way to do it. I have not yet updated the tests, but it works fairly well for my current project. If this feature is something worth including in exdir, I would be more than happy to adjust my implementation to make it suitable for inclusion

dragly commented 3 years ago

Thanks for reaching out! First of all, have you considered using Zarr instead of Exdir? It is a bit more complex than Exdir, but might meet your needs out of the box: https://zarr.readthedocs.io/en/stable/

If Zarr is not an option for you, I am happy to discuss your proposal further. In fact, I have been contemplating whether we should move to JSON instead of YAML myself. While YAML has the benefit of being more human-readable, it does also have its drawbacks in this regard. Especially when the files get large and complex (figuring out the level of nesting based on indentation can be a real pain, for instance).

However, such a move requires some serious consideration regarding the transition.

The brute-force approach is to just switch to JSON and drop YAML altogether. This is not backwards-compatible and would at least require a major version bump. I would even consider it reasonable to rename the project to avoid problems for users who expect Exdir to be backwards-compatible.

Another approach is to support both YAML and JSON, and provide a conversion function to upgrade existing datasets. We could throw a deprecation warning and instruct the user to run the conversion function when encountering old files, while still reading them and writing to YAML inside them to retain backwards compatibility.

The third option is to provide the user with the option of picking YAML or JSON. This is basically what you have implemented in your proposal. I am a bit reluctant to this option, as it would mean that the standard is now split and every library needs to decide which version to support. The fact that the datasets in the wild could be different might also confuse users (and library authors).

mivanit commented 3 years ago

Thanks for the thoughtful response!

Unfortunately, zarr does not appear to have a C/C++ implementation, and implementing a reader looks like a very significant project that I am not willing to undertake. exdir appeals to me since it uses only existing formats (yaml, numpy array storage) and is thus easier to implement.

I agree that supporting both YAML and JSON is not an ideal option, whether the choice of format is exposed to users or not.

Maybe the best idea is a hard fork? I could provide a conversion function from exdir into this new format, and maybe this could be a chance to make other changes to the format.

I'm happy to lend a hand with whatever path you choose for exdir. I'll be maintaining my own fork for my own projects in the meantime. Let me know if you have a good name in mind!