NeuralEnsemble / elephant

Elephant is the Electrophysiology Analysis Toolkit
http://www.python-elephant.org
BSD 3-Clause "New" or "Revised" License

[Feature] Recommended workflow for storing results returned as `numpy.ndarray`? #544

Status: Open

joschaschmiedt commented 1 year ago

Several functions that operate on AnalogSignal data return plain numpy.ndarrays, e.g. the spectral and correlation measures.

Most users will probably want to store the results in some way. For the AnalogSignal outputs, Neo provides easy-to-use saving to disk. For the spectral and correlation measures, however, Neo does not offer this (yet).
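For concreteness, a minimal sketch of this asymmetry (assuming `elephant.spectral.welch_psd` and Neo's `NixIO`, which requires the `nixio` package; exact signatures may vary between versions):

```python
import numpy as np
import quantities as pq
import neo
from neo.io import NixIO
from elephant.spectral import welch_psd

# A toy AnalogSignal: 1 s of noise sampled at 1 kHz
signal = neo.AnalogSignal(np.random.randn(1000, 1),
                          units='mV', sampling_rate=1 * pq.kHz)

# The input signal has a ready-made path to disk via Neo's IO classes ...
block = neo.Block()
segment = neo.Segment()
segment.analogsignals.append(signal)
block.segments.append(segment)
io = NixIO('raw_signal.nix', mode='ow')
io.write_block(block)
io.close()

# ... but the spectral result is just a pair of arrays with no
# comparable save/load mechanism
freqs, psd = welch_psd(signal)
print(type(psd))  # an array, not a Neo object
```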

Are there any best practices for handling these "pure" results, or plans to implement a Spectrum data model, including I/O?

mdenker commented 1 year ago

Hi Joscha, thanks for your input. Indeed, the return types of analysis functions are a hot topic on our agenda. For some analysis functions, we used Neo objects as return types for precisely the reasons you mention. However, this approach quickly reaches its limits. For example, a time histogram could be interpreted as an analog signal, but in a way it is more than just that: it has a concept of bin width, for instance.

Therefore, we are planning to move to an alternative representation: something like Neo, but for analysis results rather than input data. The idea is that a minimal number of objects can represent the analysis results, key metadata, and additional information such as Neo-style annotations, and of course support serialization to disk (maybe even the option to temporarily dump objects to disk, similar to Neo's lazy loading, to deal with large analysis results). These objects would not become part of Neo, since structurally they would not fit; however, it is possible to draw links between the tools nevertheless. You can find an early prototype of how this could look for a TimeHistogram object here: https://github.com/INM-6/elephant/blob/feature/basic_provenance/elephant/buffalo/objects/histogram.py
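For illustration only (this is not the linked prototype, just a hypothetical sketch of the idea): such an object would bundle the raw result with the metadata that a bare ndarray cannot carry, e.g. the bin width of a time histogram:

```python
from dataclasses import dataclass, field
import numpy as np
import quantities as pq

@dataclass
class TimeHistogram:
    # Hypothetical result object: raw counts plus the metadata
    # (bin width, start time, annotations) that a bare ndarray lacks.
    counts: np.ndarray
    bin_width: pq.Quantity
    t_start: pq.Quantity
    annotations: dict = field(default_factory=dict)

    @property
    def bin_edges(self):
        # Reconstruct the time axis from the stored metadata
        n_bins = len(self.counts)
        return self.t_start + np.arange(n_bins + 1) * self.bin_width
```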

Implementing such objects would further simplify interoperability with a companion project, alpaca (first release pending within the next weeks, https://alpaca-prov.readthedocs.io/en/latest/), which captures the provenance of an analysis workflow. We had prioritized this work on provenance over the data objects; however, we are confident that the data objects will be on the agenda this year (together with a new object to represent an experimental trial, which is what we are currently working on).

I hope this goes in the direction of what you are thinking. Of course, we are very open to any ideas, suggestions, and contributions on this topic!

joschaschmiedt commented 1 year ago

Hi @mdenker, great to hear that there are some plans for this. I agree that this is probably out-of-scope for Neo.

In general I like the direction that the AnalysisObject is taking, and alpaca looks very interesting!

As data analysis in electrophysiology is often exploratory and constantly changing, it may already be a great help to offer something a little more rigid than saving a complete workspace in MATLAB, but not too much more. Often an analysis result is not much more than a couple of numpy arrays plus metadata, which could be stored in simple, future-proof formats such as JSON and (flat) HDF5 or NPY. If I understood it correctly, alpaca is basically already almost doing that, except for serializing the arrays. Correct?
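As a sketch of that idea (hypothetical file and field names; `h5py` used for the flat-HDF5 variant): the arrays go into a flat file and the metadata rides along as attributes:

```python
import h5py
import numpy as np

# Hypothetical analysis result: a couple of arrays plus metadata
freqs = np.linspace(0, 500, 129)
psd = np.random.rand(129)

with h5py.File('psd_result.h5', 'w') as f:
    # Flat layout: datasets at the root, metadata as attributes
    f['freqs'] = freqs
    f['psd'] = psd
    f.attrs['method'] = 'welch_psd'
    f.attrs['units'] = 'mV**2/Hz'
```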

From an architectural point of view, I'm not sure each analysis method needs to implement its own result class inheriting from AnalysisObject. This may be useful for achieving forward compatibility of stored analysis results, but I'm not sure that's achievable or necessary. Instead, AnalysisObject could be treated as a flexible container that (auto-magically) stores as much metadata as possible together with the arrays, and serializes results using simple, flat data formats.

joschaschmiedt commented 1 year ago

Thinking about it, maybe a dataclass, which tells the user and developer what attributes should be there, in combination with a metadata-enhanced serializer, would be robust enough. The serializer could iterate over the dataclass fields and store all numpy.ndarrays in binary form (HDF5/NPY) and everything else as plain text (JSON/...).
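A minimal sketch of such a serializer (all names hypothetical; NPZ stands in for the binary container, JSON for the text side):

```python
import json
from dataclasses import dataclass, field, fields
import numpy as np

@dataclass
class SpectrumResult:
    # The dataclass declares which attributes a result of this
    # kind is expected to have.
    freqs: np.ndarray
    psd: np.ndarray
    method: str = 'welch_psd'
    params: dict = field(default_factory=dict)

def save_result(result, basename):
    # Iterate over the dataclass fields, routing ndarrays to a
    # binary .npz file and everything else to a JSON sidecar.
    arrays, rest = {}, {}
    for f in fields(result):
        value = getattr(result, f.name)
        (arrays if isinstance(value, np.ndarray) else rest)[f.name] = value
    np.savez(basename + '.npz', **arrays)
    with open(basename + '.json', 'w') as fp:
        json.dump(rest, fp, indent=2)

result = SpectrumResult(freqs=np.arange(10.0), psd=np.random.rand(10))
save_result(result, 'spectrum')
```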

Edit: I stumbled upon https://github.com/lidatong/dataclasses-json, which may be useful in this context.

mdenker commented 1 year ago

Hi, and thanks for all your great comments, ideas, and suggestions. I agree that your idea of a generic AnalysisObject-type container that "always works" is a very interesting concept that could already help a lot. At the same time, it may still be beneficial to have more specialized (e.g., subclassed) objects that describe certain recurring types of analysis results in greater depth and help, in the long run, with interoperability and clarity of code. I think both concepts could work well together.
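A compact sketch of how the two concepts might coexist (hypothetical class names, not Elephant API):

```python
class AnalysisObject:
    # Generic container that "always works": arbitrary arrays
    # and metadata, no assumptions about structure.
    def __init__(self, **data):
        self.data = data
        self.annotations = {}

class PowerSpectrum(AnalysisObject):
    # Specialized subclass: fixes the expected structure for a
    # recurring result type, aiding interoperability.
    def __init__(self, freqs, psd, **annotations):
        super().__init__(freqs=freqs, psd=psd)
        self.annotations.update(annotations)
```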

(Regarding alpaca: it is aimed merely at tracking the provenance and data flow of inputs and outputs during a script execution, and does not get involved with the structure or serialization of the data as such. However, the two approaches can be seen as synergistic in this discussion.)