Closed — rproepp closed this 8 months ago
This wouldn't necessarily be something supported inside the algorithms. It could very well be up to the user to load the data and metadata (although it could be made very easy, such as having a method to copy the properties and annotations of a class).
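The "method to copy the properties and annotations" idea could be sketched roughly as follows. This is a minimal illustration using stand-in classes rather than real neo objects; the helper name `copy_metadata` and the attribute list are hypothetical, not part of neo's API:

```python
# Hypothetical helper: copy the annotations dict and a fixed set of
# metadata properties from a source object (e.g. a SpikeTrain) onto a
# result object. Attribute names mirror common neo properties.
METADATA_ATTRS = ("t_start", "t_stop", "units", "name")

def copy_metadata(source, target, attrs=METADATA_ATTRS):
    """Copy annotations and selected properties from source to target."""
    target.annotations = dict(getattr(source, "annotations", {}))
    for attr in attrs:
        if hasattr(source, attr):
            setattr(target, attr, getattr(source, attr))
    return target

# Minimal stand-ins to demonstrate the pattern without importing neo:
class Source:
    def __init__(self):
        self.annotations = {"electrode": 3}
        self.t_start = 0.0
        self.t_stop = 10.0

class Result:
    pass

res = copy_metadata(Source(), Result())
print(res.annotations)  # {'electrode': 3}
```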
Carrying around the original objects doesn't strike me as a very efficient approach. If I have computed the average firing rate of a spike train, I get huge memory (and, if I save intermediate results, storage space) savings if I can just abandon the original data. But neo doesn't really have a class that would be suitable for saving a single rate value together with SpikeTrain metadata.
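The kind of lightweight container described here (which neo lacks) could look something like this. The class name and fields are purely illustrative, not an existing or proposed neo class:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a single scalar result that carries along
# SpikeTrain-style metadata, so the original data can be discarded.
@dataclass
class ScalarResult:
    value: float                 # e.g. average firing rate in Hz
    t_start: float = 0.0         # metadata copied from the spike train
    t_stop: float = 0.0
    annotations: dict = field(default_factory=dict)

rate = ScalarResult(value=12.5, t_start=0.0, t_stop=60.0,
                    annotations={"unit_id": 7})
print(rate.value)  # 12.5
```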
Ok, I thought you were talking about supporting it in the analysis functions. But I am also skeptical that a class for keeping results and metadata in the absence of the original data object would be useful.
For the example of a spike train, there are various kinds of metadata that might be of interest. First, there are the annotations: those are just a dictionary, easy to obtain, and there is not much else to be done with them. Then there are properties of the class that can be considered metadata, such as t_start etc. Gathering all of them is a small amount of work, so a standard way to get and store them could be useful. And then there is the context: it is often more interesting to know what unit or segment a spike train belongs to, or what their metadata is, than the data on the SpikeTrain object itself.
Supporting the context metadata without the original containers becomes complex quickly. Storing all of it also takes quite a bit of space, many times more than the average rate, for example, so users might want fine-grained control over what to include. That further increases the complexity and reduces the convenience advantage over just doing it manually.
Then there are analyses that operate on multiple objects, possibly of different types. And the result itself can be pretty much any type as you said, so I don't see an advantage of encapsulating that part, either.
We have had several discussions about this issue, and I think most here would agree that metadata management and retaining provenance information in a central container is a very difficult problem to tackle this early on. Besides the complexity of metadata on the original data objects, as outlined by rproepp, the return types and their semantic meaning can vary a lot between analyses. Some of the more advanced routines will not be able to fit their data into the neo framework and may produce quite complex outputs. I believe keeping a complete trail of such information across several chained routines is very difficult. I would suggest postponing this topic until more routines have accumulated, to better estimate whether (and how) their output can be managed.
On a much lower level, though, it could be worthwhile to make it common practice to at least annotate data that is output as a neo object with useful information from the analysis, e.g., when filtering a signal, to add the filter parameters as annotations to the resulting AnalogSignal. That would be a very small change, but already a big step forward towards better workflows.
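The suggested practice could look roughly like this. A stand-in class is used here instead of a real neo.AnalogSignal so the sketch is self-contained; neo data objects expose an analogous `annotate(**kwargs)` that updates their `annotations` dict. The function `lowpass_filter` and its parameters are hypothetical:

```python
# Stand-in mimicking the annotation interface of neo data objects.
class AnalogSignalStub:
    def __init__(self, data):
        self.data = data
        self.annotations = {}

    def annotate(self, **kwargs):
        self.annotations.update(kwargs)

def lowpass_filter(signal, cutoff_hz):
    # Actual filtering omitted; only the annotation pattern matters here.
    out = AnalogSignalStub(signal.data)
    # Record the analysis parameters on the output object itself.
    out.annotate(filter_type="lowpass", cutoff_hz=cutoff_hz,
                 source="lowpass_filter")
    return out

filtered = lowpass_filter(AnalogSignalStub([1.0, 2.0, 3.0]), cutoff_hz=300)
print(filtered.annotations)
```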
For provenance tracking see Alpaca (Automated Lightweight Provenance Capture):
Documentation: https://alpaca-prov.readthedocs.io/en/latest/
I think this warrants its own issue:
@toddrjen wrote in #11:
I think this verges into overkill territory :-) For most results (like the average rate of a spike train), the caller knows exactly from what object the result has been calculated. The caller also knows if and what metadata is needed, while our analysis function doesn't, so I would leave the responsibility upstream.
However, there might be analyses where this information is not available to the caller. For example, an analysis that takes a number of objects but only uses some of them based on their content. I don't know if we will have such functions; I would try to avoid it, but it might be necessary for some algorithms. In that case, I would return provenance information to the caller: report which objects were actually used. By linking results to the actual objects used in their creation, all metadata remains available and we do not need to create new result types, with all the complications that come with them.
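The "return which objects were actually used" idea could be sketched like this, using plain dicts as stand-ins for spike trains. The function name, the selection criterion, and the fields are all illustrative:

```python
# Hypothetical analysis that selects inputs by content and returns the
# result together with the inputs it actually used, so the caller can
# recover all metadata from the original objects.
def mean_rate_of_long_trains(spiketrains, min_spikes=10):
    used = [st for st in spiketrains if len(st["times"]) >= min_spikes]
    n_spikes = sum(len(st["times"]) for st in used)
    duration = sum(st["t_stop"] - st["t_start"] for st in used)
    rate = n_spikes / duration if duration else 0.0
    return rate, used  # result plus the objects it was derived from

trains = [
    {"times": list(range(12)), "t_start": 0.0, "t_stop": 6.0},
    {"times": [1, 2], "t_start": 0.0, "t_stop": 6.0},  # too few spikes
]
rate, used = mean_rate_of_long_trains(trains)
print(rate, len(used))  # 2.0 1
```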