Closed — rproepp closed this 8 months ago
This wouldn't necessarily be something supported inside the algorithms. It could very well be up to the user to load the data and metadata (although it could be made very easy, such as having a method to copy the properties and annotations of a class).
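The "method to copy the properties and annotations" idea could be sketched roughly as follows. This is a minimal illustration using stand-in classes rather than real neo objects; the helper name `copy_metadata` and the attribute list are hypothetical, not part of neo's API:

```python
# Hypothetical helper: copy the annotations dict and a fixed set of
# metadata properties from a source object (e.g. a SpikeTrain) onto a
# result object. Attribute names mirror common neo properties.
METADATA_ATTRS = ("t_start", "t_stop", "units", "name")

def copy_metadata(source, target, attrs=METADATA_ATTRS):
    """Copy annotations and selected properties from source to target."""
    target.annotations = dict(getattr(source, "annotations", {}))
    for attr in attrs:
        if hasattr(source, attr):
            setattr(target, attr, getattr(source, attr))
    return target

# Minimal stand-ins to demonstrate the pattern without importing neo:
class Source:
    def __init__(self):
        self.annotations = {"electrode": 3}
        self.t_start = 0.0
        self.t_stop = 10.0

class Result:
    pass

res = copy_metadata(Source(), Result())
print(res.annotations)  # {'electrode': 3}
```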
Carrying around the original objects doesn't strike me as a very efficient approach. If I have computed the average firing rate of a spike train, I get huge memory (and, if I save intermediate results, storage space) savings if I can just abandon the original data. But neo doesn't really have a class that would be suitable for saving a single rate value together with SpikeTrain metadata.
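The kind of lightweight container described here (which neo lacks) could look something like this. The class name and fields are purely illustrative, not an existing or proposed neo class:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a single scalar result that carries along
# SpikeTrain-style metadata, so the original data can be discarded.
@dataclass
class ScalarResult:
    value: float                 # e.g. average firing rate in Hz
    t_start: float = 0.0         # metadata copied from the spike train
    t_stop: float = 0.0
    annotations: dict = field(default_factory=dict)

rate = ScalarResult(value=12.5, t_start=0.0, t_stop=60.0,
                    annotations={"unit_id": 7})
print(rate.value)  # 12.5
```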
Ok, I thought you were talking about supporting it in the analysis functions. But I am also skeptical that a class for keeping results and metadata in the absence of the original data object would be useful.
For the example of a spike train, there are various kinds of metadata that might be of interest. First, there are the annotations: those are just a dictionary, easy to obtain, and there is not much else to be done with them. Then there are properties of the class that can be considered metadata, such as t_start etc. Gathering all of them is a small amount of work, so a standard way to get and store them could be useful. And then there is the context: it is often more interesting to know what unit or segment a spike train belongs to, or what their metadata is, than the data on the SpikeTrain object itself.
Supporting the context metadata without the original containers becomes complex quickly. Storing all of it also takes quite a bit of space, many times more than the average rate, for example, so users might want fine-grained control over what to include. That further increases the complexity and reduces the convenience advantage over just doing it manually.
Then there are analyses that operate on multiple objects, possibly of different types. And the result itself can be pretty much any type as you said, so I don't see an advantage of encapsulating that part, either.
We have had several discussions about this issue, and I think most here would agree that metadata management and retaining provenance information in a central container is a very difficult problem to tackle this early on. Besides the complexity of metadata on the original data objects, as outlined by rproepp, the return types and their semantic meaning can vary a lot between analyses. Some of the more advanced routines will not be able to fit their data into the neo framework and may produce quite complex outputs. I believe keeping a complete trail of such information across several chained routines is very difficult. I would suggest postponing this topic until more routines have accumulated, to better estimate whether (and how) their output can be managed.
On a much lower level, though, it could be worthwhile to make it common practice to at least annotate data that is output as a neo object with useful information from the analysis, e.g., when filtering a signal, to add the filter parameters as annotations to the resulting AnalogSignal. That would be a very small change, but already a big step forward towards better workflows.
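The suggested practice could look roughly like this. A stand-in class is used here instead of a real neo.AnalogSignal so the sketch is self-contained; neo data objects expose an analogous `annotate(**kwargs)` that updates their `annotations` dict. The function `lowpass_filter` and its parameters are hypothetical:

```python
# Stand-in mimicking the annotation interface of neo data objects.
class AnalogSignalStub:
    def __init__(self, data):
        self.data = data
        self.annotations = {}

    def annotate(self, **kwargs):
        self.annotations.update(kwargs)

def lowpass_filter(signal, cutoff_hz):
    # Actual filtering omitted; only the annotation pattern matters here.
    out = AnalogSignalStub(signal.data)
    # Record the analysis parameters on the output object itself.
    out.annotate(filter_type="lowpass", cutoff_hz=cutoff_hz,
                 source="lowpass_filter")
    return out

filtered = lowpass_filter(AnalogSignalStub([1.0, 2.0, 3.0]), cutoff_hz=300)
print(filtered.annotations)
```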
For provenance tracking see Alpaca (Automated Lightweight Provenance Capture):
Documentation: https://alpaca-prov.readthedocs.io/en/latest/
I think this warrants its own issue:
@toddrjen wrote in #11:
I think this verges into overkill territory :-) For most results (like the average rate of a spike train), the caller knows exactly from what object the result has been calculated. The caller also knows if and what metadata is needed, while our analysis function doesn't, so I would leave the responsibility upstream.
However, there might be analyses where this information is not available to the caller. For example, an analysis that takes a number of objects but only uses some of them based on their content. I don't know if we will have such functions; I would try to avoid it, but it might be necessary for some algorithms. In that case, I would return provenance information to the caller: report which objects were actually used. By linking results to the actual objects used in their creation, all metadata remains available and we do not need to create new result types, with all the complications that come with them.
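The "return which objects were actually used" idea could be sketched like this, using plain dicts as stand-ins for spike trains. The function name, the selection criterion, and the fields are all illustrative:

```python
# Hypothetical analysis that selects inputs by content and returns the
# result together with the inputs it actually used, so the caller can
# recover all metadata from the original objects.
def mean_rate_of_long_trains(spiketrains, min_spikes=10):
    used = [st for st in spiketrains if len(st["times"]) >= min_spikes]
    n_spikes = sum(len(st["times"]) for st in used)
    duration = sum(st["t_stop"] - st["t_start"] for st in used)
    rate = n_spikes / duration if duration else 0.0
    return rate, used  # result plus the objects it was derived from

trains = [
    {"times": list(range(12)), "t_start": 0.0, "t_stop": 6.0},
    {"times": [1, 2], "t_start": 0.0, "t_stop": 6.0},  # too few spikes
]
rate, used = mean_rate_of_long_trains(trains)
print(rate, len(used))  # 2.0 1
```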