Improve information in `derived_from` attribute

DaniBodor commented 4 months ago

Currently, the derived_from attribute only shows the trace-back of the previous objects used and printing it specifically only shows their types and labels. At this point, those labels are not changing in many cases, meaning that reading the trace-back is not that meaningful. To figure out what exactly happened in those steps, you'd have to dig into each object (and even then, I'm not positive it's always possible to determine). For example, I got this response after reproducing the commands from a presentation by @psomhorst see here.

[EITData(label='raw'),
 ContinuousData(label='global_impedance_(raw)'),
 ContinuousData(label='global_impedance_(raw)')]

I think it would be good to make derived_from a list[tuple[str, str]], where the first string is the same as above, and the second some label for the mutation. Or maybe even a list[dict[str, Any]], where the dict holds the previous object, the mutation and all the parameters for that mutation.

psomhorst commented 4 months ago

I see the issue, but I don't think your solution is going to improve the situation much.

Currently, derived_from is a list[Any]. The items in the list are the actual objects that the newer object was derived from. The two ContinuousData objects you see are not the same object. Rather, their string representation is the same, because the label is the same.

I wonder what the use of a 'label for the mutation' would be. Is that a randomly generated label? Is it a label given by the user? If so, are these helping or complication things for the end user? Or is it given by a function? Would it then not be better to just add the name/reference to the function that created the object, instead of a label?

E.g.:

[
    (EITData(...), loading.load_eit_data),
    (ContinuousData(...), EITData.calculate_global_impedance),
    (ContinuousData(...), ContinuousData.select_by_time)
]

This theoretically should enable you to regenerate the object, if you would have the parameters that were passed to select_by_time.

I think the real problem that the objects can't really be identified in their string representation. There should be other solutions for that.

we could add the time range (e.g. ContinuousData(label="global_impedance_(raw)", t=[1020.2:5102.1])) to distinguish between original and sliced objects;
we could add the object identifier (e.g. ContinuousData(4372398320, label="global_impedance_(raw)")) emphasising that these are different objects.
we could add the function name that created the object (see above).

This begs the question: how useful is this going to be, and how much detail do you need? I think retracing steps is useful, but reconstruction of every step less so. I'd rather have a forward-plan (i.e. a description of the entire workflow from loading to exporting that can be applied to e.g. a different dataset) than a backward-plan (i.e. a way to retrace steps back to loading from a given result).

DaniBodor commented 4 months ago

This begs the question: how useful is this going to be, and how much detail do you need? I think retracing steps is useful, but reconstruction of every step less so. I'd rather have a forward-plan (i.e. a description of the entire workflow from loading to exporting that can be applied to e.g. a different dataset) than a backward-plan (i.e. a way to retrace steps back to loading from a given result).

In my mind, these are ultimately the same thing. As you are doing each step one at a time, if you store exactly what function with which parameter was used, then you could export this and use it as a "forward-plan" on your next dataset. Obviously there may be redundant or unnecessary steps in there, but these can be pruned if needed. The advantage is that you don't need to decide on every step before creating your "forward-plan". You can play around with the data until it looks how you want it to look, and now from the final object you can see exactly what you did, even if you did not keep track of each mutation.

Ultimately, it would look something like this.

[
    (EITData(...)", "", {}),
    (ContinuousData(...), select_by_time, {"start_time": 1020.2, "end_time": 5102.1}),
    (ContinuousData(...), apply_filter, {"type": LowPassFilter, "order": 10, "etc": ...}),
]

I realize that it's a lot of work to set this up, and might ultimately not be worthwhile or even possible for certain mutations. But if it does work, we basically have a setup to allow for automatic exporting of workflows, by just taking all steps minus the instances they were performed on.

Either way, I am not sure what purpose the current derived_from attribute serves.

psomhorst commented 4 months ago

In my mind, these are ultimately the same thing. As you are doing each step one at a time, if you store exactly what function with which parameter was used, then you could export this and use it as a "forward-plan" on your next dataset. Obviously there may be redundant or unnecessary steps in there, but these can be pruned if needed. The advantage is that you don't need to decide on every step before creating your "forward-plan". You can play around with the data until it looks how you want it to look, and now from the final object you can see exactly what you did, even if you did not keep track of each mutation.

I think this would be useful for a GUI/interactive application. When working in code, the end result is always determined by the script you're using, which is (easily reconstructed into) a forward plan. But if you're working in the GUI, this could be useful indeed.

I realize that it's a lot of work to set this up, and might ultimately not be worthwhile or even possible for certain mutations. But if it does work, we basically have a setup to allow for automatic exporting of workflows, by just taking all steps minus the instances they were performed on.

As long as functions are deterministic (all (?) of the current ones are), they should be possible for all mutations. Maybe it becomes a more complex tree, though.

Either way, I am not sure what purpose the current derived_from attribute serves.

The current derived_from serves the exact same purpose, but in a less detailed way. It is a way to trace back where your data came from. I you compare the two ContinuousData objects (not just the string representation, but the actual objects) is it easy to find out that the second is shorter, and to conclude it must be a slice of the first. The original idea was that with the last item in derived_from and the parameters attribute, you should be able to reconstruct the current item. I don't know whether all functions currently properly set up parameters, though.

DaniBodor commented 4 months ago

The original idea was that with the last item in derived_from and the parameters attribute, you should be able to reconstruct the current item. I don't know whether all functions currently properly set up parameters, though.

I guess this boils down to the same thing as I am suggesting, except to place parameters into the same object as derived_from (although that's a detail). It doesn't currently work though, in the notebook I linked filtered_gi.parameters returns an empty dictionary.

EIT-ALIVE / eitprocessing

Improve information in `derived_from` attribute #270