Child datasets and Caching in Pipeline

Hovden commented 7 years ago

I just wanted to follow up regarding the child datasets.

The primary intended functionality of child datasets was to cache the pipeline, not create a new branch. Branching is an added benefit but not something we need to make full featured—it is OK if modifications to a parent pipeline rerun entirely.

Using child datasets as a means to cache is important. So there should be a way to "create a child". This could just be a simple Python bit in the "Data Transforms" menu called "Create Child". Having it as a python transform would allow it to serve as example code for new programmers. The small downside is it creates an extra line in the pipeline.

It would be nice if "Create Child" transform was also accessible by right clicking on the dataset in the pipeline. By creating a child, users have effectively cached the transforms leading up to it.

cryos commented 7 years ago

So, the point I was attempting to make earlier was that if we have a child dataset then it could be marked as such. If we make the assumption that a child dataset producing operation does not modify its parent then I think we can effectively mark it as frozen, use an :iceland: icon to denote that, and avoid rerunning that.

Alternately we could probably create a special operation object that effectively pins it to that point. By creating a special operation we can know for certain that it does not have any interaction, and so can be frozen without risk. You could then reconstruct that volume rather than the result of the linear pipeline.

Hovden commented 7 years ago

I think we have added more functionality to child datasets than originally intended—we need simply to create marks (e.g. children) to prevent re-runing the whole pipeline. A child pipeline should rerun only the things after the beginning of its creation. A child doesn't need to see the rest of the parent's pipeline. Having many child datasets, and allowing more modifications to the parent, etc are extra features—they are also solvable with cloning.

yijiang1 commented 7 years ago

it is OK if modifications to a parent pipeline rerun entirely.

I don't think it's OK to always rerun tomographic reconstructions that take a lot time in practice.

Having many child datasets, and allowing more modifications to the parent, etc are extra features—they are also solvable with cloning.

In this case, I'd suggest we clone reconstructions by default instead of making them child datasets.

yijiang1 commented 7 years ago

If we make the assumption that a child dataset producing operation does not modify its parent

What would happen if a child dataset producing operation modifies its parent? The result is under child data's pipeline and I don't see how it can interfere with parent's transforms.

mathturtle commented 7 years ago

@Hovden I thought the original point of having a child dataset was to produce a dataset of a different shape (i.e. reconstruction) without destroying the original (parent) data so that the user can continue to work with both. Thus having multiple child datasets was the point of the feature. Label maps could just as easily have been extra arrays on the existing image data, so they didn't need to be child datasets.

The child doesn't see the rest of its parent's pipeline, just the pipeline needed to make it. Here is the issue. We have data A, we run it though a pipeline to produce data B, then another to produce C and another to produce D. Then we run a reconstruction on D. But at each step we throw away the old data so all we have is A (we can re-read the file) and D. Then we do a reconstruction on D to produce child dataset R. Technically this also produces dataset E and throws D away (but all of our operators that produce child data don't modify the original, so D = E). However there is no reason this has to be the case and nothing prevents the operator from modifying the output in the parent pipeline as well as creating a child dataset. Then you adds another operator that modifies E to produce F and throws away E. If you want to re-run this operator to produce a slightly different F, you need to get E again. But you only have F, R, and A. So you have to re-run the pipeline from A to get E again so that this new operator can be re-run.

cryos commented 7 years ago

I think introducing a simpler node that can largely reuse most of what was done, but that is created in a default detached state is reasonable. I think marking it as a hard caching node would be reasonable, and we can make it a special type to ensure no unexpected interactions are possible. I agree on the need for the simpler user case, and it was subsumed by a need to more fully support label maps for segmentation.

I think we can do this in a reasonable amount of development time by keeping it simple, and offering a simple interaction mechanism. The thing that is harder, and may take more time to solve, is saving out pipeline elements like this to support restoring a state file with these elements. Our state file currently saves state but no data, and we must extend it to save all data if we hope to fully restore a richer pipeline such as this.

mathturtle commented 7 years ago

Picture hopefully helps: pipeline_test

Only the letters inside boxes are stored. And there is no reason that the operator that produces R has to pass D through to E unmodified. I don't have a use case that would do that... but it seems like an artificial limitation to put on the operator.

cryos commented 7 years ago

@mathturtle not sure who you are replying to, like I said creating a special node for caching would make the guarantee that nothing is modified but simply cached, and therefore make it simple to simply skip it as a no-op. This is the simpler case that is missing, and would not significantly complicate the pipeline (but it does make it more necessary to save data from the pipeline to support fully restorable pipelines as there is no guarantee we could get back to that state by simply regenerating the data).

I am not aware of a real operator that both produces child data and creates a child though (and that may be the truly exceptional case).

yijiang1 commented 7 years ago

Ok now I understand the problem thanks to @mathturtle 's example.

I think freezing the op by creating a special node is a good idea.

cryos commented 7 years ago

Great, it sounds like we are in agreement assuming @Hovden has no further comments. We will see what we can get in place, but this and detached datasets depend to an extent on some improvement to the state file mechanism to be fully reproducible (detachment like this wasn't considered in the original design).

yijiang1 commented 7 years ago

Also, for the record I'm totally OK with cloning reconstructions to new parent datasets. We can just add some instructions/warnings in tutorial and transform descriptions.

Hovden commented 7 years ago

@yijiang1 Cloning a reconstruction (or anything down the pipeline) would lose reproducibility. The idea of a child dataset is it would retain the steps leading up to it by being associated with the parent pipeline. One should only want to clone the steps of the parent before the reconstruction and run an new reconstruction.

If I follow @cryos properly, he is suggesting that there is a way to cache all the operations at a point in the pipeline (like when a child is created). The steps before this cache would no longer be modifiable. That, or something similar, would work.

Just a note: Originally, I thought that a child dataset would be nothing more than an new data set held in memory and indented in the pipeline. Anything above it would effectively be ignored and not modifiable without first deleting the child. New data transforms are applied to the child. Like any new dataset, it would need to be saved if you wanted to retain it after closing.

cryos commented 7 years ago

This is now resolved, and we will track the issue with state files crashing in #983.

OpenChemistry / tomviz

Child datasets and Caching in Pipeline #979