Open justincc opened 5 years ago
Doing this may not be necessary. Let me try to unpack this.
First some assumptions about the data lifecycle:

- Pipelines output data and metadata about the analysis.
- If there is an update that requires re-running analysis, then the metadata about the analysis workflow (the analysis process json) as well as the data outputs will be different.
- The analysis process json will only change if the pipeline version changes (the analysis protocol json).

Under these assumptions, you don't need to update secondary analysis bundles when a pipeline changes; you just create a new bundle. I think a new bundle makes sense if the data will be different.
On an analysis update, green box needs to tell ingest how the files in the update relate to the files in the previous version of the analysis (i.e. are they new versions of existing files, entirely new files, or should previous files be removed from the new bundle). This is explained further in this section of the updates RFC.
I am submitting this ticket to ask you to consider a design for this problem. This might be relatively simple - it could be that:
a) any updated file with the same path is a new version of the existing file at that path,
b) any updated file that doesn't have a corresponding path in the old bundle is an entirely new file, and
c) any file in the old bundle with no counterpart at that path in the updated analysis should not be in the updated bundle.
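As a minimal sketch of rules (a)–(c), the path-based matching could look something like the following. This is purely illustrative: the function name, the dict keys, and the example file names are assumptions, not the actual green box or ingest API.

```python
# Hypothetical sketch: relate an updated analysis's files to the previous
# bundle purely by path. Names and structures here are illustrative only.

def classify_update(old_paths, new_paths):
    """Classify updated files against the previous bundle by path."""
    old, new = set(old_paths), set(new_paths)
    return {
        # (a) same path in both: a new version of an existing file
        "new_versions": sorted(old & new),
        # (b) path only in the update: an entirely new file
        "new_files": sorted(new - old),
        # (c) path only in the old bundle: drop from the updated bundle
        "removed": sorted(old - new),
    }

result = classify_update(
    ["matrix.loom", "qc/metrics.csv", "old_report.html"],
    ["matrix.loom", "qc/metrics.csv", "report.html"],
)
```

Here `matrix.loom` and `qc/metrics.csv` would be treated as new versions, `report.html` as a new file, and `old_report.html` as removed.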
OR possibly all files could be considered new, with no need to relate them to the previous bundle's files. This may or may not have implications for provenance tracking.
The above are suggested solutions off the top of my head; everything is open for discussion. This is a follow-on from #582.