HumanCellAtlas / secondary-analysis

Secondary Analysis Service of the Human Cell Atlas Data Coordination Platform
https://pipelines.data.humancellatlas.org/ui/
BSD 3-Clause "New" or "Revised" License

[spike] On update, tell ingest how files in the update relate to files in the previous analysis #583

Open justincc opened 5 years ago

justincc commented 5 years ago

On an analysis update, green box needs to tell ingest how the files in the update relate to the files in the previous version of the analysis (i.e. whether they are new versions of existing files, entirely new files, or whether previous files should be removed from the new bundle). This is explained further in this section of the updates RFC.

I am submitting this ticket to ask you to consider a design for this problem. This might be relatively simple - it could be that:

a) any updated file with the same path is a new version of an existing file with the same path
b) any updated file that doesn't have a corresponding path in the old bundle is a new file
c) any file in the old bundle that doesn't have one with that path in the updated analysis shouldn't be in the updated bundle.
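As a rough illustration of rules a) to c), here is a minimal Python sketch that classifies files purely by path. The names `FileRelations` and `relate_files` are hypothetical and are not part of green box or ingest; the sketch also assumes that paths are unique within a bundle.

```python
from dataclasses import dataclass, field


@dataclass
class FileRelations:
    """Hypothetical container for how updated files relate to the old bundle."""
    new_versions: list = field(default_factory=list)  # rule a)
    new_files: list = field(default_factory=list)     # rule b)
    removed: list = field(default_factory=list)       # rule c)


def relate_files(previous_paths, updated_paths):
    """Classify files by path only (assumes paths are unique per bundle)."""
    previous = set(previous_paths)
    updated = set(updated_paths)
    return FileRelations(
        # a) same path in both: a new version of an existing file
        new_versions=sorted(previous & updated),
        # b) path only in the update: an entirely new file
        new_files=sorted(updated - previous),
        # c) path only in the old bundle: leave it out of the updated bundle
        removed=sorted(previous - updated),
    )
```

For example, `relate_files(["a.bam", "b.csv"], ["a.bam", "c.json"])` would report `a.bam` as a new version, `c.json` as a new file and `b.csv` as removed. Note that this treats a renamed file as one removal plus one new file, which may or may not be what we want.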

OR possibly that all files can be considered new and don't need to be related to the previous bundle files. This may or may not have implications for provenance tracking.

The above are suggested solutions off the top of my head; everything is open for discussion. This is a follow-on from #582

mweiden commented 5 years ago

Doing this may not be necessary. Let me try to unpack this.

First some assumptions about the data lifecycle:

  1. Pipelines output data (vs metadata)
  2. Data outputs of new versions of pipelines are distinct from outputs of prior versions (unless they are identical bit-for-bit, in which case you don't need this feature).
  3. Data needs stable accession identifiers. (Once data is out there, we need to provide stable links to it.)
  4. To retire old analyses, we can simply leave them out of the next data release.

Under these assumptions, you don't need to update secondary analysis bundles when a pipeline changes. You just create a new bundle.
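To make that concrete, here is a minimal Python sketch of the "create a new bundle" approach under the assumptions above. Everything in it is hypothetical: `digest`, `bundle_for_rerun` and the dictionary layout are illustrative only and do not reflect the actual bundle format or any DCP API.

```python
import hashlib
import uuid


def digest(outputs):
    """Hash a mapping of path -> bytes, independent of insertion order."""
    h = hashlib.sha256()
    for path in sorted(outputs):
        h.update(path.encode("utf-8"))
        h.update(b"\0")
        h.update(hashlib.sha256(outputs[path]).digest())
    return h.hexdigest()


def bundle_for_rerun(previous_bundle, new_outputs):
    """Return the bundle that should represent a pipeline re-run.

    Bit-for-bit identical outputs reuse the previous bundle (assumption 2).
    Anything else gets a fresh bundle with its own identifier, so the old
    bundle and its identifier stay untouched (assumption 3).
    """
    new_digest = digest(new_outputs)
    if previous_bundle is not None and previous_bundle["digest"] == new_digest:
        return previous_bundle
    return {
        "uuid": str(uuid.uuid4()),  # hypothetical stable identifier
        "digest": new_digest,
        "paths": sorted(new_outputs),
    }
```

Retiring an old analysis (assumption 4) then needs no change to the old bundle at all; it is just not listed in the next release.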

samanehsan commented 5 years ago

Pipelines output data and metadata about the analysis. If an update requires re-running the analysis, then the metadata about the analysis workflow (the analysis process JSON) will be different, as will the data outputs. The analysis protocol JSON will only change if the pipeline version changes. I think a new bundle makes sense if the data will be different.