[spike] On update, tell ingest which analysis is being updated

justincc commented 5 years ago

Originally, we had a strawman assumption that a primary bundle would only ever have a single secondary bundle. All analysis updates to ingest would update that single secondary bundle.

However, recent e-mail discussions suggest that this assumption is not tenable, for the reasons listed in this section of the updates pre-RFC. If we drop it, the analysis component will have to tell ingest whether an incoming analysis should update an existing analysis and, if so, which one. Ingest cannot determine this from the information that currently accompanies a submitted analysis.

I'm submitting this ticket as a placeholder for secondary analysis to consider this problem. The above is not the final word and is open to discussion. If we go down this route, then whenever an analysis should be treated as an update, green box needs to tell ingest the UUID of the analysis to update.

samanehsan commented 5 years ago

This could be determined based on a combination of the bundle FQID and pipeline version, which is stored as the analysis protocol id in the submission envelope that is sent to ingest (here is an example analysis protocol, where the pipeline version is "cellranger_v1.0.2").

If an update to a primary bundle triggers a re-analysis and the pipeline version is the same, ingest would know to update the existing secondary analysis bundle for that pipeline version. This would be the case, for example, if a bundle was analyzed and a data contributor then realized the cell count was incorrect.

If the primary bundle has not changed but the pipeline version has changed, there are two scenarios to consider:

If the pipeline changed due to a bug fix, then update the existing, incorrect analysis results (or redact them and create a new version of the analysis results).
If new functionality was added to the pipeline, we would consider the result a new version of the previous analysis. This would be the case, for example, if we release v2.0.0 of the Optimus pipeline with additional analysis features, outputs, etc. and want to re-analyze existing datasets.

Based on this approach, there should be enough information in the submitted analysis for ingest to determine whether it is a new version of the previous analysis, or an update to the old analysis.
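The rules above can be sketched roughly as follows. This is only an illustration, not ingest's actual logic: `PreviousAnalysis`, `decide`, and the `is_bug_fix` flag are hypothetical names, and the sketch assumes a bundle FQID has the form `<uuid>.<version>`. Cases the rules do not cover (for example, both the primary bundle and the pipeline version changing) are handled by assumption and marked as such in the comments.

```python
from enum import Enum
from typing import NamedTuple, Optional, Tuple


class Action(Enum):
    CREATE_NEW = "create a new analysis"
    UPDATE_EXISTING = "update the existing analysis"
    NEW_VERSION = "create a new version of the analysis"


class PreviousAnalysis(NamedTuple):
    """Hypothetical record of an earlier analysis submission."""
    input_bundle_fqid: str   # assumed form: "<uuid>.<version>"
    pipeline_version: str    # e.g. "cellranger_v1.0.2"


def split_fqid(fqid: str) -> Tuple[str, str]:
    """Split an FQID at the first dot into (uuid, version)."""
    uuid, _, version = fqid.partition(".")
    return uuid, version


def decide(previous: Optional[PreviousAnalysis],
           input_bundle_fqid: str,
           pipeline_version: str,
           is_bug_fix: bool = False) -> Action:
    """Decide how an incoming analysis relates to a previous one.

    `is_bug_fix` is a hypothetical flag; how a bug-fix release would
    actually be signalled is not settled in this thread.
    """
    if previous is None:
        return Action.CREATE_NEW

    prev_uuid, _ = split_fqid(previous.input_bundle_fqid)
    uuid, _ = split_fqid(input_bundle_fqid)
    if uuid != prev_uuid:
        # Assumption: a different primary bundle is never an update.
        return Action.CREATE_NEW

    if pipeline_version == previous.pipeline_version:
        # Same pipeline, re-analysis of the (possibly updated) primary bundle.
        return Action.UPDATE_EXISTING

    # Pipeline version changed. Assumption: this rule also applies when the
    # primary bundle changed at the same time.
    return Action.UPDATE_EXISTING if is_bug_fix else Action.NEW_VERSION
```

For instance, re-running `cellranger_v1.0.2` on a newer version of the same primary bundle gives `Action.UPDATE_EXISTING`, while a feature release on an unchanged bundle gives `Action.NEW_VERSION`.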

mweiden commented 5 years ago

Surfacing some information from the AUDR pre-RFC here:

To actually find the bundles based on the relevant input information, you could use the DSS's search API and search for the secondary analysis bundles with the correct inputs and pipeline version.
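As a rough illustration of what such a search could look like, the sketch below only builds a request body and makes no network call. It assumes the search endpoint accepts an Elasticsearch query wrapped under an `es_query` key; the two field paths are hypothetical placeholders and would need to be replaced with the real paths in the indexed bundle metadata.

```python
import json
from typing import Any, Dict

# Hypothetical field paths: placeholders only, not confirmed metadata paths.
INPUT_BUNDLE_FIELD = "files.analysis_process_json.input_bundles"
PIPELINE_VERSION_FIELD = "files.analysis_protocol_json.protocol_core.protocol_id"


def build_search_body(input_bundle_uuid: str,
                      pipeline_version: str,
                      input_field: str = INPUT_BUNDLE_FIELD,
                      version_field: str = PIPELINE_VERSION_FIELD) -> Dict[str, Any]:
    """Build a search body matching secondary bundles by input and pipeline.

    Both conditions must hold, so they go into a bool/must clause.
    """
    return {
        "es_query": {
            "query": {
                "bool": {
                    "must": [
                        {"match": {input_field: input_bundle_uuid}},
                        {"match": {version_field: pipeline_version}},
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    # "primary-bundle-uuid" is a made-up value for illustration.
    body = build_search_body("primary-bundle-uuid", "cellranger_v1.0.2")
    print(json.dumps(body, indent=2))
```

If the search returns exactly one secondary bundle, that bundle would identify the analysis to update; zero or multiple hits would need a policy that is not decided here.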

simonjupp commented 5 years ago

@samanehsan given that it is up to the analysis team to work out whether they need to re-analyse data in a bundle, you also need to decide whether you are updating a previous analysis or creating a new one in ingest. (You can't create a new analysis in ingest and say that it belongs to a previous bundle, as it is the analysis process that defines those bundles - although we might need more discussion on what truly defines a bundle!)

Currently bundles generated by ingest are either:

"assay bundles" (aka primary bundles) = any process that has biomaterial as input and file as output or
"analysis bundles" (aka secondary bundles) = any process that has file as input and output.

If ingest detects a new process of one of these types, it will trigger an export of new bundles. There's no concept of linking to another bundle (linking happens in the metadata through links.json and fqids). (The fact that we let analysis tell us the input bundle is just a convenience ingest added, and it requires ingest to keep a full journal of every bundle in the datastore.)
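The two bundle definitions above could be expressed as a small classifier. This is a minimal sketch rather than ingest's real code: the function name and the string labels `"biomaterial"` and `"file"` are assumptions, and so is the choice to treat a process with both biomaterial and file inputs as an assay.

```python
from typing import Iterable, Optional


def bundle_type(input_types: Iterable[str],
                output_types: Iterable[str]) -> Optional[str]:
    """Classify a process by the kinds of entities it consumes and produces.

    Returns "assay" (primary bundle), "analysis" (secondary bundle), or
    None when the process matches neither definition.
    """
    inputs = set(input_types)
    outputs = set(output_types)

    # Both definitions require a file as output.
    if "file" not in outputs:
        return None
    # Assumption: biomaterial input takes precedence over file input.
    if "biomaterial" in inputs:
        return "assay"
    if "file" in inputs:
        return "analysis"
    return None
```

Under these assumptions, `bundle_type(["biomaterial"], ["file"])` yields `"assay"` and `bundle_type(["file"], ["file"])` yields `"analysis"`.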

Now if ingest detects an update to one of these processes, then it follows that we must update any bundles that were created in response to that previous analysis. So if it really is an update, you must update the previous analysis, which requires green box to know the previous analysis id. This isn't something ingest can work out easily. The good news is it looks like you have a way to uniquely identify a previous analysis and the DSS provides a search API to help you.

The key point I'm trying to get across is that you can't create a new analysis in ingest and expect ingest to work out that it actually represents a previous bundle that was created for a different analysis.

HumanCellAtlas / secondary-analysis

[spike] On update, tell ingest which analysis is being updated #582