HumanCellAtlas / data-store

Design specs and prototypes for the HCA Data Storage System (DSS, "blue box")
https://dss.staging.data.humancellatlas.org/

Spec - Query support needed by Green Box and others #54

Closed briandoconnor closed 7 years ago

briandoconnor commented 7 years ago

@mikebaumann can you point me to the doc/work you and Dave S. have done (it was Dave S right?) on the queries the green box will need? I want to go through these on Friday and also link to the docs from this ticket (and also close this ticket once I've confirmed we have the queries clearly documented). Thanks.

dshiga commented 7 years ago

Hi @briandoconnor - hopefully the info below gives you everything you need. Sorry it's a bit disjointed -- it's a lightly edited transcript from my Slack conversation with @mikebaumann yesterday.

At a high level, there are two ways we think we'll use queries:

1) Pre-register a query with blue box that will trigger a notification whenever a new bundle or bundle version is created that matches. This is how we will normally trigger workflows in green box and is the most important use case for us in the near term.
2) "Ad-hoc" queries that we'll run against blue box when we release a major new pipeline version that requires us to rerun a lot of historical data through the new version (hopefully we won't need to do this for a while, though).

With pre-registered queries, we would expect each notification to refer to a single bundle that matches that query. With an ad-hoc query, we would expect to get back a response that refers to all the bundles that match.
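A minimal sketch of the two query modes described above, using an in-memory stand-in for blue box. The class `QueryIndex`, its method names, and the `bundle_id` and `assayType` keys are all hypothetical and only illustrate the expected behavior: one notification per matching bundle for a pre-registered query, and one response listing every match for an ad-hoc query.

```python
from typing import Callable, Dict, List, Tuple

Bundle = Dict[str, object]
Query = Callable[[Bundle], bool]
Callback = Callable[[Bundle], None]


class QueryIndex:
    """Toy stand-in for blue box (hypothetical, not the DSS API)."""

    def __init__(self) -> None:
        self.bundles: List[Bundle] = []
        self.subscriptions: List[Tuple[Query, Callback]] = []

    def register(self, query: Query, callback: Callback) -> None:
        # Pre-registered query: remembered and evaluated on every new bundle.
        self.subscriptions.append((query, callback))

    def add_bundle(self, bundle: Bundle) -> None:
        # Each matching subscription gets one notification for this one bundle.
        self.bundles.append(bundle)
        for query, callback in self.subscriptions:
            if query(bundle):
                callback(bundle)

    def search(self, query: Query) -> List[Bundle]:
        # Ad-hoc query: a single response covering all stored matches.
        return [b for b in self.bundles if query(b)]


def is_dropseq(bundle: Bundle) -> bool:
    return bundle.get("assayType") == "DropSeq"


index = QueryIndex()
notified: List[Bundle] = []
index.register(is_dropseq, notified.append)

index.add_bundle({"bundle_id": "b1", "assayType": "DropSeq"})
index.add_bundle({"bundle_id": "b2", "assayType": "SmartSeq2"})
index.add_bundle({"bundle_id": "b3", "assayType": "DropSeq"})

historical = index.search(is_dropseq)
```

In this sketch `notified` receives `b1` and `b3` one at a time as they are added, while `historical` returns both in a single list.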

I could imagine it being something like, "Trigger on all new bundles or bundle versions that contain primary data" i.e. everything except bundles that green box itself has deposited, with outputs from green workflows. We might want to also restrict by assay type -- "all new bundle versions that have assayType = DropSeq". If we do that then we'd have one query for each pipeline that we support, each specifying a different assay type.

I don't know if "primary data" is a coherent concept, but I could imagine a field in the metadata called analysisLevel or something that would be 1 for fastqs, 2 for bam files and gene/cell expression matrices that green produces, 3 for cohort analyses that green might someday support (outputs from a pipeline that takes gene/cell expression matrices and identifies clusters, etc). Or perhaps it would be simpler, like primaryData=true or false -- then there would just be primary data and everything else. We just need some way to avoid triggering on bundles that green deposits, but I'm open to ideas about how to specify that.

For example, given the metadata here: https://github.com/HumanCellAtlas/data-bundle-examples/blob/develop/dropseq/GSE81904/ we might just query for assay.single_cell.method = "drop-seq" (assuming that we can query the contents of assay.json this way). Based on the json I see in that dir, I'm not sure how we can easily tell that it's primary data though as I could imagine a lot of this metadata being copied through to the secondary analysis bundle deposited by green.
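As a rough illustration of what matching on `assay.single_cell.method` could look like, here is a sketch that resolves a dotted path against nested JSON. It assumes the contents of assay.json are exposed under a top-level `assay` key; that layout, and the helper names `get_path` and `matches`, are assumptions made for the example and not part of any spec.

```python
from typing import Any, Mapping


def get_path(doc: Mapping[str, Any], dotted_path: str, default: Any = None) -> Any:
    """Follow a dotted path such as 'assay.single_cell.method' through nested dicts."""
    current: Any = doc
    for key in dotted_path.split("."):
        if not isinstance(current, Mapping) or key not in current:
            return default
        current = current[key]
    return current


def matches(doc: Mapping[str, Any], criteria: Mapping[str, Any]) -> bool:
    """True when every dotted path in criteria equals the expected value."""
    return all(get_path(doc, path) == expected for path, expected in criteria.items())


# Hypothetical bundle metadata: assay.json nested under an "assay" key.
bundle_metadata = {"assay": {"single_cell": {"method": "drop-seq"}}}
is_dropseq = matches(bundle_metadata, {"assay.single_cell.method": "drop-seq"})
```

Note that this check alone cannot tell primary data from a secondary analysis bundle that copies the same assay metadata through, which is the concern raised above.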

No other criteria are coming to mind that we would need to restrict on -- maybe some will come up as we go along. I think the idea for green is that it runs on everything that's deposited into blue, so in some sense we want to be notified about everything (except green's own outputs).

I've seen other suggestions that we want to restrict on species = human and maybe other things -- I don't know that we really need that (as long as the species is in the metadata, we can intelligently pick the appropriate reference in the pipeline).

It would be very convenient if the notifications could contain the manifest including the cloud storage paths (I assume bundle id would be in there too). I know there is some concern about not making these messages too big, so we have been preparing to make follow up http requests to blue to get that manifest, assuming we only get the bundle id and maybe a timestamp or something in the message.
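A sketch of how a consumer might handle both cases: use the manifest if the notification carries it, otherwise make a follow-up request with the bundle id. The message fields (`bundle_id`, `manifest`, `files`, `url`) and the `fetch_manifest` callable are hypothetical; the real message format had not been specified at this point in the discussion.

```python
import json
from typing import Callable, Dict, List

Manifest = Dict[str, list]


def storage_paths(message_body: str,
                  fetch_manifest: Callable[[str], Manifest]) -> List[str]:
    """Return cloud storage paths for the bundle named in a notification."""
    message = json.loads(message_body)
    manifest = message.get("manifest")
    if manifest is None:
        # Small message: only the bundle id (and maybe a timestamp) was sent,
        # so ask blue box for the manifest in a follow-up request.
        manifest = fetch_manifest(message["bundle_id"])
    return [entry["url"] for entry in manifest["files"]]


def fake_fetch(bundle_id: str) -> Manifest:
    # Stand-in for the follow-up HTTP request; returns a made-up manifest.
    return {"files": [{"url": f"gs://example-bucket/{bundle_id}/r1.fastq.gz"}]}


small = json.dumps({"bundle_id": "b1", "timestamp": "2017-06-01T00:00:00Z"})
large = json.dumps({"bundle_id": "b1",
                    "manifest": {"files": [{"url": "gs://example-bucket/b1/inline.fastq.gz"}]}})

paths_from_small = storage_paths(small, fake_fetch)
paths_from_large = storage_paths(large, fake_fetch)
```

Passing the fetcher in as an argument keeps the handler testable without a network call and lets the same code work whichever message size is chosen.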

So we expect the queries that we register for notification will be rather broad, with just a few values to match using simple boolean operators. I'm not clear on how many files might be in a bundle; hopefully it's pretty small, but others like Tim might know better than I do.

I should also mention that I've been assuming that one bundle = one notification = one workflow run and hopefully that holds at least in the short term.

It would be convenient if the gs URL could be included in the file metadata that is being spec-ed here: https://docs.google.com/document/d/1jQGC0Ah2gdtzUxEeVvj0OiGM9HbmOjn8wt6LazIeUhI/edit

ttung commented 7 years ago

If all green box outputs write a known key to metadata, then you can just add another condition to your query (i.e., !written_by=greenbox). Let's not build any special logic to support that. :)
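A sketch of that extra condition expressed with simple boolean combinators. The helpers `field_equals`, `not_`, and `and_` and the flat metadata layout are assumptions for illustration; only the `written_by=greenbox` key and the `assayType = DropSeq` example come from this thread.

```python
from typing import Any, Callable, Mapping

Predicate = Callable[[Mapping[str, Any]], bool]


def field_equals(key: str, value: Any) -> Predicate:
    return lambda metadata: metadata.get(key) == value


def not_(predicate: Predicate) -> Predicate:
    return lambda metadata: not predicate(metadata)


def and_(*predicates: Predicate) -> Predicate:
    return lambda metadata: all(p(metadata) for p in predicates)


# "assayType = DropSeq AND NOT written_by = greenbox"
trigger = and_(
    field_equals("assayType", "DropSeq"),
    not_(field_equals("written_by", "greenbox")),
)

primary = {"assayType": "DropSeq"}  # no written_by key at all
secondary = {"assayType": "DropSeq", "written_by": "greenbox"}

fires_on_primary = trigger(primary)
fires_on_secondary = trigger(secondary)
```

In this sketch a bundle with no `written_by` key passes the negated condition, so only depositors that want to be excluded need to write the key.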

dshiga commented 7 years ago

That would work for now. However, there might be scenarios down the road where we do consume our own outputs in order to do cohort-level analysis.

briandoconnor commented 7 years ago

Hi @dshiga, @ttung, @mikebaumann

I tried to distill our current thinking about the query for the Boston demo coming up on 6/27. Take a look at the search criteria documented here and expand as needed: https://docs.google.com/document/d/1oNgk15q90a6O33vUszUTq9bwwe3gp349bYBIO0BdpPo/edit#heading=h.z104wvnawtxf

I'm not sure this is the best place for it, since there are a few different specs where this might fit, but I wanted to put it somewhere. Other places where we discuss the query strategy for the Green box:

@dshiga your description above is really great! I dropped it into the first spec above since I didn't want it to be lost when I close this ticket.