Harmonize organism reference creation

ambrosejcarr commented 5 years ago

Reference construction

Currently the skylab repository creates a number of references that are used by the Smart-seq2, Cell Ranger, and Optimus pipelines. These include:

hisat2
rsem
star
cellranger
kallisto

In addition, there are reference creation options for different genome subsets:

primary assembly
transcriptome

In accessory workflows, there are additionally:

kallisto tests for SS2
SS2 tests for hisat2 and star
a reference checker that also tests SS2 for star

Design proposal: A single workflow that takes as parameters:

organism
gencode fasta target
gencode gtf target
booleans (default False) to request the creation of the various required references from these files
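
As a rough illustration of the parameter list above, here is a minimal sketch in Python (the real workflow would presumably be written in WDL). The class name `ReferenceRequest`, its field names, and the `build` mapping are all hypothetical; only the list of reference types comes from this issue.

```python
from dataclasses import dataclass, field

# Reference types listed earlier in this issue.
SUPPORTED_REFERENCES = ("hisat2", "rsem", "star", "cellranger", "kallisto")


@dataclass
class ReferenceRequest:
    """Hypothetical container for the proposed workflow parameters."""

    organism: str
    gencode_fasta: str  # location of the gencode fasta target
    gencode_gtf: str    # location of the gencode gtf target
    # One boolean per reference type; anything not mentioned defaults to False.
    build: dict = field(default_factory=dict)

    def __post_init__(self):
        unknown = set(self.build) - set(SUPPORTED_REFERENCES)
        if unknown:
            raise ValueError(f"unknown reference types: {sorted(unknown)}")
        # Fill in the default (False) for every flag that was not given.
        self.build = {
            name: bool(self.build.get(name, False))
            for name in SUPPORTED_REFERENCES
        }

    def requested(self):
        """Return the reference types whose creation was requested."""
        return [name for name in SUPPORTED_REFERENCES if self.build[name]]
```

With no flags set, `requested()` returns an empty list, so nothing is built unless a reference is explicitly asked for.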

Additional requirements:

The workflow should upload the files to a specified bucket location and hierarchy, labeling them with the current date:
```
gs://hca-dcp-sc-pipelines-test-data/pipelineReferences/organism/reference_name/reference_version/datestamp/files
```
The workflow should expose a test that is run as a cron job which attempts to align files with the most recent reference for each organism
The workflow should generate a static-inputs.json file for each workflow run by skylab.
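
The bucket hierarchy above could be assembled along these lines. This is only a sketch: the function name `reference_prefix` is hypothetical, and the ISO `YYYY-MM-DD` datestamp format is an assumption, since the issue does not specify one.

```python
from datetime import date

# Bucket root taken from the path template in this issue.
BUCKET_ROOT = "gs://hca-dcp-sc-pipelines-test-data/pipelineReferences"


def reference_prefix(organism, reference_name, reference_version, build_date=None):
    """Build the upload prefix organism/reference_name/reference_version/datestamp.

    The datestamp format (ISO 8601, e.g. 2019-01-02) is an assumption.
    """
    stamp = (build_date or date.today()).isoformat()
    return "/".join(
        [BUCKET_ROOT, organism, reference_name, reference_version, stamp]
    )
```

An ISO datestamp has the side benefit that the most recent reference for an organism sorts last lexicographically, which would make it easy for the cron test to pick it up.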

barkasn commented 5 years ago

Must be easily extensible to new organisms
It should not be limited to gencode
Needs to include instructions on how to add new references to the build process
I propose that it checks whether the md5 of the latest build matches that of the previous one and, if so, skips the upload to avoid file duplication
There should be a way to easily parametrise reference creation (e.g. STAR splice site flanking sequence length)
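
The md5 check proposed above might look roughly like this in Python. The helper names `md5_of_file` and `needs_upload` are hypothetical, and the sketch assumes the md5s of the previous build are available as a mapping from file name to hex digest. It also assumes that reference builds are byte-for-byte reproducible, which may not hold for every tool.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex md5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(new_md5s, previous_md5s):
    """Return True if the new build differs from the previous one.

    Both arguments map file name -> hex md5 (an assumed representation).
    A missing previous build (None or empty) always triggers an upload.
    """
    if not previous_md5s:
        return True
    return new_md5s != previous_md5s
```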

kbergin commented 5 years ago

I worry about the last bullet point on the static-inputs.json. It seems like it'd be really easy for the workflow that generates that to get out of date if we were to add an input or parameter to the workflow and not remember to update this thing that generates inputs for us. Or is a static-inputs json different from a regular inputs json?

ambrosejcarr commented 5 years ago

> Must be easily extensible to new organisms

👍

> It should not be limited to gencode

👍 but will do later, since HCA doesn't need this.

> Needs to include instructions on how to add new references to the build process

👍

> I propose that it checks whether the md5 of the latest build matches that of the previous one and, if so, skips the upload to avoid file duplication

Good idea, but would prioritize lower than other suggestions since this is a cost optimization.

> There should be a way to easily parametrise reference creation (e.g. STAR splice site flanking sequence length)

Is this satisfied by passing a very large number of default parameters to reference creation?

> I worry about the last bullet point on the static-inputs.json. It seems like it'd be really easy for the workflow that generates that to get out of date if we were to add an input or parameter to the workflow and not remember to update this thing that generates inputs for us. Or is a static-inputs json different from a regular inputs json?

As I look through the existing reference data it's hard to intuit which references to use in what tasks, and what WDL to use to create references for new organisms. The use case I'm hoping to satisfy is that a user writing a new pipeline could easily understand which references are already generated in a satisfactory way for their pipeline, and which references need new generation tasks.

kbergin commented 4 years ago

I believe we did this... Closing issue.

HumanCellAtlas / skylab

Harmonize organism reference creation #207
