HumanCellAtlas / skylab

Soon to be deprecated in favor of broadinstitute/warp github repo. Previously: Secondary analysis pipelines
BSD 3-Clause "New" or "Revised" License

Harmonize organism reference creation #207

Closed ambrosejcarr closed 4 years ago

ambrosejcarr commented 5 years ago

Reference construction

Currently the skylab repository creates a number of references that are used by Smart-seq2, Cell Ranger, and Optimus. These include:

In addition, there are reference creation options for different genome subsets:

In accessory workflows, there are additionally:

Design proposal: A single workflow that takes as parameters:

Additional requirements:

barkasn commented 5 years ago
kbergin commented 5 years ago

I worry about the last bullet point on the static-inputs.json. It seems like it'd be really easy for the workflow that generates that to get out of date if we were to add an input or parameter to the workflow and not remember to update this thing that generates inputs for us. Or is a static-inputs json different from a regular inputs json?

ambrosejcarr commented 5 years ago

Must be easily extensible to new organisms

👍

It should not be limited to gencode

👍 but will do later, since HCA doesn't need this.

Needs to include instructions on how to add new references to the build process

👍

I propose that it checks whether the md5 of the latest build matches that of the previous one, and so avoids uploading duplicate files

Good idea, but would prioritize lower than other suggestions since this is a cost optimization.
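For reference, a minimal sketch of what that md5 check could look like, assuming the checksum of the previous build is stored somewhere the build can read it. The function names and the idea of passing the previous checksum as a string are hypothetical, not something that exists in skylab.

```python
import hashlib
from pathlib import Path
from typing import Optional


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large reference bundles do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(new_build: Path, previous_md5: Optional[str]) -> bool:
    """True when there is no previous build or its checksum differs (hypothetical helper)."""
    if previous_md5 is None:
        return True
    return md5_of_file(new_build) != previous_md5.lower()
```

The upload step would then run only when `needs_upload(...)` returns `True`, which is the cost saving being discussed.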

There should be a way to easily parametrise reference creation (e.g. STAR splice site flanking sequence length)

Is this satisfied by passing a very large number of default parameters to reference creation?
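To make the question concrete, here is a rough sketch of "defaults that can be overridden", assuming the splice site flanking length maps to STAR's `--sjdbOverhang` option. The `StarIndexParams` class, its field names, and the file paths are hypothetical; the default of 100 is what I believe STAR documents, so treat it as an assumption.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class StarIndexParams:
    """Hypothetical parameter bundle for building a STAR index."""
    genome_fasta: str
    annotation_gtf: str
    output_dir: str
    sjdb_overhang: int = 100  # assumed default for --sjdbOverhang
    threads: int = 1


def star_command(params: StarIndexParams) -> List[str]:
    """Build the argument list for a STAR genomeGenerate run; does not execute it."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", params.output_dir,
        "--genomeFastaFiles", params.genome_fasta,
        "--sjdbGTFfile", params.annotation_gtf,
        "--sjdbOverhang", str(params.sjdb_overhang),
        "--runThreadN", str(params.threads),
    ]


# Placeholder paths; a caller overrides only what differs from the defaults.
defaults = StarIndexParams("genome.fa", "annotation.gtf", "star_index")
short_reads = replace(defaults, sjdb_overhang=75)
```

With this shape, adding a parameter means adding one field with a default, and existing callers keep working.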

I worry about the last bullet point on the static-inputs.json. It seems like it'd be really easy for the workflow that generates that to get out of date if we were to add an input or parameter to the workflow and not remember to update this thing that generates inputs for us. Or is a static-inputs json different from a regular inputs json?

As I look through the existing reference data it's hard to intuit which references to use in what tasks, and what WDL to use to create references for new organisms. The use case I'm hoping to satisfy is that a user writing a new pipeline could easily understand which references are already generated in a satisfactory way for their pipeline, and which references need new generation tasks.
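On the staleness worry, one possible guard is a small check that compares the keys in a generated inputs json against the inputs the workflow currently declares. This is only a sketch: `input_drift` is a hypothetical helper, and it assumes the declared input names are available as a list (for example taken from the template that `womtool inputs` prints).

```python
import json
from typing import Dict, Iterable, List


def input_drift(declared_inputs: Iterable[str], inputs_json_text: str) -> Dict[str, List[str]]:
    """Report inputs the workflow declares but the json lacks, and the reverse."""
    provided = set(json.loads(inputs_json_text))  # top-level keys of the inputs json
    declared = set(declared_inputs)
    return {
        "missing": sorted(declared - provided),  # added to the workflow, absent from the json
        "stale": sorted(provided - declared),    # left in the json, removed from the workflow
    }
```

Running this in CI and failing when either list is non-empty would surface the out-of-date case as soon as a workflow input changes.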

kbergin commented 4 years ago

I believe we did this... Closing issue.