broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Implement gCNV mega workflow. #5633

Open samuelklee opened 5 years ago

samuelklee commented 5 years ago

Will include clustering subworkflow #5632 and subdivision into batches for gCNV cohort/case workflows.

samuelklee commented 5 years ago

Proof-of-principle in sl_mega_wdl branch in gatk-evaluation.

samuelklee commented 3 years ago

I’ve also revisited this work for MalariaGEN, additionally including further cleanup of the canonical part of the WDLs (mostly low-hanging fruit like adding structs, which help a lot in cutting down parameter cruft on Terra).

For ease of iteration, this work broke things up into 3 pushes of a button: 1) data collection, 2) preclustering (done in a relatively modular way, so you can swap in whatever clustering script you like, as long as it outputs hard/soft responsibilities) plus random selection of training cohorts, and 3) cohort mode + scattered case mode on all clusters. But there’s no reason we couldn’t link some of those up.
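As a rough illustration of step 2, here is a minimal Python sketch of turning soft responsibilities into hard cluster assignments and then randomly selecting a training cohort per cluster. It is not taken from the actual WDLs or clustering scripts; the function names (`hard_assignments`, `select_training_cohorts`), the dict-based input layout, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def hard_assignments(responsibilities):
    """Map each sample to the index of its highest-responsibility cluster.

    responsibilities: dict of sample name -> list of per-cluster weights
    (hypothetical layout; a real clustering script may emit a matrix or TSV).
    """
    return {
        sample: max(range(len(weights)), key=weights.__getitem__)
        for sample, weights in responsibilities.items()
    }


def select_training_cohorts(assignments, n_training, seed=0):
    """Randomly pick up to n_training samples per cluster for cohort mode.

    Returns (training, cases): dicts from cluster index to sorted sample lists.
    Samples not chosen for training are left for case mode.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_cluster = defaultdict(list)
    for sample, cluster in sorted(assignments.items()):
        by_cluster[cluster].append(sample)

    training, cases = {}, {}
    for cluster, samples in by_cluster.items():
        chosen = set(rng.sample(samples, min(n_training, len(samples))))
        training[cluster] = sorted(chosen)
        cases[cluster] = sorted(s for s in samples if s not in chosen)
    return training, cases


# Toy input: four samples, two clusters.
resp = {"s1": [0.9, 0.1], "s2": [0.2, 0.8], "s3": [0.6, 0.4], "s4": [0.7, 0.3]}
assign = hard_assignments(resp)
training, cases = select_training_cohorts(assign, n_training=2)
```

In this sketch, each cluster's `training` list would feed one cohort-mode run, and the matching `cases` list would be scattered over case-mode runs against that cluster's model.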

No problem running 16k samples, with 6 clusters and 300 training samples per cluster, but also note I was only running a single genomic shard containing CNVs of interest for this use case. (I did manage to break Terra for a few days when I tried to attach collected counts to the data model in what I would’ve thought would be a relatively trivial way, but that’s another matter.)

I’ve shared some version of these WDLs over Slack previously, but happy to also open up a branch here.

I think some of this work may be replicated in GATK-SV and I’m also not sure what we want to make canonical. Surely most users will run only a single cluster. But from the perspective of our MalariaGEN collaborators, the more of what I put together for them that is made canonical, the better, as this will ease future maintenance. But I’ll leave it up to other current stakeholders.

mwalker174 commented 2 years ago

I think the 3-step breakdown is the way to go. We would like to have something like this for gatk-sv as well. We typically batch based on 1) median depth quantile, and 2) dosage bias scoring (https://github.com/RCollins13/WGD). I think PCA might be a generalization of the latter, but we should try to converge on a batching scheme to use across projects (MalariaGEN, gatk-sv, and various WES projects). This step has also proven critical for QC and sample filtering.
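For concreteness, batching on median depth quantile could look roughly like the Python sketch below. This is only an illustration of the first criterion, not the gatk-sv implementation, and it ignores dosage bias scoring entirely; the function name `batch_by_depth_quantile` and the dict input are hypothetical.

```python
def batch_by_depth_quantile(median_depths, n_batches):
    """Split samples into n_batches near-equal groups by median depth rank.

    median_depths: dict mapping sample name to its median coverage depth.
    Returns a list of batches (lowest depth first), each a list of sample names.
    """
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    # Rank by depth, breaking ties by sample name for determinism.
    ranked = sorted(median_depths, key=lambda s: (median_depths[s], s))
    n = len(ranked)
    batches = [[] for _ in range(n_batches)]
    for rank, sample in enumerate(ranked):
        # Integer arithmetic maps rank 0..n-1 onto quantile bins 0..n_batches-1.
        batches[rank * n_batches // n].append(sample)
    return batches


# Toy input: six samples split into depth tertiles.
depths = {"a": 30.1, "b": 12.0, "c": 45.5, "d": 28.7, "e": 15.2, "f": 33.3}
batches = batch_by_depth_quantile(depths, n_batches=3)
```

A second pass within each depth bin, using dosage bias scores or PCA components, would be the natural place to layer on the second criterion.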