broadinstitute / lrma-aou1-panel-creation

Pipelines and evaluations covering integration, phasing, and imputation of short and structural variants for the AoU Phase 1 long-reads callset.

Automated evaluation of current end-to-end pipeline. #1

Open samuelklee opened 1 month ago

samuelklee commented 1 month ago

Meta-issue. Please spin out issues and self-assign.

For the first cut, let's organize WDLs and resources for a pipeline that takes

Inputs

and goes through

Methods

covered by per-stage evaluations (as sensible)

Evaluations

For now, freely open PRs and merge without review, but please do use descriptive commit messages and PR titles. Furthermore, please commit fresh copies of all relevant WDLs, indicate their versioned provenance, and provide a link in a corresponding PR comment, where appropriate/possible.

I will organize the end-to-end evaluation in a megaWDL and then do a round of cleanup of the subworkflows after an initial manual run (with the goal being to show that cleanup does not affect performance). I expect running this evaluation to be manual for the near future, but we can think about CI testing later if it makes sense.

We'll continue to work with hg38 chr1:100-110Mbp to start.

Once this settles (hopefully within a week or two), we'll be better able to see where @rlorigro can slot in Hapestry methods and demonstrate improvement. If it makes sense, we can expand coverage of the pipelines upstream to intra/intersample integration and downstream to SR genotyping/phasing/imputation.

samuelklee commented 1 month ago

#20 was just merged and gets us most of the way there. Some remaining TODOs:

But even before adding these, note that the drop in recall in the last VCF produced by PanGeniePanelCreation (e.g., for SVs, from 82% in VcfdistEvaluationShapeit4 to ~60% in VcfdistEvaluationPanel) is a good enough indicator of remaining overlap issues to guide development for now. Again, this drop results solely from the PanGenie script removing an entire allele if it is found to overlap in any one sample. In this case, the 6 additional FNs (note one locus with multiple removed alleles) are at:

Two overlapping variants at same haplotype at chr1:100528654, set allele to missing.
Two overlapping variants at same haplotype at chr1:104688481, set allele to missing.
Two overlapping variants at same haplotype at chr1:105788297, set allele to missing.
Two overlapping variants at same haplotype at chr1:107555333, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
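A quick way to tally these messages is sketched below. It assumes, following the description above, that each message corresponds to one removed allele; the helper name `count_removed_alleles` and the region constant are illustrative and not part of the PanGenie script or this repo.

```python
import re
from collections import Counter

# Log lines copied verbatim from the comment above.
LOG = """\
Two overlapping variants at same haplotype at chr1:100528654, set allele to missing.
Two overlapping variants at same haplotype at chr1:104688481, set allele to missing.
Two overlapping variants at same haplotype at chr1:105788297, set allele to missing.
Two overlapping variants at same haplotype at chr1:107555333, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
"""

# Evaluation region used in this issue: hg38 chr1:100-110Mbp.
REGION = ("chr1", 100_000_000, 110_000_000)

# Matches the "<chrom>:<pos>, set allele to missing." part of each message.
PATTERN = re.compile(r"at (?P<chrom>\w+):(?P<pos>\d+), set allele to missing\.")


def count_removed_alleles(log_text):
    """Return a Counter mapping (chrom, pos) to the number of messages at that locus."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[(match["chrom"], int(match["pos"]))] += 1
    return counts


counts = count_removed_alleles(LOG)
total = sum(counts.values())
print(f"{total} messages at {len(counts)} distinct loci")

# Loci reported more than once, i.e. with multiple removed alleles.
for (chrom, pos), n in sorted(counts.items()):
    if n > 1:
        print(f"{chrom}:{pos} reported {n} times")

# Sanity check: every reported locus falls inside the evaluation region.
in_region = all(
    chrom == REGION[0] and REGION[1] <= pos < REGION[2] for chrom, pos in counts
)
print("all loci in region:", in_region)
```

The same tally could be run on the full log of a larger run to see how the number of removed alleles grows.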

It is easy to see how this problem is exacerbated when we have many more samples.
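To make the scaling concrete, here is a toy model only; it is not derived from the PanGenie script or from our data. It assumes each sample independently has some probability of carrying an overlap at a given allele, and that the allele is removed if any one sample does. The function name and the per-sample probability of 0.01 are hypothetical.

```python
def retained_fraction(p_overlap_per_sample, n_samples):
    """Expected fraction of alleles kept when an allele is removed if it
    overlaps in any one sample.

    Assumption (toy model): overlaps occur independently across samples with
    the same per-sample probability, so the allele survives only if all
    n_samples are overlap-free: (1 - p) ** n.
    """
    if not 0.0 <= p_overlap_per_sample <= 1.0:
        raise ValueError("p_overlap_per_sample must be in [0, 1]")
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    return (1.0 - p_overlap_per_sample) ** n_samples


# Hypothetical per-sample overlap probability, for illustration only.
P_OVERLAP = 0.01

for n in (1, 10, 100, 1000):
    print(f"n_samples={n:5d}  retained={retained_fraction(P_OVERLAP, n):.3f}")
```

Under these assumptions the retained fraction decays geometrically with the number of samples, so even a small per-sample overlap rate removes most alleles in a large panel.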

In any case, we should already have what we need for some simple experiments, e.g.:

samuelklee commented 1 month ago

@kvg probably a good point for you to take a look and get caught up. Sorry, took just over a week :laughing: