broadinstitute / lrma-aou1-panel-creation

Pipelines and evaluations covering integration, phasing, and imputation of short and structural variants for the AoU Phase 1 long-reads callset.


Meta-issue. Please spin out issues and self-assign.

For the first cut, let's organize WDLs and resources for a pipeline that takes

Inputs

the joint short-variant callset
the integrated SV callset
any required resources

and goes through

Methods

physical phasing with HiPhase
short + SV concatenation, variant deduplication, and allele-frequency filtering
statistical phasing with Shapeit4
preprocessing and bubble creation for PanGenie

covered by per-stage evaluations (as sensible)

Evaluations

vcfdist vs. HPRC dipcall truth
bipartite-graph checks
inconsistency checks for phased haplotypes
missingness metrics
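
As a rough illustration of what an inconsistency check for phased haplotypes might look like, here is a minimal sketch. It is not the check used in this repo: the `Variant` record and `find_haplotype_overlaps` are hypothetical names, and it assumes one sample's phased calls on a single contig, sorted by position, with alleles encoded as 0 (ref), a positive integer (alt), or `None` (missing).

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    """Hypothetical minimal record for one sample (not a real VCF parser)."""
    pos: int       # 1-based start, as in VCF POS
    ref_len: int   # length of the REF allele
    gt: tuple[Optional[int], Optional[int]]  # phased alleles (hap0, hap1); None = missing


def find_haplotype_overlaps(variants: list[Variant]) -> list[tuple[int, int, int]]:
    """Return (haplotype, earlier_pos, pos) for each alt call that starts inside
    the REF span of an earlier alt call on the same haplotype.

    Assumes `variants` is sorted by position on a single contig.
    """
    overlaps = []
    for hap in (0, 1):
        prev_pos, prev_end = None, 0  # furthest-reaching alt-carrying REF span so far
        for v in variants:
            allele = v.gt[hap]
            if allele is None or allele == 0:
                continue  # ref or missing: nothing is placed on this haplotype
            if prev_pos is not None and v.pos <= prev_end:
                overlaps.append((hap, prev_pos, v.pos))
            end = v.pos + v.ref_len - 1
            if end > prev_end:
                prev_pos, prev_end = v.pos, end
    return overlaps


# Toy example: a 5-bp deletion at 100 and a SNP at 102, both on haplotype 0.
calls = [Variant(100, 5, (1, 0)), Variant(102, 1, (1, 1)), Variant(110, 1, (0, 1))]
print(find_haplotype_overlaps(calls))  # [(0, 100, 102)]
```

Counting the returned tuples per sample and per VCF stage would give one simple inconsistency metric; anything more refined (e.g., handling insertions anchored at the same base) is left out of this sketch.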

For now, freely open PRs and merge without review, but please do use descriptive commit messages and PR titles. Please also commit fresh copies of all relevant WDLs, indicate their versioned provenance, and, where appropriate and possible, link to it in a corresponding PR comment.

I will organize the end-to-end evaluation in a megaWDL and then do a round of cleanup of the subworkflows after an initial manual run (with the goal being to show that cleanup does not affect performance). I expect running this evaluation to be manual for the near future, but we can think about CI testing later if it makes sense.

We'll continue to work with hg38 chr1:100-110Mbp to start.

Once this settles (hopefully within a week or two), we'll be better able to see where @rlorigro can slot in Hapestry methods and demonstrate improvement. If it makes sense, we can expand coverage of the pipelines upstream to intra/intersample integration and downstream to SR genotyping/phasing/imputation.

[ ] @rlorigro can run an end-to-end evaluation on his own, understands the inputs/outputs, and feels that he can either use the evaluation to inform Hapestry development or suggest improvements to the evaluation itself.

#20 was just merged and gets us most of the way there. Some remaining TODOs:

[x] Add other non-vcfdist evaluations. Perhaps do these all in one subworkflow per sample and VCF stage? EDIT: Added my naive overlap check in #22.
[x] Add a step to summarize all evaluations over all samples for each VCF stage. EDIT: Added in #35. The evaluation summary can perhaps be expanded later.

But even before adding these, note that the drop in recall in the last VCF produced by PanGeniePanelCreation (e.g., for SVs, from 82% in VcfdistEvaluationShapeit4 to ~60% in VcfdistEvaluationPanel) is a good enough indicator of remaining overlap issues to guide development for now. Again, this drop results solely from the PanGenie script removing an entire allele if it is found to overlap in any one sample. In this case, the 6 additional FNs (note one locus with multiple removed alleles) are at:

Two overlapping variants at same haplotype at chr1:100528654, set allele to missing.
Two overlapping variants at same haplotype at chr1:104688481, set allele to missing.
Two overlapping variants at same haplotype at chr1:105788297, set allele to missing.
Two overlapping variants at same haplotype at chr1:107555333, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.

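It is easy to see how this problem gets worse as we add many more samples: each additional sample is another chance for a given allele to overlap somewhere and be removed for everyone.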
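
To tally these messages per locus, something like the following sketch may help. The regular expression is written against the message wording quoted above; `count_overlap_loci` is a hypothetical helper, and where the log text comes from (e.g., a task's stderr) is left to the caller.

```python
import re
from collections import Counter

# Pattern for the message format quoted above.
OVERLAP_RE = re.compile(
    r"Two overlapping variants at same haplotype at "
    r"(?P<chrom>[^:\s]+):(?P<pos>\d+), set allele to missing\."
)


def count_overlap_loci(log_text: str) -> Counter:
    """Count overlap messages per (chrom, pos) locus in a log."""
    return Counter(
        (m.group("chrom"), int(m.group("pos")))
        for m in OVERLAP_RE.finditer(log_text)
    )


log = """\
Two overlapping variants at same haplotype at chr1:100528654, set allele to missing.
Two overlapping variants at same haplotype at chr1:104688481, set allele to missing.
Two overlapping variants at same haplotype at chr1:105788297, set allele to missing.
Two overlapping variants at same haplotype at chr1:107555333, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
"""
counts = count_overlap_loci(log)
print(sum(counts.values()), len(counts))  # 6 5
print(counts.most_common(1))              # [(('chr1', 108689608), 2)]
```

On the six lines above this gives six messages at five distinct loci, with chr1:108689608 appearing twice.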

In any case, we should already have what we need for some simple experiments, e.g.:

[ ] @samuelklee can modify PanGenie panel creation so that genotypes are set to missing, rather than whole alleles being dropped. (Hopefully this gets us most of the way there, but we should still try to get a self-consistent output from the phasing pipeline.)
[ ] @fabio-cunial can insert a more sophisticated cleanup step after HiPhase and/or Shapeit4.
[ ] @hangsuUNC can experiment with phase blocks (#14 and #15).
[x] @hangsuUNC can insert a cleanup step before HiPhase (#11, although it might be better to wait until inconsistency metrics have been added for this one).
[ ] @rlorigro can sub in a Hapestry chr1:100-110Mbp VCF for the current kanpig intra + truvari inter input.
[ ] @fabio-cunial can likewise sub in a kanpig intra + kanpig inter input.
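
To make the difference between dropping whole alleles and setting genotypes to missing concrete, here is a toy sketch. It does not reproduce the PanGenie script: the data layout (`Genotypes`), the `overlaps` map, and all three function names are hypothetical, and calls are simplified to one haploid value per allele and sample.

```python
from typing import Optional

# allele id -> sample -> haploid call (1 = carries allele, 0 = ref, None = missing)
Genotypes = dict[str, dict[str, Optional[int]]]


def drop_whole_alleles(gts: Genotypes, overlaps: dict[str, set[str]]) -> Genotypes:
    """Remove an allele from the panel if it overlaps in any sample."""
    return {a: dict(calls) for a, calls in gts.items() if not overlaps.get(a)}


def mask_genotypes(gts: Genotypes, overlaps: dict[str, set[str]]) -> Genotypes:
    """Keep every allele; set the call to missing only where it overlaps."""
    return {
        a: {s: (None if s in overlaps.get(a, set()) else c) for s, c in calls.items()}
        for a, calls in gts.items()
    }


def carried(gts: Genotypes) -> int:
    """Number of (allele, sample) calls that still carry the alt allele."""
    return sum(c == 1 for calls in gts.values() for c in calls.values())


# Toy panel: sv1 overlaps another variant in sample s1 only.
panel = {
    "sv1": {"s1": 1, "s2": 1, "s3": 0},
    "sv2": {"s1": 0, "s2": 1, "s3": 1},
}
overlaps = {"sv1": {"s1"}}

print(carried(panel))                                 # 4
print(carried(drop_whole_alleles(panel, overlaps)))   # 2
print(carried(mask_genotypes(panel, overlaps)))       # 3
```

In this toy case, dropping sv1 entirely also loses the clean call in s2, while masking loses only the one affected call in s1.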

Automated evaluation of current end-to-end pipeline. #1