broadinstitute / palantir-workflows

Utility workflows for the DSP hydro.gen team (formerly palantir)
BSD 3-Clause "New" or "Revised" License

Add pipeline for benchmarking SVs #128

Closed: rickymagner closed this issue 9 months ago

rickymagner commented 1 year ago

This issue serves as a discussion hub for development of a pipeline for benchmarking SVs. Development is currently happening on the rm_benchmark_svs branch. There you can find the following useful documents:

If you have comments, issues using the pipeline, etc., please add them to this thread. I will also post updates here during development.

samuelklee commented 1 year ago

Thanks again for the demo just now, @rickymagner! Things are looking great.

I've just shared the kage-lite-dev-1 Terra workspace with you. This currently hosts WDLs for running KAGE, PanGenie, and a leave-one-out evaluation. The evaluation calculates weighted genotype concordance of imputed short-read genotypes with the panel genotypes for the left-out sample (stratified by length and genomic context, although I also have some offline versions that stratify by AF).
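For readers unfamiliar with the metric, here is a minimal Python sketch of a stratified weighted genotype concordance. It is not the workspace's WDL or its actual implementation. It assumes that "weighted genotype concordance" means the unweighted mean of per-class concordances over the truth genotype classes 0/0, 0/1, and 1/1, and that genotypes are already normalized to those strings. The function names, the length bin edges, and the context labels are all hypothetical.

```python
from collections import defaultdict

# Assumed genotype classes; genotypes are expected to be pre-normalized strings.
GENOTYPE_CLASSES = ("0/0", "0/1", "1/1")


def weighted_genotype_concordance(pairs):
    """Mean of per-class concordance over truth genotype classes.

    pairs: iterable of (truth_gt, called_gt) tuples.
    Classes with no truth calls are skipped (an assumption of this sketch).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, called in pairs:
        total[truth] += 1
        if truth == called:
            correct[truth] += 1
    per_class = [correct[g] / total[g] for g in GENOTYPE_CLASSES if total[g] > 0]
    return sum(per_class) / len(per_class) if per_class else float("nan")


def length_bin(length, edges=(50, 500, 5000)):
    """Label a variant by absolute length; the bin edges are hypothetical."""
    abs_len = abs(length)
    lower = 0
    for edge in edges:
        if abs_len < edge:
            return f"[{lower},{edge})"
        lower = edge
    return f"[{lower},inf)"


def stratified_wgc(records):
    """records: iterable of (length, context, truth_gt, called_gt) tuples.

    Returns a dict mapping (length_bin, context) to the concordance
    computed within that stratum.
    """
    strata = defaultdict(list)
    for length, context, truth, called in records:
        strata[(length_bin(length), context)].append((truth, called))
    return {key: weighted_genotype_concordance(v) for key, v in strata.items()}
```

Averaging per class rather than per call keeps the abundant 0/0 calls from dominating the score, which is the usual motivation for a class-weighted concordance. Stratifying by AF instead would only require swapping the stratum key.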

The workspace bucket contains some panel VCFs that may be interesting to throw through your prototype:

And now some runs that contain genotyped VCFs from KAGE and PanGenie. For these, you'll find VCFs in corresponding shards for each left-out sample:

You'll also find corresponding plots in the call-CalculateMetrics* folders (these are the 1D plots I've been showing in recent meetings).

Here are the runs:

As we discussed, it would be great to see the panels compared against each other, as well as the various combinations of panel vs. case. Might also be interesting to check each against your current gold standard set as well (although I may need to actually run that particular sample to generate the case VCFs!)

Finally, here's a very rough first cut of the sort of benchmarking heatmap that might be useful: KAGE+GLIMPSE. I was planning on adding some annotations to each cell, but hovering tooltips would be even better!

samuelklee commented 1 year ago

@rickymagner just a heads up, I'm cleaning up the workspace and deleting a bunch of runs, including the two I linked above. Here are some more recent runs that might be useful for you:

HGSVC2 32 panel samples, 10 cases, chr1: 18e46b03-f34f-4e22-9a10-01e4ec6f6b8d
1kg 3202 panel samples, AF >= 0.001, 10 cases, chr1: 9d99a3c8-f3fe-4f29-8b66-0eccecb04770
1kg 127 panel samples, AF >= 0.02, 10 cases, chr1: ecaf79f2-9785-436e-a603-989864c54444
HGSVC2 32 panel samples, 1 HG00731 case, chr1-22+chrX: 721c121d-932f-4ee4-9f71-6ed1591b3f74

rickymagner commented 9 months ago

Completed in #159