Support stratified variant comparison

TimD1 / vcfdist

vcfdist: Accurately benchmarking phased variant calls

GNU General Public License v3.0


Support stratified variant comparison #4

Closed: jdidion closed this issue 1 year ago

jdidion commented 1 year ago

GIAB provides stratification bed files and hap.py gives the option to produce stratified results based on these files. It's very useful to be able to look at benchmarking performance by type of region. It would be great if vcfdist could support this option as well.

TimD1 commented 1 year ago

I agree.

vcfdist currently outputs detailed per-variant, per-cluster, and per-phaseblock results in a custom CSV format.

I believe the best approach to supporting stratification would be to instead report results in hap.py's Intermediate VCF File format (Supplementary Note 3 of "Best practices for benchmarking..."). That way, vcfdist could be used as a comparison engine within hap.py, just like vcfeval or xcmp, and hap.py's existing stratification step would apply to its output. This is what I plan to work on next; it shouldn't take too long.

jdidion commented 1 year ago

Sounds like a great idea.

TimD1 commented 1 year ago

The only downside is that hap.py most likely can't deal with partial positive variants. I may need to consider them false positives for interoperability. The majority of vcfdist's improvement came from standardizing the variant representation and enforcing phasing, so this shouldn't impact results too much, although it's certainly not ideal.

TimD1 commented 1 year ago

In the meantime, one option would be to run bedtools intersect on your high-confidence BED and each stratification BED, then run vcfdist once per intersected file, since vcfdist currently accepts only a single BED file for region selection.
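
For anyone who wants to see what that intersection step computes, here is a minimal sketch in plain Python. It is not bedtools and not part of vcfdist: the function names (`read_bed`, `intersect`, `to_bed_lines`) and the toy coordinates are made up for illustration. It assumes standard BED coordinates (0-based, half-open) and merges overlapping intervals within each file first, so the output is the covered region rather than bedtools' default per-pair records.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]          # BED-style: 0-based, half-open [start, end)
Regions = Dict[str, List[Interval]]


def read_bed(lines: Iterable[str]) -> Regions:
    """Parse BED lines into sorted, merged intervals per contig."""
    raw: Regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and header lines
        chrom, start, end = line.split()[:3]
        raw.setdefault(chrom, []).append((int(start), int(end)))

    merged: Regions = {}
    for chrom, ivs in raw.items():
        out: List[Interval] = []
        for s, e in sorted(ivs):
            if out and s <= out[-1][1]:      # overlaps or touches previous
                out[-1] = (out[-1][0], max(out[-1][1], e))
            else:
                out.append((s, e))
        merged[chrom] = out
    return merged


def intersect(a: Regions, b: Regions) -> Regions:
    """Return the regions covered by both a and b (two-pointer sweep)."""
    result: Regions = {}
    for chrom in sorted(a.keys() & b.keys()):
        xs, ys = a[chrom], b[chrom]
        i = j = 0
        hits: List[Interval] = []
        while i < len(xs) and j < len(ys):
            start = max(xs[i][0], ys[j][0])
            end = min(xs[i][1], ys[j][1])
            if start < end:
                hits.append((start, end))
            # advance whichever interval finishes first
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
        if hits:
            result[chrom] = hits
    return result


def to_bed_lines(regions: Regions) -> List[str]:
    """Format regions back into tab-separated BED lines."""
    return [f"{c}\t{s}\t{e}" for c in sorted(regions) for s, e in regions[c]]


# Toy data standing in for a high-confidence BED and one stratification BED.
high_conf = read_bed(["chr1\t100\t500", "chr1\t800\t1000", "chr2\t0\t50"])
stratum = read_bed(["chr1\t400\t900", "chr3\t0\t10"])
print(intersect(high_conf, stratum))  # {'chr1': [(400, 500), (800, 900)]}
```

In practice you would repeat this (or the equivalent bedtools intersect call) for each stratification file, write each result to its own BED, and pass that file to vcfdist as the region file for a separate run.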