A few minor (but quality-of-life) improvements, some questions too

GuillaumeHolley commented 7 months ago

Hi,

I started to use vcfdist recently to evaluate some small variant calls from dual assemblies I have generated and it has been going well so far, thanks for the great work :) I have a few questions and also some minor suggestions if you're open to it.

Questions

It is unclear to me if the flip error rate and the switch error rate can be used to actually compare two query VCF files? My understanding is that those metrics are related to the super clusters which might or might not be different from one query VCF to another?
Could you point out to some documentation about the GA4GH VCF output for summary.vcf? I'm not sure how to interpret duplicated variants in this output (same positions and alleles but different genotypes?). Seems like homozygous genotypes are split into two heterozygous variants, is it to give partial credit?
I would like to stratify my "false" variants which got some credit by edit distance to baseline variants. Is this possible? Right now, seems that the best I could do is stratification by benchmark score (FORMAT/BC).

Minor improvements?

Would it be possible to output the Precision, Recall and F1 for combined SNPs and indels? Right now, I am doing it manually from the output but it is a little tedious.
In the summary.vcf file, the field FORMAT/BC is said to be of type string in the VCF header which I thought was a mistake at first until I saw the field could be .. It seems the only reason BC would be equal to . is when evaluating a query variant which is a FN. In such case, reporting 0.00000 would be fine I think? Point is, having FORMAT/BC as a string rather than a float prevents any kind of filtering based on this field with bcftools
Very minor: in the header of summary.vcf, replace variant Quality with Variant Quality for the field GQ

Thanks, Guillaume

TimD1 commented 7 months ago

Thanks for bringing these up to me, happy to answer questions and make changes.

1. flip and switch error rates for variants vs superclusters flip error comparison: not ideal Because all variants within a supercluster are assumed to be correctly phased relative to one another, phasing is assessed only at the supercluster level. This is not ideal for comparing total flip errors, because a single flipped supercluster could contain either 1 or 100 variants. I could also report the total number of variants within flipped superclusters, if that metric would be useful for you. That would give you an upper bound on the total variant flip errors (counting supercluster flip errors, as currently implemented, is a lower bound because each supercluster contains 1+ variants). You are correct that you cannot make assumptions about two query VCFs having similar superclustering. In practice, I have found that the majority of superclusters consist of a single variant. Eventually, I hope to relax the requirement for local variant phasing, which will enable better evaluation of single variant flip errors.

switch error comparison: very good On the other hand, the switch error comparison is very good. The only caveat is that switches are only allowed to occur between superclusters, not within (if one does, then the switch will still be discovered after that supercluster, and that supercluster may be reported as a flip error). As noted in the recent pre-print, evaluating superclusters instead of individual variants significantly reduces false positives compared to WhatsHap.

2. summary.vcf documentation and splitting homozygous variants
The summary.vcf documentation was recently added to the wiki, which is currently a work in progress. You are correct that vcfdist splits homozygous variants. This is necessary for credit assignment when complex variants or genotyping errors are involved. I'd be happy to discuss this further, and if there's a way around needing to do this.

3. stratify by distance from ground truth At the moment, BC is the best you can do (percent improvement from baseline). I would consider adding another tag to store the remaining distance from ground truth. Please note that BC is evaluated per sync group, not per variant. In other words, if there is a complex variant where variant representation is somewhat arbitrary, these variants are grouped and considered together as a unit.

4. add output metrics for all variants There's two ways I could do this: report output summary metrics for small (SNP, INDEL) variants, or for all (SNP, INDEL, SV) variants. I'm leaning towards the latter, and then you can exclude SVs from your analysis if you want using --sv-threshold or --largest-variant.

5. FORMAT/BC is type string You're right, I can change this.

6. Summary VCF capitalization My capitalization in the VCF header is intended to show the logical mapping from the full description to the two-character abbreviation, so I only capitalized the letters present in the abbreviated format tags.

TimD1 commented 7 months ago

Closing, since I've made the discussed changes in the new v2.3.3 release.

I would also like to note that the --phasing-threshold parameter may be of interest to you. If you set it to 1.0, then only superclusters where all variants within it have the opposite phasing will be counted.

GuillaumeHolley commented 7 months ago

Hi @TimD1,

Thank you for your detailed explanation and the new release, this is very helpful! Regarding point 1, the output log of vcfdist has Supercluster switch error rate but switches occur in-between superclusters and not within right? I also would like to know how this number is computed: I tried to use the output files to compute it but couldn't find the same number as the one reported in my terminal.

Thanks, Guillaume

TimD1 commented 7 months ago

That is correct, switches occur between superclusters. As a result, there is the potential for a switch error to occur after every supercluster. The error rate is computed by dividing the total number of switch errors by the total number of superclusters (in the code here).

I recently added more detail on vcfdist's phasing algorithm to the wiki here.

GuillaumeHolley commented 7 months ago

The error rate is computed by dividing the total number of switch errors by the total number of superclusters

I don't know how I messed that up, I'm almost certain this is the first thing I did :D In any case, this is all clear now, thank you for spending time on this!

TimD1 / vcfdist

A few minor (but quality-of-life) improvements, some questions too #18