broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.72k stars 594 forks source link

Concordance tool for somatic variant calling #7669

Open amithasandur opened 2 years ago

amithasandur commented 2 years ago

This ticket is for tracking the progress of the concordance project described here which is in collaboration with @davidbenjamin from DSP. The output of GATK Mutect2 is in VCF format, and there is an existing GATK tool to compare the output VCF to a truth VCF containing real mutations. David is working on improving Mutect3, as part of that there is a need for a tool specific to somatic variant calling that can diagnose patterns of good and bad performance. To get started, he provided resources of example VCFs of filtered calls from Mutect2 and a truth vcf. I am working on creating a Python script to count the number of false positives/negatives by comparing the mutations in both the VCFs using cyvcf2 library. After that the next step is to stratify these values by allele fraction and other filters applied by Mutect2.

cc: @fleharty @brigranger

amithasandur commented 2 years ago

Installed cyvcf2, read output vcf and truth vcf, parsed output vcf variants into a list of filters using set(variant.FILTER.split(";")).

@davidbenjamin I noticed that while parsing the truth vcf and extracting variant.FILTER it returns None in each line, on looking into the cyvcf2 documentation it mentions that a value of PASS or ‘.’ in the VCF will give None for this function. However in our email discussion you mentioned that in the output vcf (from Mutect2 pipeline) if the variant is unfiltered then its value is None. Does the None value in output & truth vcf mean the same in both cases?

davidbenjamin commented 2 years ago

@amithasandur The Mutect2 pipeline output and the truth VCFs are both filtered, so whenever variant.FILTER is None it means in both VCFs that the variant is PASS.

In intermediate stages of the pipeline -- after Mutect2 but before FilterMutectCalls -- there are VCFs with . in the FILTER column, but you may assume that your tool never needs to deal with that case.

DarioS commented 2 years ago

Out of curiosity, how did you define the truth set of somatic SNVs in this project?

amithasandur commented 2 years ago

Met with David today and we reviewed the code I had written for parsing truth and query vcf and counting true positives, true negatives, false positives and false negatives. Next, I will work on stratifying these counts by properties such as the filters applied by Mutect2 and allele fraction.

Note to self- here are my action items:

cc: @fleharty @brigranger

davidbenjamin commented 2 years ago

@DarioS For now we're just testing on artificial "truth" VCFs.