dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0
142 stars 37 forks source link

Incorrect variant call #254

Closed marchoeppner closed 3 years ago

marchoeppner commented 3 years ago

We found a back-to-back call of two SNPs that we cannot explain as the BAM file suggests a deletion. IGV screenshot: https://www.dropbox.com/s/c0wfelxc1cca14b/igv_snapshot.png?dl=0

Happy to provide a BAM file, gVCFs etc. But maybe this is easy enough to explain; I just cannot figure out why the joint call for this locus comes out as follows:

chr19   15174241        rs1044006       T       C       66      .       AF=0.75;AQ=66   GT:DP:AD:GQ:PL:RNC      1/1:112:0,112:62:65,63,0:..     1/1:169:0,169:59:61,60,0:..     ./1:78:.,78:.:0,0,0:O.  1/1:208:0,208:62:66,63,0:..     1/1:94:0,94:57:58,60,0:..    0/1:196:96,99:53:52,0,68:..      1/1:156:0,156:58:59,60,0:..     0/1:202:105,97:53:53,0,63:..
chr19   15174242        chr19_15174242_G_C      G       C       58      .       AF=0.0625;AQ=58 GT:DP:AD:GQ:PL:RNC      0/0:112:112,0:50:0,300,2999:..  0/0:170:170,0:50:0,300,2999:..  0/1:177:77,100:56:58,0,60:..    0/0:207:207,0:50:0,300,2999:..  0/0:93:93,0:50:0,282,2819:..  0/0:198:198,0:50:0,300,2999:..  0/0:155:155,0:50:0,300,2999:..  0/0:204:204,0:50:0,300,2999:..

The corresponding gVCF block from the sample carrying the complex mutation:

chr19   15174058        .       G       <*>     0       .       END=15174239    GT:GQ:MIN_DP:PL 0/0:50:97:0,291,2909
chr19   15174240        .       CT      C,<*>   28.1    PASS    .       GT:GQ:DP:AD:VAF:PL      0/1:25:177:78,99,0:0.559322,0:20,0,42,990,990,990
chr19   15174241        .       T       C,<*>   21.3    PASS    .       GT:GQ:DP:AD:VAF:PL      0/1:8:78:0,78,0:1,0:13,0,21,990,990,990
chr19   15174242        .       G       C,<*>   58.5    PASS    .       GT:GQ:DP:AD:VAF:PL      0/1:56:177:77,100,0:0.564972,0:58,0,60,990,990,990
chr19   15174243        .       G       <*>     0       .       END=15174410    GT:GQ:MIN_DP:PL 0/0:50:177:0,300,2999

The gVCFs were called with DeepVariant 1.0 - and the per-sample VCFs from that same process show the variant correctly.

chr19   15174240        .       CT      C       28.1    PASS    .       GT:GQ:DP:AD:VAF:PL      0/1:25:177:78,99:0.559322:20,0,42
chr19   15174241        .       T       C       21.3    PASS    .       GT:GQ:DP:AD:VAF:PL      0/1:8:78:0,78:1:13,0,21
chr19   15174242        .       G       C       58.5    PASS    .       GT:GQ:DP:AD:VAF:PL      0/1:56:177:77,100:0.564972:58,0,60

GLNexus version: 1.3.1 (docker) DeepVariant version: 1.0 (docker)

GLNexus call:

/usr/local/bin/glnexus_cli \
                        --config DeepVariantWES \
                        --bed $bed \
                        $gvcfs | bcftools view - | bgzip -c > $merged_vcf
mlin commented 3 years ago

It looks to me like the CT>C deletion did not pass the default allele quality filter applied in the DeepVariantWES configuration. You can try running with the DeepVariant_unfiltered configuration to confirm if that's true. In a larger cohort, another sample might exhibit the allele more confidently and thereby "rescue" its report (for all samples) in the joint VCF.

If the deletion doesn't show up in the unfiltered mode, there are a couple other things we might look into, such as do we happen to be at the edge of one of the BED target regions, or that the haplotype in question (15174241 TGG>CG) can be written in a few different ways synonymously.

marchoeppner commented 3 years ago

Thanks for the pointer.

This is what happens with the suggested "unfiltered" config:

chr19   15174240        chr19_15174241_T_C;chr19_15174240_CT_C  CT      CC,C    66      .       AF=0.8125,0.0625;AQ=66,20       GT:DP:AD:GQ:PL:RNC      1/1:112:0,112,0:61:65,63,0,990,990,990:..       1/1:169:0,169,0:58:61,60,0,990,990,990:..       1/2:78:.,78,99:8:0,0,0,0,0,0:..       1/1:208:0,208,0:61:66,63,0,990,990,990:..       1/1:94:0,94,0:56:58,60,0,990,990,990:.. 0/1:196:96,99,0:53:52,0,68,990,990,990:..       1/1:156:0,156,0:57:59,60,0,990,990,990:..       0/1:202:105,97,0:53:53,0,63,990,990,990:..
chr19   15174242        chr19_15174242_G_C      G       C       58      .       AF=0.0625;AQ=58 GT:DP:AD:GQ:PL:RNC      0/0:112:112,0:50:0,300,2999:..  0/0:170:170,0:50:0,300,2999:..  0/1:177:77,100:56:58,0,60:..    0/0:207:207,0:50:0,300,2999:..  0/0:93:93,0:50:0,282,2819:..  0/0:198:198,0:50:0,300,2999:..  0/0:155:155,0:50:0,300,2999:..  0/0:204:204,0:50:0,300,2999:..

So the variant re-appears...which is good. It just seems odd that it would be filtered, since the signal is very clear :(

mlin commented 3 years ago

GLnexus' filtering is based on DeepVariant's GQ and PL scores shown in your gVCF excerpt, which in this case are not great, not terrible (phred score ~20). The DeepVariant_unfiltered mode may be more suitable for running with only a handful of samples as you illustrate, where the joint VCF file size and false positive rate aren't such pressing concerns. In a larger cohort, the cumulative evidence from seeing the allele in more samples would probably overcome the DeepVariantWES reporting threshold (even if each individual observation is marginal for whatever reason).

We can separately ask why DeepVariant assigned marginal quality scores when the IGV looks pretty good. That's usually difficult to answer in view of its complex model, but I do wonder if it has something to do with the way the haplotype can be expressed in a few different ways (as VCF allelic primitives).