dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0
149 stars 38 forks source link

GLnexus assigns a non existant SNP #212

Closed ghost closed 4 years ago

ghost commented 4 years ago

Hello, I performed Joint calling from DeepVariant output, so far so good. But I noticed a line where one sample is supposed to be 0/1 for a C:T. Eyeballing the bam file, however ... imT There is simply no T (it's the column just adjacent to the left T). I checked in other samples and in some of them there are a few T where yo would expect for a SNP call but their original gvcf from DeepVariant classifies them as 0/0 (rightly so on my opinion there are very few of them).

So not only has GLnexus decided that there was a real variant there, but it assigned it to the samples where there is no C:T variation at all.

Do you have any idea of why this happens?

mlin commented 4 years ago

Hi, that certainly sounds like something we want to look into, but there's not enough information to go on right now. If you can please share the pertinent lines of the input gVCF file(s) and the output pVCF, we can study them further.

ghost commented 4 years ago

I checked and found one of the sample that is I believe causing the CT call:

CT As you see, no T anyway (it's the last C starting from the right)

zgrep -w -C5 "17434065" H4A4.g.vcf.gz 
chromosome_1    17434058    .   T   <*> 0   .   END=17434058    GT:GQ:MIN_DP:PL 0/0:50:63:0,189,1889
chromosome_1    17434059    .   G   T,<*>   25  PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:25:63:37,26,0:0.412698,0:25,0,77,990,990,990
chromosome_1    17434060    .   A   <*> 0   .   END=17434062    GT:GQ:MIN_DP:PL 0/0:50:64:0,165,1889
chromosome_1    17434063    .   G   C,<*>   26.2    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:26:64:38,25,0:0.390625,0:26,0,99,990,990,990
chromosome_1    17434064    .   A   <*> 0   .   END=17434064    GT:GQ:MIN_DP:PL 0/0:50:64:0,192,1919
chromosome_1    17434065    .   C   T,<*>   29.1    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:29:64:38,26,0:0.40625,0:29,0,99,990,990,990
chromosome_1    17434066    .   C   <*> 0   .   END=17434068    GT:GQ:MIN_DP:PL 0/0:50:62:0,186,1859
chromosome_1    17434069    .   G   A,<*>   27.3    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:27:63:38,25,0:0.396825,0:27,0,99,990,990,990
chromosome_1    17434070    .   T   <*> 0   .   END=17434070    GT:GQ:MIN_DP:PL 0/0:50:63:0,189,1889
chromosome_1    17434071    .   C   A,<*>   28.3    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:28:59:36,23,0:0.38983,0:28,0,99,990,990,990
chromosome_1    17434072    .   C   <*> 0   .   END=17434073    GT:GQ:MIN_DP:PL 0/0:50:61:0,183,1829

And here in the Nexus pVCF

zgrep -C 4 "17434065" ALL.nexus.vcf.gz
chromosome_1    17434049    chromosome_1_17434049_T_G   T   G   29  .   AF=0.041667;AQ=29   GT:DP:AD:GQ:PL:RNC  0/0:387:247,140:28:0,27,61:..   ./.:68:51,17:18:0,18,55:II  0/0:73:42,31:20:0,19,62:..  0/0:103:70,33:24:0,23,68:.. ./.:84:55,29:12:0,11,51:II  0/0:93:52,41:37:0,37,49:..  0/0:66:38,28:31:0,30,54:..  ./.:236:137,99:15:0,15,54:II    0/1:62:36,25:29:29,0,99:..  0/0:335:191,144:20:0,20,58:..   0/0:361:231,129:42:0,42,65:..   0/0:303:192,111:43:0,43,69:..
chromosome_1    17434057    chromosome_1_17434057_G_T   G   T   25  .   AF=0.041667;AQ=25   GT:DP:AD:GQ:PL:RNC  0/0:238:238,0:50:0,210,2819:..  0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:..    0/0:67:67,0:50:0,123,1949:..    0/0:48:48,0:50:0,129,1529:..    0/0:54:54,0:50:0,165,1649:..    0/0:35:35,0:50:0,108,1079:..    0/0:137:137,0:50:0,300,2999:..  0/1:59:33,26:26:25,0,71:..  0/0:183:183,0:50:0,300,2999:..  0/0:215:215,0:50:0,300,2999:..  0/0:180:180,0:50:0,240,2879:..
chromosome_1    17434059    chromosome_1_17434059_G_T   G   T   25  .   AF=0.041667;AQ=25   GT:DP:AD:GQ:PL:RNC  0/0:238:238,0:50:0,210,2819:..  0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:..    0/0:67:67,0:50:0,123,1949:..    0/0:48:48,0:50:0,129,1529:..    0/0:54:54,0:50:0,165,1649:..    0/0:35:35,0:50:0,108,1079:..    0/0:137:137,0:50:0,300,2999:..  0/1:63:37,26:25:25,0,77:..  0/0:183:183,0:50:0,300,2999:..  0/0:215:215,0:50:0,300,2999:..  0/0:180:180,0:50:0,240,2879:..
chromosome_1    17434063    chromosome_1_17434063_G_C   G   C   26  .   AF=0.041667;AQ=26   GT:DP:AD:GQ:PL:RNC  0/0:238:238,0:50:0,210,2819:..  0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:..    0/0:67:67,0:50:0,123,1949:..    0/0:48:48,0:50:0,129,1529:..    0/0:54:54,0:50:0,165,1649:..    0/0:35:35,0:50:0,108,1079:..    0/0:137:137,0:50:0,300,2999:..  0/1:64:38,25:26:26,0,99:..  0/0:183:183,0:50:0,300,2999:..  0/0:215:215,0:50:0,300,2999:..  0/0:180:180,0:50:0,240,2879:..
chromosome_1    17434065    chromosome_1_17434065_C_T   C   T   29  .   AF=0.041667;AQ=29   GT:DP:AD:GQ:PL:RNC  0/0:238:238,0:50:0,210,2819:..  0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:..    0/0:67:67,0:50:0,123,1949:..    0/0:48:48,0:50:0,129,1529:..    0/0:54:54,0:50:0,165,1649:..    0/0:35:35,0:50:0,108,1079:..    0/0:137:137,0:50:0,300,2999:..  0/1:64:38,26:29:29,0,99:..  0/0:183:183,0:50:0,300,2999:..  0/0:215:215,0:50:0,300,2999:..  0/0:180:180,0:50:0,240,2879:..
chromosome_1    17434069    chromosome_1_17434069_G_A   G   A   27  .   AF=0.041667;AQ=27   GT:DP:AD:GQ:PL:RNC  0/0:238:238,0:50:0,210,2819:..  0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:..    0/0:67:67,0:50:0,123,1949:..    0/0:48:48,0:50:0,129,1529:..    0/0:54:54,0:50:0,165,1649:..    0/0:35:35,0:50:0,108,1079:..    0/0:137:137,0:50:0,300,2999:..  0/1:63:38,25:27:27,0,99:..  0/0:183:183,0:50:0,300,2999:..  0/0:215:215,0:50:0,300,2999:..  0/0:180:180,0:50:0,240,2879:..
chromosome_1    17434071    chromosome_1_17434071_C_A   C   A   28  .   AF=0.041667;AQ=28   GT:DP:AD:GQ:PL:RNC  0/0:238:238,0:50:0,210,2819:..  0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:..    0/0:67:67,0:50:0,123,1949:..    0/0:48:48,0:50:0,129,1529:..    0/0:54:54,0:50:0,165,1649:..    0/0:35:35,0:50:0,108,1079:..    0/0:137:137,0:50:0,300,2999:..  0/1:59:36,23:28:28,0,99:..  0/0:183:183,0:50:0,300,2999:..  0/0:215:215,0:50:0,300,2999:..  0/0:180:180,0:50:0,240,2879:..
chromosome_1    17434074    chromosome_1_17434074_A_C   A   C   29  .   AF=0.041667;AQ=29   GT:DP:AD:GQ:PL:RNC  0/0:238:238,0:50:0,210,2819:..  0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:..    0/0:67:67,0:50:0,123,1949:..    0/0:48:48,0:50:0,129,1529:..    0/0:54:54,0:50:0,165,1649:..    0/0:35:35,0:50:0,108,1079:..    0/0:137:137,0:50:0,300,2999:..  0/1:63:40,23:29:29,0,80:..  0/0:183:183,0:50:0,300,2999:..  0/0:215:215,0:50:0,300,2999:..  0/0:180:180,0:50:0,240,2879:..
chromosome_1    17434075    chromosome_1_17434075_T_C   T   C   24  .   AF=0.041667;AQ=24   GT:DP:AD:GQ:PL:RNC  0/0:238:238,0:50:0,210,2819:..  0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:..    0/0:67:67,0:50:0,123,1949:..    0/0:48:48,0:50:0,129,1529:..    0/0:54:54,0:50:0,165,1649:..    0/0:35:35,0:50:0,108,1079:..    0/0:137:137,0:50:0,300,2999:..  0/1:63:40,23:25:24,0,75:..  0/0:183:183,0:50:0,300,2999:..  0/0:215:215,0:50:0,300,2999:..  0/0:180:180,0:50:0,240,2879:..

I notice it gets a quite "good" QUAL score, so it seems GLnexus doesn't see it is a bogus variant (I don't know if GLnexus should be able to ).

mlin commented 4 years ago

Are we talking about position 17434065 in H4A4? The gVCF appears to inform GLnexus that there are 26 reads (out of 64 total) supporting the T alternate allele at that position (AD=38,26).

In the alignment visualization, I'm still not quite sure which position we're looking at, but on the right side it doesn't look like there are quite 64 reads piled up. Is it possible there are more reads not being displayed, e.g. scrolled off screen or something? Or do the red boxes down the middle-ish of the image indicate soft-clipping?

ghost commented 4 years ago

Hello, yes that's exactly that position, on the image it's the second base starting from the right edge.

It's indeed a crop that you see there are more reads but still less than 64. The read boxes are indeed clipping. The base call is so completely off that I spent yesterday evening checking if I did not simply erroneously name a file somewhere in my pipeline but it doesn't seem so.

I opened the same issue on DeepVariant github because it seems all this comes from the variant caller, originally. I found consistent patterns of erroneous calls around regions displaying read clipping. I am wondering if this is not DeepVariant telling us those regions are misassembled (just an hypothesis).

mlin commented 4 years ago

Not having seen the data personally I can't say for sure, but the probable explanation is that DeepVariant (like GATK HaplotypeCaller before it) locally realigns reads to candidate haplotypes -- and so, it ends up actually looking at different alignments/pileups than we see when we open the BAM/CRAM file. See pp. 13-15 of the HaplotypeCaller manuscript for some explanation of how this works. That process can lead to alignment & analysis of portions of reads that had been soft-clipped by the global aligner, which is great but certainly can be confusing when you try to reconcile the results with the original BAM/CRAM alignments.

Anyway I think this doesn't appear to be a GLnexus issue, but do reopen if otherwise 😎