Closed ghost closed 4 years ago
Hi, that certainly sounds like something we want to look into, but there's not enough information to go on right now. If you can please share the pertinent lines of the input gVCF file(s) and the output pVCF, we can study them further.
I checked and found one of the sample that is I believe causing the CT call:
As you see, no T anyway (it's the last C starting from the right)
zgrep -w -C5 "17434065" H4A4.g.vcf.gz
chromosome_1 17434058 . T <*> 0 . END=17434058 GT:GQ:MIN_DP:PL 0/0:50:63:0,189,1889
chromosome_1 17434059 . G T,<*> 25 PASS . GT:GQ:DP:AD:VAF:PL 0/1:25:63:37,26,0:0.412698,0:25,0,77,990,990,990
chromosome_1 17434060 . A <*> 0 . END=17434062 GT:GQ:MIN_DP:PL 0/0:50:64:0,165,1889
chromosome_1 17434063 . G C,<*> 26.2 PASS . GT:GQ:DP:AD:VAF:PL 0/1:26:64:38,25,0:0.390625,0:26,0,99,990,990,990
chromosome_1 17434064 . A <*> 0 . END=17434064 GT:GQ:MIN_DP:PL 0/0:50:64:0,192,1919
chromosome_1 17434065 . C T,<*> 29.1 PASS . GT:GQ:DP:AD:VAF:PL 0/1:29:64:38,26,0:0.40625,0:29,0,99,990,990,990
chromosome_1 17434066 . C <*> 0 . END=17434068 GT:GQ:MIN_DP:PL 0/0:50:62:0,186,1859
chromosome_1 17434069 . G A,<*> 27.3 PASS . GT:GQ:DP:AD:VAF:PL 0/1:27:63:38,25,0:0.396825,0:27,0,99,990,990,990
chromosome_1 17434070 . T <*> 0 . END=17434070 GT:GQ:MIN_DP:PL 0/0:50:63:0,189,1889
chromosome_1 17434071 . C A,<*> 28.3 PASS . GT:GQ:DP:AD:VAF:PL 0/1:28:59:36,23,0:0.38983,0:28,0,99,990,990,990
chromosome_1 17434072 . C <*> 0 . END=17434073 GT:GQ:MIN_DP:PL 0/0:50:61:0,183,1829
And here in the Nexus pVCF
zgrep -C 4 "17434065" ALL.nexus.vcf.gz
chromosome_1 17434049 chromosome_1_17434049_T_G T G 29 . AF=0.041667;AQ=29 GT:DP:AD:GQ:PL:RNC 0/0:387:247,140:28:0,27,61:.. ./.:68:51,17:18:0,18,55:II 0/0:73:42,31:20:0,19,62:.. 0/0:103:70,33:24:0,23,68:.. ./.:84:55,29:12:0,11,51:II 0/0:93:52,41:37:0,37,49:.. 0/0:66:38,28:31:0,30,54:.. ./.:236:137,99:15:0,15,54:II 0/1:62:36,25:29:29,0,99:.. 0/0:335:191,144:20:0,20,58:.. 0/0:361:231,129:42:0,42,65:.. 0/0:303:192,111:43:0,43,69:..
chromosome_1 17434057 chromosome_1_17434057_G_T G T 25 . AF=0.041667;AQ=25 GT:DP:AD:GQ:PL:RNC 0/0:238:238,0:50:0,210,2819:.. 0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:.. 0/0:67:67,0:50:0,123,1949:.. 0/0:48:48,0:50:0,129,1529:.. 0/0:54:54,0:50:0,165,1649:.. 0/0:35:35,0:50:0,108,1079:.. 0/0:137:137,0:50:0,300,2999:.. 0/1:59:33,26:26:25,0,71:.. 0/0:183:183,0:50:0,300,2999:.. 0/0:215:215,0:50:0,300,2999:.. 0/0:180:180,0:50:0,240,2879:..
chromosome_1 17434059 chromosome_1_17434059_G_T G T 25 . AF=0.041667;AQ=25 GT:DP:AD:GQ:PL:RNC 0/0:238:238,0:50:0,210,2819:.. 0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:.. 0/0:67:67,0:50:0,123,1949:.. 0/0:48:48,0:50:0,129,1529:.. 0/0:54:54,0:50:0,165,1649:.. 0/0:35:35,0:50:0,108,1079:.. 0/0:137:137,0:50:0,300,2999:.. 0/1:63:37,26:25:25,0,77:.. 0/0:183:183,0:50:0,300,2999:.. 0/0:215:215,0:50:0,300,2999:.. 0/0:180:180,0:50:0,240,2879:..
chromosome_1 17434063 chromosome_1_17434063_G_C G C 26 . AF=0.041667;AQ=26 GT:DP:AD:GQ:PL:RNC 0/0:238:238,0:50:0,210,2819:.. 0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:.. 0/0:67:67,0:50:0,123,1949:.. 0/0:48:48,0:50:0,129,1529:.. 0/0:54:54,0:50:0,165,1649:.. 0/0:35:35,0:50:0,108,1079:.. 0/0:137:137,0:50:0,300,2999:.. 0/1:64:38,25:26:26,0,99:.. 0/0:183:183,0:50:0,300,2999:.. 0/0:215:215,0:50:0,300,2999:.. 0/0:180:180,0:50:0,240,2879:..
chromosome_1 17434065 chromosome_1_17434065_C_T C T 29 . AF=0.041667;AQ=29 GT:DP:AD:GQ:PL:RNC 0/0:238:238,0:50:0,210,2819:.. 0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:.. 0/0:67:67,0:50:0,123,1949:.. 0/0:48:48,0:50:0,129,1529:.. 0/0:54:54,0:50:0,165,1649:.. 0/0:35:35,0:50:0,108,1079:.. 0/0:137:137,0:50:0,300,2999:.. 0/1:64:38,26:29:29,0,99:.. 0/0:183:183,0:50:0,300,2999:.. 0/0:215:215,0:50:0,300,2999:.. 0/0:180:180,0:50:0,240,2879:..
chromosome_1 17434069 chromosome_1_17434069_G_A G A 27 . AF=0.041667;AQ=27 GT:DP:AD:GQ:PL:RNC 0/0:238:238,0:50:0,210,2819:.. 0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:.. 0/0:67:67,0:50:0,123,1949:.. 0/0:48:48,0:50:0,129,1529:.. 0/0:54:54,0:50:0,165,1649:.. 0/0:35:35,0:50:0,108,1079:.. 0/0:137:137,0:50:0,300,2999:.. 0/1:63:38,25:27:27,0,99:.. 0/0:183:183,0:50:0,300,2999:.. 0/0:215:215,0:50:0,300,2999:.. 0/0:180:180,0:50:0,240,2879:..
chromosome_1 17434071 chromosome_1_17434071_C_A C A 28 . AF=0.041667;AQ=28 GT:DP:AD:GQ:PL:RNC 0/0:238:238,0:50:0,210,2819:.. 0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:.. 0/0:67:67,0:50:0,123,1949:.. 0/0:48:48,0:50:0,129,1529:.. 0/0:54:54,0:50:0,165,1649:.. 0/0:35:35,0:50:0,108,1079:.. 0/0:137:137,0:50:0,300,2999:.. 0/1:59:36,23:28:28,0,99:.. 0/0:183:183,0:50:0,300,2999:.. 0/0:215:215,0:50:0,300,2999:.. 0/0:180:180,0:50:0,240,2879:..
chromosome_1 17434074 chromosome_1_17434074_A_C A C 29 . AF=0.041667;AQ=29 GT:DP:AD:GQ:PL:RNC 0/0:238:238,0:50:0,210,2819:.. 0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:.. 0/0:67:67,0:50:0,123,1949:.. 0/0:48:48,0:50:0,129,1529:.. 0/0:54:54,0:50:0,165,1649:.. 0/0:35:35,0:50:0,108,1079:.. 0/0:137:137,0:50:0,300,2999:.. 0/1:63:40,23:29:29,0,80:.. 0/0:183:183,0:50:0,300,2999:.. 0/0:215:215,0:50:0,300,2999:.. 0/0:180:180,0:50:0,240,2879:..
chromosome_1 17434075 chromosome_1_17434075_T_C T C 24 . AF=0.041667;AQ=24 GT:DP:AD:GQ:PL:RNC 0/0:238:238,0:50:0,210,2819:.. 0/0:49:49,0:50:0,153,1529:..0/0:41:41,0:50:0,108,1319:.. 0/0:67:67,0:50:0,123,1949:.. 0/0:48:48,0:50:0,129,1529:.. 0/0:54:54,0:50:0,165,1649:.. 0/0:35:35,0:50:0,108,1079:.. 0/0:137:137,0:50:0,300,2999:.. 0/1:63:40,23:25:24,0,75:.. 0/0:183:183,0:50:0,300,2999:.. 0/0:215:215,0:50:0,300,2999:.. 0/0:180:180,0:50:0,240,2879:..
I notice it gets a quite "good" QUAL score, so it seems GLnexus doesn't see it is a bogus variant (I don't know if GLnexus should be able to ).
Are we talking about position 17434065 in H4A4? The gVCF appears to inform GLnexus that there are 26 reads (out of 64 total) supporting the T alternate allele at that position (AD=38,26
).
In the alignment visualization, I'm still not quite sure which position we're looking at, but on the right side it doesn't look like there are quite 64 reads piled up. Is it possible there are more reads not being displayed, e.g. scrolled off screen or something? Or do the red boxes down the middle-ish of the image indicate soft-clipping?
Hello, yes that's exactly that position, on the image it's the second base starting from the right edge.
It's indeed a crop that you see there are more reads but still less than 64. The read boxes are indeed clipping. The base call is so completely off that I spent yesterday evening checking if I did not simply erroneously name a file somewhere in my pipeline but it doesn't seem so.
I opened the same issue on DeepVariant github because it seems all this comes from the variant caller, originally. I found consistent patterns of erroneous calls around regions displaying read clipping. I am wondering if this is not DeepVariant telling us those regions are misassembled (just an hypothesis).
Not having seen the data personally I can't say for sure, but the probable explanation is that DeepVariant (like GATK HaplotypeCaller before it) locally realigns reads to candidate haplotypes -- and so, it ends up actually looking at different alignments/pileups than we see when we open the BAM/CRAM file. See pp. 13-15 of the HaplotypeCaller manuscript for some explanation of how this works. That process can lead to alignment & analysis of portions of reads that had been soft-clipped by the global aligner, which is great but certainly can be confusing when you try to reconcile the results with the original BAM/CRAM alignments.
Anyway I think this doesn't appear to be a GLnexus issue, but do reopen if otherwise 😎
Hello, I performed Joint calling from DeepVariant output, so far so good. But I noticed a line where one sample is supposed to be 0/1 for a C:T. Eyeballing the bam file, however ... There is simply no T (it's the column just adjacent to the left T). I checked in other samples and in some of them there are a few T where yo would expect for a SNP call but their original gvcf from DeepVariant classifies them as 0/0 (rightly so on my opinion there are very few of them).
So not only has GLnexus decided that there was a real variant there, but it assigned it to the samples where there is no C:T variation at all.
Do you have any idea of why this happens?