Closed ghost closed 11 months ago
Reopening the issue, the error in the code was inconsequential and corrected in the first message.
pileup_guppy.vcf
and pileup_dorado.vcf
both show four variants identified but light colored (low variant QUAL?) can you show the records of the four variants in pileup_guppy.vcf
and pileup_dorado.vcf
?
Yes sure, guppy
zcat pileup_guppy.vcf.gz|grep "Chrom_3"|grep "9635425"
Chrom_3 9635425 . A . 15.49 RefCall P GT:GQ:DP:AD:AF:PL 0/0:15:147:62:0.4218:990
zcat pileup_guppy.vcf.gz|grep "Chrom_3"|grep "9635468"
Chrom_3 9635468 . G . 11.43 RefCall P GT:GQ:DP:AD:AF:PL 0/0:11:147:59:0.4014:990
zcat pileup_guppy.vcf.gz|grep "Chrom_3"|grep "9635470"
Chrom_3 9635470 . T . 14.74 RefCall P GT:GQ:DP:AD:AF:PL 0/0:14:148:61:0.4122:990
zcat pileup_guppy.vcf.gz|grep "Chrom_3"|grep "9635492"
Chrom_3 9635492 . G . 6.68 RefCall P GT:GQ:DP:AD:AF:PL 0/0:6:148:61:0.4122:990
dorado
zcat pileup_dorado.vcf.gz|grep "Chrom_3"|grep "9635425"
Chrom_3 9635425 . A . 13.38 RefCall P GT:GQ:DP:AD:AF:PL 0/0:13:147:62:0.4218:990
zcat pileup_dorado.vcf.gz|grep "Chrom_3"|grep "9635468"
Chrom_3 9635468 . G . 9.62 RefCall P GT:GQ:DP:AD:AF:PL 0/0:9:147:59:0.4014:990
zcat pileup_dorado.vcf.gz|grep "Chrom_3"|grep "9635470"
Chrom_3 9635470 . T . 4.35 RefCall P GT:GQ:DP:AD:AF:PL 0/0:4:148:61:0.4122:990
zcat pileup_dorado.vcf.gz|grep "Chrom_3"|grep "9635492"
Chrom_3 9635492 . G . 13.51 RefCall P GT:GQ:DP:AD:AF:PL 0/0:13:148:61:0.4122:990
I don't understand, it seems like the vcf agrees there is something but it doesn't call the base. EDIT: I am assuming it's why we have a light blue like this, honestly I have never seen that before and IGV doc is not very easy to read, especially regarding their color coding.
I have tried to rerun in pileup only, and it's the same. The vcf makes an entry but then rejects it.
zcat pileup.vcf.gz|grep "Chrom_3"|grep "9635425"
zcat pileup.vcf.gz|grep "Chrom_3"|grep "9635468"
zcat pileup.vcf.gz|grep "Chrom_3"|grep "9635470"
zcat pileup.vcf.gz|grep "Chrom_3"|grep "9635492"
Chrom_3 9635425 . A . 15.49 RefCall P GT:GQ:DP:AD:AF:PL 0/0:15:147:62:0.4218:990
Chrom_3 9635468 . G . 11.43 RefCall P GT:GQ:DP:AD:AF:PL 0/0:11:147:59:0.4014:990
Chrom_3 9635470 . T . 14.74 RefCall P GT:GQ:DP:AD:AF:PL 0/0:14:148:61:0.4122:990
Chrom_3 9635492 . G . 6.68 RefCall P GT:GQ:DP:AD:AF:PL 0/0:6:148:61:0.4122:990
I don't understand on what basis the calls are rejected. I could send you the data if you want. Why are the PL so high?
As I understand, QUAL values are not low enough to cause a rejection, correct?
Thanks
It seems that the pileup model has decided not to trust these variant candidates, can you do a samtools pileup to check the base qualities at these four sites?
Yes, here is the command
bcftools mpileup -f ref.fa -R region.bed -o test_pileup.vcf -O v H2B4.sorted.bam
./check.sh
Chrom_3 9635425 . A G,<*> 0 . DP=153;I16=36,30,49,38,3960,237600,5220,313200,3960,237600,5220,313200,1650,41250,2163,53919;QS=0.431373,0.568627,0;VDB=0.159833;SGB=-0.693147;RPBZ=0.0534239;MQBZ=0;MQSBZ=0;BQBZ=0;SCBZ=0.0254381;MQ0F=0 PL:DP 255,0,255,255,255,255:153
Chrom_3 9635468 . G A,C,<*> 0 . DP=149;I16=35,29,47,38,3840,230400,5100,306000,3840,230400,5100,306000,1600,40000,2125,53125;QS=0.42953,0.563758,0.00671141,0;VDB=0.0880046;SGB=-0.693147;RPBZ=-0.180272;MQBZ=0;MQSBZ=0;BQBZ=0;SCBZ=-0.343401;MQ0F=0 PL:DP 255,0,255,255,255,255,255,255,255,255:149
Chrom_3 9635470 . T C,<*> 0 . DP=150;I16=35,30,47,38,3900,234000,5100,306000,3900,234000,5100,306000,1600,40000,2125,53125;QS=0.433333,0.566667,0;VDB=0.0857425;SGB=-0.693147;RPBZ=-0.0284505;MQBZ=0;MQSBZ=0;BQBZ=0;SCBZ=-0.323955;MQ0F=0 PL:DP 255,0,255,255,255,255:150
Chrom_3 9635492 . G A,<*> 0 . DP=154;I16=36,29,49,40,3900,234000,5340,320400,3900,234000,5340,320400,1625,40625,2222,55484;QS=0.422078,0.577922,0;VDB=0.190081;SGB=-0.693147;RPBZ=-0.367732;MQBZ=0;MQSBZ=0;BQBZ=0;SCBZ=-0.381925;MQ0F=0 PL:DP 255,0,255,255,255,255:154
(the check.sh simply contains my grep pipes).
EDIT: what metrics exactly interest you here? Let me know if you need more in-depth analysis or something specific, I am not totally sure of "base qualities" (what do you think it covers? )
I saw that you were using bcftools mpileup
, could you please try samtools mpileup
Sorry, I am so used to using bcftools mpileup that I misread your question. Here it is
samtools mpileup -l region.bed -f ref.fa H2B4.sorted.bam > pileup.outfile
./check.sh
Chrom_3 9635425 A 153 gG..,,,g.gGgG,GGggG.g.gG.gGgG..,ggG,g.,..gGG,g,....gGGGg,G,.,,ggg.gG,..,GGg.G.GgG.,GgGgG.g,,,GGg,G.G,GGg.,GGGggGG....GG.GGGG,.,Gg.g,.gg,.,gGg,GG,.GgG,.Gg ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Chrom_3 9635468 G 153 aA..,,,a.aAa*,AAaa.-1C.a.aAAaAaA..,aaC,a.,..aAA,a,....aAAAa,A,.,,aaa.aA,..,AAa.A.AaA.,AaAaA.a,*,A*a,A.A,AAa.,AAAaaAAA...AA.AAAA,.,Aa.a,.aa,.,aAa,AA,.*aA,.Aa ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Chrom_3 9635470 T 154 cC..,,,c.cCc*,CCcc..c.cCCcCcC..,ccC,c.,..cCC,c,....cCCCc,C,.,,ccc.cC,..,CCc.C.CcC.,CcCcC.c,*,C*c,C.C,CCc.,CCCccCCC...CC.CCCC,.,Cc.c,.cc,.,cCc,CC,.*cC,.Cc^], ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Chrom_3 9635492 G 154 aA..,,,a.aAaA,AAaa..a.aAAaAaA..,aaA,a.,..aA.,a,....aAAAa,A,.,,aaa.aA,..,AAa.A.AaA.,AaAaA.a,a,AAa,A.A,AAa.,AAAaaAAA...AA.AAAA,.,Aa.a,.aa,.,aAa,AA,.AaA,.Aaa ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The base qualities don't look right.
I am sorry, can you elaborate? I am not so used to interpret this, in ASCII alphabet this is the separator value, which I guess is what is wrong here?
The base qualities are '\~' for all bases as shown in your pileup. '\~' is the largest value of a singed 8-bit integer and is not a sensible base quality in all sequencing technologies.
Ok, thanks, so if I follow, this points to a problem with basecalling from the nanopore cell data, right?
Please double check the base qualities of the reads covering the four variants of interest. I don't know the source of your data so I don't know exactly what caused the wrong base qualities.
Please double check the base qualities of the reads covering the four variants of interest. I don't know the source of your data so I don't know exactly what caused the wrong base qualities.
It's in-house sequencing; we are going to investigate. Thank you for your constant help throughout this. I have to say I am puzzled. I close the issue for now, will re-open if necessary. Thanks again
Hello, see the screenshot. I have named the track guppy, dorado, hifi, for the calls done on the ONT with the guppy model, the calls done with the dorado model, and the calls done with the hifi model. I have uploaded the pileup, alignment, and merged vcf files (except for the hifi, otherwise, it becomes very cumbersome to read).
As you see, there are, visually, 4 SNPs, and Clair3 calls them 4 with the HiFi model. However, it calls only 1 with the ONT model. I have used two models for ONT because of this issue: https://github.com/HKU-BAL/Clair3/issues/248
However, the SNPs are called in the ONT datasets at the pileup level. But it seems then they are discarded. Do you know why the model would toss with ONT when it doesn't with PacBio HiFi? Could the parameter be tuned to avoid this behaviour? Just so you know, I am assuming the HiFi is the correct call, but I have no truth set, it could be the ONT models that are correct. Visually, the HiFi model seems to be what a human would call.
The ONT called was performed with r10.4.1 5 Khz. Here are the commands I used for the output in the pic:
Do you have an idea? My intuition is that there is a mismatch between the cell tech and the model that prevents Clair3 to call what look like "very obvious" SNPs. This pattern shown here is repeated throughout the assembly, it seems.
Do you think we should redo the base call with Dorado? Even if we have a first quality peak at 5, as shown in issue #248 ? As background information it's a non-model organism with a SNP frequency of 1 every dozen bases (usually 1/30 to 1/50, with regions of high density).
Thanks a lot. I am curious to hear your opinion on why the neural net ditches those variants.
EDIT: might be an error in the parameters, closing for now