HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling
247 stars 27 forks source link

Variants called with HiFi but not with ONT in the merge.output.vcf #251

Closed ghost closed 11 months ago

ghost commented 11 months ago

Hello, see the screenshot. I have named the track guppy, dorado, hifi, for the calls done on the ONT with the guppy model, the calls done with the dorado model, and the calls done with the hifi model. I have uploaded the pileup, alignment, and merged vcf files (except for the hifi, otherwise, it becomes very cumbersome to read).

As you see, there are, visually, 4 SNPs, and Clair3 calls them 4 with the HiFi model. However, it calls only 1 with the ONT model. I have used two models for ONT because of this issue: https://github.com/HKU-BAL/Clair3/issues/248

However, the SNPs are called in the ONT datasets at the pileup level. But it seems then they are discarded. Do you know why the model would toss with ONT when it doesn't with PacBio HiFi? Could the parameter be tuned to avoid this behaviour? Just so you know, I am assuming the HiFi is the correct call, but I have no truth set, it could be the ONT models that are correct. Visually, the HiFi model seems to be what a human would call.

diff_hifi_ONT png

The ONT called was performed with r10.4.1 5 Khz. Here are the commands I used for the output in the pic:

MODEL_NAME="hifi_sequel2"
THREADS=24
OUTPUT_DIR=ref_Clair3
run_clair3.sh --bam_fn=ancestor_hifi.sorted.bam --ref_fn=ref.fa --output=${OUTPUT_DIR} --threads=${THREADS} --platform=hifi --model_path=/media/alessandro/Storage/MA/P2/ONT_fasta/models/${MODEL_NAME} --include_all_ctgs --gvcf

MODEL_NAME="r1041_e82_400bps_sup_v420_model"
THREADS=24
OUTPUT_DIR=Dorado_Clair3
mkdir -p ${OUTPUT_DIR}
run_clair3.sh --bam_fn=H2B4.sorted.bam --ref_fn=ref.fa --output=${OUTPUT_DIR} --threads=${THREADS} --platform="ont" --model_path=/media/alessandro/Storage/MA/P2/ONT_fasta/rerio/clair3_models/${MODEL_NAME} --include_all_ctgs --gvcf

MODEL_NAME="r1041_e82_400bps_sup_g615_model"
OUTPUT_DIR=Guppy_Clair3
mkdir -p ${OUTPUT_DIR}
run_clair3.sh --bam_fn=H2B4.sorted.bam --ref_fn=ref.fa --output=${OUTPUT_DIR} --threads=${THREADS} --platform="ont" --model_path=/media/alessandro/Storage/MA/P2/ONT_fasta/rerio/clair3_models/${MODEL_NAME} --include_all_ctgs --gvcf

Do you have an idea? My intuition is that there is a mismatch between the cell tech and the model that prevents Clair3 to call what look like "very obvious" SNPs. This pattern shown here is repeated throughout the assembly, it seems.

Do you think we should redo the base call with Dorado? Even if we have a first quality peak at 5, as shown in issue #248 ? As background information it's a non-model organism with a SNP frequency of 1 every dozen bases (usually 1/30 to 1/50, with regions of high density).

Thanks a lot. I am curious to hear your opinion on why the neural net ditches those variants.

EDIT: might be an error in the parameters, closing for now

ghost commented 11 months ago

Reopening the issue, the error in the code was inconsequential and corrected in the first message.

aquaskyline commented 11 months ago

pileup_guppy.vcf and pileup_dorado.vcf both show four variants identified but light colored (low variant QUAL?) can you show the records of the four variants in pileup_guppy.vcf and pileup_dorado.vcf?

ghost commented 11 months ago

Yes sure, guppy


zcat pileup_guppy.vcf.gz|grep "Chrom_3"|grep "9635425"
Chrom_3 9635425 .   A   .   15.49   RefCall P   GT:GQ:DP:AD:AF:PL   0/0:15:147:62:0.4218:990

 zcat pileup_guppy.vcf.gz|grep "Chrom_3"|grep "9635468" 
Chrom_3 9635468 .   G   .   11.43   RefCall P   GT:GQ:DP:AD:AF:PL   0/0:11:147:59:0.4014:990

zcat pileup_guppy.vcf.gz|grep "Chrom_3"|grep "9635470"
Chrom_3 9635470 .   T   .   14.74   RefCall P   GT:GQ:DP:AD:AF:PL   0/0:14:148:61:0.4122:990

zcat pileup_guppy.vcf.gz|grep "Chrom_3"|grep "9635492"     
Chrom_3 9635492 .   G   .   6.68    RefCall P   GT:GQ:DP:AD:AF:PL   0/0:6:148:61:0.4122:990

dorado

zcat pileup_dorado.vcf.gz|grep "Chrom_3"|grep "9635425"

Chrom_3 9635425 .   A   .   13.38   RefCall P   GT:GQ:DP:AD:AF:PL   0/0:13:147:62:0.4218:990

zcat pileup_dorado.vcf.gz|grep "Chrom_3"|grep "9635468"

Chrom_3 9635468 .   G   .   9.62    RefCall P   GT:GQ:DP:AD:AF:PL   0/0:9:147:59:0.4014:990

zcat pileup_dorado.vcf.gz|grep "Chrom_3"|grep "9635470"

Chrom_3 9635470 .   T   .   4.35    RefCall P   GT:GQ:DP:AD:AF:PL   0/0:4:148:61:0.4122:990

zcat pileup_dorado.vcf.gz|grep "Chrom_3"|grep "9635492"

Chrom_3 9635492 .   G   .   13.51   RefCall P   GT:GQ:DP:AD:AF:PL   0/0:13:148:61:0.4122:990

I don't understand, it seems like the vcf agrees there is something but it doesn't call the base. EDIT: I am assuming it's why we have a light blue like this, honestly I have never seen that before and IGV doc is not very easy to read, especially regarding their color coding.

ghost commented 11 months ago

I have tried to rerun in pileup only, and it's the same. The vcf makes an entry but then rejects it.

zcat pileup.vcf.gz|grep "Chrom_3"|grep "9635425"
zcat pileup.vcf.gz|grep "Chrom_3"|grep "9635468"
zcat pileup.vcf.gz|grep "Chrom_3"|grep "9635470"
zcat pileup.vcf.gz|grep "Chrom_3"|grep "9635492"
Chrom_3 9635425 .   A   .   15.49   RefCall P   GT:GQ:DP:AD:AF:PL   0/0:15:147:62:0.4218:990
Chrom_3 9635468 .   G   .   11.43   RefCall P   GT:GQ:DP:AD:AF:PL   0/0:11:147:59:0.4014:990
Chrom_3 9635470 .   T   .   14.74   RefCall P   GT:GQ:DP:AD:AF:PL   0/0:14:148:61:0.4122:990
Chrom_3 9635492 .   G   .   6.68    RefCall P   GT:GQ:DP:AD:AF:PL   0/0:6:148:61:0.4122:990

I don't understand on what basis the calls are rejected. I could send you the data if you want. Why are the PL so high?

As I understand, QUAL values are not low enough to cause a rejection, correct?

Thanks

aquaskyline commented 11 months ago

It seems that the pileup model has decided not to trust these variant candidates, can you do a samtools pileup to check the base qualities at these four sites?

ghost commented 11 months ago

Yes, here is the command

bcftools mpileup -f ref.fa -R region.bed -o test_pileup.vcf -O v H2B4.sorted.bam
./check.sh 
Chrom_3 9635425 .   A   G,<*>   0   .   DP=153;I16=36,30,49,38,3960,237600,5220,313200,3960,237600,5220,313200,1650,41250,2163,53919;QS=0.431373,0.568627,0;VDB=0.159833;SGB=-0.693147;RPBZ=0.0534239;MQBZ=0;MQSBZ=0;BQBZ=0;SCBZ=0.0254381;MQ0F=0   PL:DP   255,0,255,255,255,255:153
Chrom_3 9635468 .   G   A,C,<*> 0   .   DP=149;I16=35,29,47,38,3840,230400,5100,306000,3840,230400,5100,306000,1600,40000,2125,53125;QS=0.42953,0.563758,0.00671141,0;VDB=0.0880046;SGB=-0.693147;RPBZ=-0.180272;MQBZ=0;MQSBZ=0;BQBZ=0;SCBZ=-0.343401;MQ0F=0    PL:DP   255,0,255,255,255,255,255,255,255,255:149
Chrom_3 9635470 .   T   C,<*>   0   .   DP=150;I16=35,30,47,38,3900,234000,5100,306000,3900,234000,5100,306000,1600,40000,2125,53125;QS=0.433333,0.566667,0;VDB=0.0857425;SGB=-0.693147;RPBZ=-0.0284505;MQBZ=0;MQSBZ=0;BQBZ=0;SCBZ=-0.323955;MQ0F=0 PL:DP   255,0,255,255,255,255:150
Chrom_3 9635492 .   G   A,<*>   0   .   DP=154;I16=36,29,49,40,3900,234000,5340,320400,3900,234000,5340,320400,1625,40625,2222,55484;QS=0.422078,0.577922,0;VDB=0.190081;SGB=-0.693147;RPBZ=-0.367732;MQBZ=0;MQSBZ=0;BQBZ=0;SCBZ=-0.381925;MQ0F=0   PL:DP   255,0,255,255,255,255:154

(the check.sh simply contains my grep pipes).

EDIT: what metrics exactly interest you here? Let me know if you need more in-depth analysis or something specific, I am not totally sure of "base qualities" (what do you think it covers? )

aquaskyline commented 11 months ago

I saw that you were using bcftools mpileup, could you please try samtools mpileup

ghost commented 11 months ago

Sorry, I am so used to using bcftools mpileup that I misread your question. Here it is

samtools mpileup -l region.bed -f ref.fa H2B4.sorted.bam > pileup.outfile
./check.sh 
Chrom_3 9635425 A   153 gG..,,,g.gGgG,GGggG.g.gG.gGgG..,ggG,g.,..gGG,g,....gGGGg,G,.,,ggg.gG,..,GGg.G.GgG.,GgGgG.g,,,GGg,G.G,GGg.,GGGggGG....GG.GGGG,.,Gg.g,.gg,.,gGg,GG,.GgG,.Gg   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Chrom_3 9635468 G   153 aA..,,,a.aAa*,AAaa.-1C.a.aAAaAaA..,aaC,a.,..aAA,a,....aAAAa,A,.,,aaa.aA,..,AAa.A.AaA.,AaAaA.a,*,A*a,A.A,AAa.,AAAaaAAA...AA.AAAA,.,Aa.a,.aa,.,aAa,AA,.*aA,.Aa    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Chrom_3 9635470 T   154 cC..,,,c.cCc*,CCcc..c.cCCcCcC..,ccC,c.,..cCC,c,....cCCCc,C,.,,ccc.cC,..,CCc.C.CcC.,CcCcC.c,*,C*c,C.C,CCc.,CCCccCCC...CC.CCCC,.,Cc.c,.cc,.,cCc,CC,.*cC,.Cc^],    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Chrom_3 9635492 G   154 aA..,,,a.aAaA,AAaa..a.aAAaAaA..,aaA,a.,..aA.,a,....aAAAa,A,.,,aaa.aA,..,AAa.A.AaA.,AaAaA.a,a,AAa,A.A,AAa.,AAAaaAAA...AA.AAAA,.,Aa.a,.aa,.,aAa,AA,.AaA,.Aaa  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
aquaskyline commented 11 months ago

The base qualities don't look right.

ghost commented 11 months ago

I am sorry, can you elaborate? I am not so used to interpret this, in ASCII alphabet this is the separator value, which I guess is what is wrong here?

aquaskyline commented 11 months ago

The base qualities are '\~' for all bases as shown in your pileup. '\~' is the largest value of a singed 8-bit integer and is not a sensible base quality in all sequencing technologies.

ghost commented 11 months ago

Ok, thanks, so if I follow, this points to a problem with basecalling from the nanopore cell data, right?

aquaskyline commented 11 months ago

Please double check the base qualities of the reads covering the four variants of interest. I don't know the source of your data so I don't know exactly what caused the wrong base qualities.

ghost commented 11 months ago

Please double check the base qualities of the reads covering the four variants of interest. I don't know the source of your data so I don't know exactly what caused the wrong base qualities.

It's in-house sequencing; we are going to investigate. Thank you for your constant help throughout this. I have to say I am puzzled. I close the issue for now, will re-open if necessary. Thanks again