HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling

Send adjacent SNP and Indel candidates both to full-alignment model #260

Open haraldgrove opened 10 months ago

haraldgrove commented 10 months ago

I've been running Clair3 (v1.01) on some pig samples with ONT reads. I noticed that Clair3 has been assigning the "1/1" genotype to SNVs whose allele frequency would lead me to expect a "0/1" genotype.

Example:

Ssc08   3369224 .       A       C       25.76   PASS    P       GT:GQ:DP:AD:AF:PL       1/1:25:42:22,20:0.4762:63,45,0
Ssc08   3369498 .       C       G       22.15   PASS    P       GT:GQ:DP:AD:AF:PL       1/1:22:43:20,20:0.4651:61,44,0
Ssc08   3374209 .       A       C       8.15    PASS    F       GT:GQ:DP:AD:AF:PL       1/1:8:49:24,23:0.4694:0,29,2
Ssc08   3374212 .       G       A       5.95    PASS    F       GT:GQ:DP:AD:AF:PL       1/1:5:49:25,23:0.4694:1,22,0

Do you have any idea why this might be happening?

-Best regards Harald

aquaskyline commented 10 months ago

Clair3 determines the genotype from AF and many other factors. Reads supporting the reference allele can be wrong because of sequencing or alignment errors. If Clair3's model judges that many of the reads supporting the reference allele are likely to be erroneous, it may call the genotype 1/1 instead of 0/1. Clair3 might or might not be correct about your examples, but please look at them more closely, because there is evidence that discredits the reads supporting the reference allele.
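
As a rough way to find calls that deserve this closer look, the sketch below flags records whose genotype is homozygous alternate while the reported AF sits in a range more typical of a heterozygous site. It uses the four records quoted above; the helper names and the 0.3 to 0.7 window are assumptions chosen for illustration, not thresholds used by Clair3.

```python
# Minimal sketch: flag 1/1 calls whose AF looks heterozygous.
# The AF window (0.3-0.7) is a hypothetical review threshold, not a Clair3 setting.

RECORDS = [
    "Ssc08 3369224 . A C 25.76 PASS P GT:GQ:DP:AD:AF:PL 1/1:25:42:22,20:0.4762:63,45,0",
    "Ssc08 3369498 . C G 22.15 PASS P GT:GQ:DP:AD:AF:PL 1/1:22:43:20,20:0.4651:61,44,0",
    "Ssc08 3374209 . A C 8.15 PASS F GT:GQ:DP:AD:AF:PL 1/1:8:49:24,23:0.4694:0,29,2",
    "Ssc08 3374212 . G A 5.95 PASS F GT:GQ:DP:AD:AF:PL 1/1:5:49:25,23:0.4694:1,22,0",
]


def parse_record(line):
    """Return (chrom, pos, sample_dict) for a single-sample VCF data line."""
    fields = line.split()
    keys = fields[8].split(":")      # FORMAT column
    values = fields[9].split(":")    # first (only) sample column
    return fields[0], int(fields[1]), dict(zip(keys, values))


def is_suspicious_hom(sample, low=0.3, high=0.7):
    """True when GT is homozygous ALT but AF falls in the het-like window."""
    return sample["GT"] in ("1/1", "1|1") and low <= float(sample["AF"]) <= high


flagged = []
for line in RECORDS:
    chrom, pos, sample = parse_record(line)
    if is_suspicious_hom(sample):
        flagged.append((chrom, pos))

for chrom, pos in flagged:
    print(f"{chrom}:{pos} is called 1/1 but AF looks heterozygous; inspect manually")
```

A flagged site is only a candidate for inspection (for example in IGV); it does not by itself show that the call is wrong.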

haraldgrove commented 10 months ago

Thank you. I discovered that there was an uncalled deletion covering the same position, so the reference allele was not present at all. It's still a confusing variant, but maybe there is not much to do about it.

haraldgrove commented 10 months ago

A follow-up on the issue I had with understanding the AF output from Clair3.

I found another location where Clair3 is calling a homozygous insertion, but where an inspection in IGV seems to strongly suggest it to be a heterozygous insertion instead. It also seems as if Clair3 is not discarding any reads as unreliable since the total number of reads is the same in both instances.

The variant: Ssc13 147177599 . T TTAA 21.09 PASS P GT:GQ:DP:AD:AF:PL 1/1:21:43:2,40:0.9302:63,40,0

I'm wondering if there is any way for me to know what Clair3 is basing its decision on?

Attachment: ssc13_frosk_147177599

zhengzhenxian commented 10 months ago

Hi, @haraldgrove,

Could you please provide us with the pileup result of the flanking 10bp windows? This will help us pinpoint the genotype issue more accurately. To obtain the mpileup result using samtools, you can use the following command:

samtools mpileup  --min-MQ 5 --min-BQ 0 --ff 2316 -r Ssc13:147177594-147177604 ${BAM}
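
For readers unfamiliar with the `--ff 2316` value: it is a bitmask of SAM flags, and reads with any of those bits set are skipped. The sketch below decodes the value using the flag bits defined in the SAM specification; `decode_flags` is a helper name introduced here for illustration.

```python
# Decode a SAM flag bitmask into flag names (bit values from the SAM specification).
SAM_FLAGS = {
    0x1: "PAIRED",
    0x2: "PROPER_PAIR",
    0x4: "UNMAP",
    0x8: "MUNMAP",
    0x10: "REVERSE",
    0x20: "MREVERSE",
    0x40: "READ1",
    0x80: "READ2",
    0x100: "SECONDARY",
    0x200: "QCFAIL",
    0x400: "DUP",
    0x800: "SUPPLEMENTARY",
}


def decode_flags(value):
    """Return the names of all flag bits set in `value`, lowest bit first."""
    return [name for bit, name in SAM_FLAGS.items() if value & bit]


# 2316 = 4 + 8 + 256 + 2048
print(decode_flags(2316))
```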

aquaskyline commented 10 months ago

Thanks @haraldgrove, your case shows an insertion immediately following a deletion, and it looks like a bug in Clair3 to me. Besides the mpileup results, could you please also send us a mini BAM covering the case? Thanks!

haraldgrove commented 10 months ago

Thank you @aquaskyline and @zhengzhenxian for following up on this.

The mpileup output looks like this:

Ssc13   147177594       G       44      .,.,..,.,.,,....,,,,,,..,...,,,,.,,.,..,...,    C&:6:9<7A5;:<4?<263;A!?;%A15=6<8@!<!/;;?38<=
Ssc13   147177595       T       44      .,.,..,.,.,,....,,,,,,..c...,,,,.,,.,..,...,    @&9+<)=7A5=??.>=)5/:F!A>%A14?8>9A!<!0>:A4<=?
Ssc13   147177596       T       44      .,.,..,.,.,,....,,,,,,..,...,,,,.,,.,..,...,    A):(>&A7A2@>?)?=&6.9{!B?'@14@9>;@#<!1A:A;<=>
Ssc13   147177597       A       44      .,.,..,.,.+2GT,,....,,,,,,..,..-2AT.,,,,.,,.,..,...,    A*8)A%B8@+A@@(=>%6.;E!@@(@,5@6><A#=!1@;B5<6?
Ssc13   147177598       A       44      .-1T,-1t.,..,.-1T,.,-1t,-1t.-1T.-1T.-1T.,,-1t,,-1t,-1t,.-1T.,.-1T*.,,,,.-1T,,-1t.+2AT,-1t.-1TG,-1t..-1T.,-1t    @*8(B#A:<!==B);>#5)9D!A>(@*6@5<?@!8!0A2I3=+<
Ssc13   147177599       T       44      **.,.+3TAA.+3TAA,+3taa*,+3taa.*****.+3TAA,+3taa*,**,+3taa*.,**.,+3taa,+3taa,+3taa,+3taa*,*.+2AA**.+1G*.+3TAA*.+3TAA*    !!7'@"?!?!!!!!!?"!(!!!!='!*6@4=?!!!!!!2!3!+!
Ssc13   147177600       T       44      .,.,..,.,.,,....,,,,,,..,...,,,,.,+2aa,.,.G,...,        !!8&$!!!$!!!!!!$!!*!!!!?+!*7$$$$!!!!!!!!$!$!
Ssc13   147177601       T       44      .,.,..,.,.,,....,,,+1c,,,..,...,,,,.,,.,..,..., !$6(C!5$;!$!$$$?!$)$!!$>)$*6=064$!!!$$!!4$+$
Ssc13   147177602       T       44      .,C,..+1C,.,.-1C,,....,+1c,,,,,.C,...,,,,.,,.,.A,...,   !'7)@!;';!'''''B!')''!'>+'+18114'!'!''#'5'('
Ssc13   147177603       C       44      .+2AG,.g..,.,*,,....,,g,,,..,...,,,,.,+1t,.,..,...,     !,**B+:@>)<8{=D@-5)9@!{).C(47219>!4!1D%>3@+<
Ssc13   147177604       A       44      .,.t..,.,.,,....,,c,,,..,...,,,,.t,.,..,...,    /5)+6(;4>)?8C4<?.9)9?!:(.?&/73:;*!8!5,&?-6,<

The BAM file for the region is attached: ssc13_region.bam.gz
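
To make the read-bases column easier to interpret, here is a minimal sketch that tallies the observations in one mpileup row: reference matches, mismatches, `*` placeholders for deleted bases, and the insertions or deletions that start after a base. It follows the samtools pileup encoding; `summarize_pileup` is a helper written for this illustration and does not reproduce how Clair3 derives AD or DP.

```python
import re
from collections import Counter

# Read-bases column of the Ssc13:147177599 row from the output above.
ROW_147177599 = (
    "**.,.+3TAA.+3TAA,+3taa*,+3taa.*****.+3TAA,+3taa*,**,+3taa*.,**.,"
    "+3taa,+3taa,+3taa,+3taa*,*.+2AA**.+1G*.+3TAA*.+3TAA*"
)


def summarize_pileup(bases):
    """Count per-read observations in a samtools mpileup read-bases string."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":                 # read start marker, followed by a MAPQ character
            i += 2
        elif c == "$":               # read end marker
            i += 1
        elif c in "+-":              # indel starting after the previous base
            digits = re.match(r"\d+", bases[i + 1:]).group()
            start = i + 1 + len(digits)
            seq = bases[start:start + int(digits)]
            counts[("ins" if c == "+" else "del", seq.upper())] += 1
            i = start + int(digits)
        elif c in ".,":              # match to reference (forward / reverse strand)
            counts["ref"] += 1
            i += 1
        elif c in "*#":              # base deleted in this read
            counts["deleted"] += 1
            i += 1
        elif c in "<>":              # reference skip (e.g. spliced alignment)
            counts["skip"] += 1
            i += 1
        else:                        # any other letter is a mismatch
            counts[("mismatch", c.upper())] += 1
            i += 1
    return counts


summary = summarize_pileup(ROW_147177599)
for key, n in summary.most_common():
    print(key, n)
```

On this row the tally gives 23 reference-matching bases, 15 of which are followed by an insertion (13 of them `+TAA`), and 21 `*` placeholders for reads in which the base is deleted, which adds up to the reported depth of 44.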

zhengzhenxian commented 10 months ago

Hi, @haraldgrove,

Thanks for providing the BAM. It seems the Clair3 pileup model made an incorrect zygosity prediction because the insertion variant was immediately followed by a deletion.

Could you try adding the --var_pct_full=1.0 option to send all pileup candidates to the more reliable full-alignment model? We expect the full-alignment network to perform better in such cases.

aquaskyline commented 10 months ago

@haraldgrove please let us know whether --var_pct_full=1.0 gives the correct genotype for your variants. If it does, we will force this type of variant to go through the full-alignment model to solve the problem in the long run.

haraldgrove commented 10 months ago

Hi.

Thank you for the suggestion. The full-alignment mode fixed the problem.