HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling

Clair3 gives a large number of false negatives (amplicon sequencing on minION) #238

Closed jamesa47 closed 11 months ago

jamesa47 commented 1 year ago

Hello, I am using Clair3 for variant calling on amplicon sequencing data from a MinION, generally with a fair amount of success. However, for one particular amplicon, I get a large number of false negatives (I am sequencing NA12878 and using the Illumina Platinum Genome as a truth set). In the attached IGV screenshot, all of the obvious heterozygous variants are confirmed as real by Illumina, but Clair3 only calls 7 out of 25. There are no obvious off-target sites that might be amplified.
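For readers who want to quantify a gap like the "7 out of 25" above, here is a minimal sketch of a sensitivity calculation. It assumes variants have already been reduced to hashable keys such as (chrom, pos, ref, alt); the function name `sensitivity_summary` and the example keys are hypothetical and not part of Clair3.

```python
from typing import Hashable, Set, Tuple


def sensitivity_summary(truth: Set[Hashable], called: Set[Hashable]) -> Tuple[int, int, float]:
    """Return (true positives, false negatives, recall) for a call set.

    `truth` and `called` are sets of variant keys, e.g. (chrom, pos, ref, alt).
    """
    if not truth:
        raise ValueError("truth set is empty; recall is undefined")
    true_positives = len(truth & called)    # truth variants that were called
    false_negatives = len(truth - called)   # truth variants that were missed
    return true_positives, false_negatives, true_positives / len(truth)
```

With 25 truth variants of which 7 are called, this gives 7 true positives, 18 false negatives and a recall of 0.28.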

These reads were basecalled using Dorado (using the appropriate dna_r10.4.1_e8.2_400bps_hac@v4.2.0 model for my run). I am using minimap2 for mapping against a FASTA file containing my amplicon sequences. (I had other amplicons in the run, but I used samtools view to select only the reads mapping to the problematic amplicon.) For Clair3, I am using the r1041_e82_400bps_hac_v420 model (from Rerio). I am using --var_pct_full=0.1 --ref_pct_full=0.7 to try to increase sensitivity, though I may be using those flags incorrectly. The BAM files are a bit large, but I can share them via e-mail as well. The merged output VCF is also attached. Thank you for your help!

FN-amplicon merge_output.vcf.gz

aquaskyline commented 1 year ago

Yes please send me the bam if possible.

jamesa47 commented 1 year ago

Thanks for your help! I e-mailed the bam file.

zhengzhenxian commented 11 months ago

As replied in the email, when working with amplicon data, we recommend disabling phasing by using the --no_phasing_for_fa option. This helps prevent incorrect phasing in specific amplicon regions. To maximize performance, we also suggest feeding all candidates into the full-alignment network by using the --var_pct_full=1.0 and --ref_pct_full=1.0 parameters.
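As a rough illustration of the recommendation above, the following sketch assembles a run_clair3.sh command line with phasing disabled and all candidates sent to the full-alignment network. The helper `build_clair3_command` and all paths are hypothetical placeholders, and the thread count and --platform=ont setting are assumptions for an ONT run; only the flag names come from Clair3 itself.

```python
from typing import List


def build_clair3_command(
    bam: str,
    ref: str,
    model_path: str,
    output_dir: str,
    threads: int = 8,               # assumption: adjust to your machine
    disable_phasing: bool = True,   # adds --no_phasing_for_fa, as suggested for amplicons
    var_pct_full: float = 1.0,      # send all variant candidates to full-alignment
    ref_pct_full: float = 1.0,      # send all reference candidates to full-alignment
) -> List[str]:
    """Build (but do not run) an argument list for run_clair3.sh."""
    cmd = [
        "run_clair3.sh",
        f"--bam_fn={bam}",
        f"--ref_fn={ref}",
        f"--threads={threads}",
        "--platform=ont",
        f"--model_path={model_path}",
        f"--output={output_dir}",
        f"--var_pct_full={var_pct_full}",
        f"--ref_pct_full={ref_pct_full}",
    ]
    if disable_phasing:
        cmd.append("--no_phasing_for_fa")
    return cmd
```

Printing `" ".join(build_clair3_command("amplicon.bam", "amplicons.fa", "/path/to/model", "clair3_out"))` shows the command to review before running it, for example via `subprocess.run(cmd, check=True)`.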

zhengzhenxian commented 11 months ago

@jamesa47 I will close the issue, pls kindly reopen it if you have any problems, thanks!

Lipinski-B commented 5 months ago

Hello,

I'm coming back to this issue because I'm facing a similar problem. I'm using Clair3 on human amplicon sequencing data, in my case from a GridION. Here are all of the parameters:

Basecalling: Dorado 0.5.2, model dna_r10.4.1_e8.2_400bps_sup@v4.3.0
Mapping: minimap2
Clair3 version: v1.0.7
Clair3 model: r1041_e82_400bps_sup_v430
Clair3 options: --var_pct_full=1 --ref_pct_full=1 --var_pct_phasing=1

With this configuration, I'm surprised to see that a real variant is missing for one barcode in the final VCF for my target gene. This variant appears in the pileup VCF but not in the full-alignment VCF or the merged VCF.

But when I enable the --no_phasing_for_fa option as you suggest here, with otherwise the same configuration, the variant finally shows up in the full-alignment VCF and the merged VCF.
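A check like the one described here (a variant present in the pileup VCF but absent from the merged VCF) can be sketched as follows. This is a minimal, unofficial example: it assumes plain tab-separated VCF records, keys variants by (CHROM, POS, REF, ALT), and by default skips records whose FILTER is RefCall, which is an assumption about how you want reference calls treated. The function names are hypothetical.

```python
import gzip
from typing import Iterable, List, Set, Tuple

VariantKey = Tuple[str, int, str, str]


def variant_keys(lines: Iterable[str], skip_filters: Tuple[str, ...] = ("RefCall",)) -> Set[VariantKey]:
    """Collect (chrom, pos, ref, alt) keys from VCF text lines."""
    keys: Set[VariantKey] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        if len(fields) >= 7 and fields[6] in skip_filters:
            continue  # assumption: ignore records flagged as reference calls
        chrom, pos, _id, ref, alt = fields[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def read_vcf_keys(path: str) -> Set[VariantKey]:
    """Read variant keys from a plain or gzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return variant_keys(handle)


def missing_from(first: Set[VariantKey], second: Set[VariantKey]) -> List[VariantKey]:
    """Variants present in `first` but absent from `second`, sorted."""
    return sorted(first - second)
```

For a Clair3 output directory, `missing_from(read_vcf_keys("pileup.vcf.gz"), read_vcf_keys("merge_output.vcf.gz"))` would list the pileup calls that did not make it into the merged output.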

So here you suggest using the --no_phasing_for_fa option, but the documentation states: "If you are dealing with human data, set --var_pct_phasing to 1. If you are dealing with non-human data, enable the --no_phasing_for_fa option."

My question is: what is the best practice, then? What do you think about enabling the --no_phasing_for_fa option with human data?

Many thanks in advance for your answer. Best regards, Boris

aquaskyline commented 5 months ago

I believe you've found the best parameters for your usage.

Lipinski-B commented 5 months ago

Hi @aquaskyline,

Thanks for your answer.

Well, OK, but can you tell me more about the --no_phasing_for_fa option? As I asked above: what do you think is the best practice? What do you think about enabling the --no_phasing_for_fa option with human data?

Or, more to the point, why do you recommend enabling the --no_phasing_for_fa option only for non-human data?

Many thanks in advance for your answer. Best regards, Boris

aquaskyline commented 5 months ago

Amplicon sequencing is not a single sequencing scheme. It can use different panels and primers, so the genomic regions covered vary between experiments, if they are not entirely different. Most regions of the human genome benefit from phasing, while some don't. So different amplicon sequencing experiments can end up with different optimal parameter sets.

When dealing with non-human amplicon sequencing data, various factors, including an imperfect reference genome and relatively lower sample quality, might cause incorrect phasing, which in turn takes away the benefit of using phased reads for variant calling.