HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling
243 stars 26 forks source link

Calculated allele frequency in VCF file #119

Closed SebastianMeyer1989 closed 2 years ago

SebastianMeyer1989 commented 2 years ago

Hello,

I am using Clair3 to call variants in microbial data. I have a position in my reference (the base T), where the bam-file (generated with minimap2) tells me, that I have a total read count of 268 reads. In this position, 6 of the reads have the alternative base C. In the VCE file, generated with Clair3, this position is now called as heterozygous with an allele frequency of 0.2222. EFAU004_01163 102 . T C 6.77 PASS F GT:GQ:DP:AF 0/1:6:9:0.2222

Can you tell me, what the reason could be for this high allele frequency? By my calculation 6 of 268 reads should be around 0.02 (and with that below the 8% threshold for calling a variant). Maybe I missed some parameter, which I need to modify.

zhengzhenxian commented 2 years ago

Hi,

When scanning the BAM alignment, Clair3 discarded secondary alignments, supplementary alignments and the reads with mapping quality < 5. And for high-coverage(>144 for pileup and >89 for full-alignment) candidates, we apply samtools subsampling on them, the output DP and AF tags were also calculated after filtering. Might you try to mpileup the position with following command to check whether any read is filtered:

samtools mpileup --max-depth 144 --excl-flags 2316 --min-MQ 5 --min-BQ 0 -r EFAU004_01163:102-103 ${BAM}
SebastianMeyer1989 commented 2 years ago

Thank you for your fast reply. I definitely have a lot of secondary/supplementary alignments in my setup, so this could really be it. I will check with your provided command. Is it possible, to tell Clair3 to not discard those secondary/supplementary alignments?

Edit: For the mentioned reference I have 11 "normal" alignments and 278 supplementary alignments. Since I am mapping large Nanopore reads against single genes (instead of a whole genome) and a lot of those genes are located on the same reads, there are many supplementary alignments. Which I want to keep and use for the variant calling.

zhengzhenxian commented 2 years ago

We excluded secondary/supplementary alignments using samtools flag. We have not exposed this option for users now, you might need to modify SAMTOOLS_VIEW_FILTER_FLAG from 2316 to 12 in two lines (https://github.com/HKU-BAL/Clair3/blob/main/shared/param_p.py#L39 and https://github.com/HKU-BAL/Clair3/blob/main/shared/param_f.py#L33) to include secondary/supplementary alignments.

Pls noted that the changes might affect the performance significantly so please check the outputs carefully.

SebastianMeyer1989 commented 2 years ago

Cool. This could be a good solution for my issue. And of course I will carefully check the output for strange behaviours.

I have a question to this changing of FLAGs: 2316 means unmapped reads, unmapped mates, supplementary alignments and secondary alignments are filtered out. 12 means only unmapped reads and unmapped mates are filtered out. If I wanted to keep the supplementary alignments, but not the secondary alignments, I should probably change it into 268 (instead of 12), if I calculated this correctly. Or would that interfere with anything else, regarding the programming of Clair3 (in comparison to changing the FLAG to 12)?

Thanks again for your fast and helpfull reply. :)

aquaskyline commented 2 years ago

Changing the FLAG won't mess up with Clair3, but it might slow down Clair3 a bit if there are plenty of supplementary alignments.