HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling

An idea to reduce FPs for calls near the end of homopolymer #244

Closed — ymcki closed this issue 11 months ago

ymcki commented 1 year ago

Dear Clair3 team,

 As a heavy user of ONT's wf-human-variation pipeline, I am also indirectly a heavy user of Clair3.

I noticed quite a few false positive variant calls (mostly single-base indels) near the ends of homopolymers, which were a real distraction when looking for disease-causing variants. I understand this is mostly due to the nature of nanopore technology, but it would be great if Clair3 could also do something about it.

Since the wf-human-variation pipeline also generates a haplotagged BAM from whatshap (i.e., each aligned read is assigned to a haplotype based on the heterozygous variant calls it covers), I am able to visualize what is going on for variant calls near the end of a homopolymer. With the help of the phasing information from whatshap, I think I can pinpoint most of the false positive calls intuitively and significantly reduce the false positive rate.

I am thinking it may be possible for Clair3 to take this haplotagged BAM as input and remove the wrong calls near the end of a homopolymer. It would also be great if a fixed BAM file were output for visualization in IGV.
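
For concreteness, here is a minimal sketch (not Clair3 or wf-human-variation code; the file name, coordinates, and helper name are hypothetical) of how per-haplotype support for a candidate indel could be tallied from a whatshap-haplotagged BAM via the HP read tag, using pysam:

```python
import pysam
from collections import Counter

def haplotype_indel_support(bam_path, chrom, pos0):
    """Count reads with/without an indel at a 0-based position, split by HP tag."""
    support = Counter()  # keys: (haplotype, has_indel); haplotype 0 = untagged read
    with pysam.AlignmentFile(bam_path, "rb") as bam:  # BAM must be indexed
        for col in bam.pileup(chrom, pos0, pos0 + 1, truncate=True):
            for read in col.pileups:
                aln = read.alignment
                hp = aln.get_tag("HP") if aln.has_tag("HP") else 0
                has_indel = read.indel != 0  # insertion/deletion immediately after this base
                support[(hp, has_indel)] += 1
    return support

# e.g. inspect one candidate call from the haplotagged BAM
# print(haplotype_indel_support("haplotagged.bam", "chr1", 1234566))
```

If indel-supporting reads show up on both haplotypes at similar rates, that is the pattern I would expect from homopolymer noise rather than from a heterozygous variant.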

In the long run, perhaps Clair3 could also do the phasing itself and correct these errors in one go?

Thank you very much for your time.

aquaskyline commented 1 year ago

It's a double-edged sword. While some FPs can easily be caught and corrected by eye (FP -> TN), implementing the same rule as a filter applied to all variants could also turn a considerable number of TPs into FNs.

ymcki commented 1 year ago

I think FPs caused by homopolymers are likely to be distributed evenly across the two haplotypes, whereas true mutations should be concentrated in only one of the two haplotypes.

Therefore, I think a simple statistical test should be able to distinguish the two when coverage is high enough.
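
As a hedged illustration only (not something Clair3 does today; the thresholds and function name are made up), such a test could be a simple binomial test on how the alt-supporting reads of a candidate heterozygous call split between the two haplotypes:

```python
from scipy.stats import binomtest

def looks_like_het_variant(alt_hp1, alt_hp2, alpha=0.05, min_alt=10):
    """True if alt reads are significantly skewed toward one haplotype (het-like),
    False if they are balanced (homopolymer-noise-like), None if coverage is too low."""
    n = alt_hp1 + alt_hp2
    if n < min_alt:
        return None  # not enough alt reads to decide
    # Two-sided test of the null hypothesis that alt reads split 50/50 between haplotypes
    return binomtest(alt_hp1, n, p=0.5).pvalue < alpha

# 9 vs 8 alt reads: balanced split, consistent with a homopolymer artifact
# print(looks_like_het_variant(9, 8))   # False
# 15 vs 1 alt reads: strong skew, consistent with a true heterozygous call
# print(looks_like_het_variant(15, 1))  # True
```

The discriminating power depends on coverage, as noted above: with only a handful of alt reads the split is uninformative, hence the min_alt guard.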

Of course, if you want to go the deep-learning route, you could also use manual inspection to generate a truth set to train on.