HKU-BAL / ClairS

ClairS - a deep-learning method for long-read somatic small variant calling
BSD 3-Clause "New" or "Revised" License

Questions Regarding Heterozygous Variants, Somatic Mutations, and Phasing in ClairS Usage #18

Closed: sloth-eat-pudding closed this issue 1 week ago

sloth-eat-pudding commented 6 months ago

Dear Clair Team,

I am a member of the LongPhase development team. Recently, while using ClairS and studying the related literature, I observed some results that raised a few questions.

  1. While running ont_quick_demo.sh, I used IGV to inspect the VCF produced in the process shown in Figure 1. After organizing the data, I noticed that only the tumor sample was heterozygous at these sites (as shown in Figure 2). My question is: during the germline variant calling step, is only the tumor sample identified as confident heterozygous?

  2. From Figure 2, I observed that the SNPs passed to LongPhase include some somatic mutations. I would like to confirm whether the SNPs used for phasing are indeed a mix of somatic mutations and germline variants.

  3. In Figure 3, the SNPs used for phasing seem too sparse to cover the entire range. Trying other variant callers, I found that Clair3 (v1.0.5) results cover a broader range. After phasing and haplotagging with Clair3 and LongPhase (v1.6), I obtained the results shown in Figure 4 (B) (only phased SNPs are displayed). I noticed that the different haplotypes could still be distinguished in ClairS's output (as seen in Figure 5). Have you considered using this method?

  4. For ClairS, tagging appears to be a crucial step. The ideal output would include H1/H2 and H2 (carrying somatic mutations), or H1, H2, H3 (tumor-specific), etc. Would such an approach be beneficial for the training or detection of the ClairS model? If you think it would be helpful, we are considering developing a new version focused on somatic mutations.

    Figure 1

    Figure 2

    Figure 3

    Figure 4

    Figure 5

Thank you for your time and assistance!

zhengzhenxian commented 6 months ago

Hi, @sloth-eat-pudding,

Thank you for your interest in ClairS.

  1. Our design is to select heterozygous SNPs from both normal and tumor samples for phasing and haplotagging. These signals are more likely to represent true germline heterozygous SNPs rather than somatic mutations.

  2. We should use only heterozygous germline SNPs for phasing. In some cases, somatic variants may show patterns similar to germline variants (possibly due to quality or a high normal AF) and are identified as germline by Clair3. We are actively working on excluding them from the phasing process to avoid confusing the model.

  3. It would be highly beneficial to have longer phase sets and an improved haplotagging ratio. We believe that having more phased alignments will significantly enhance performance. Do you have any hints on parameter settings that would improve phasing performance? Thanks!

  4. Currently, we categorize haplotypes into germline H1 and H2 only. It would be beneficial to include somatic haplotypes (H3) as well, but linking distant somatic variants to obtain them would be challenging. After analyzing the data, we observed that the distance between two somatic variants can range from 10 kbp to 100 kbp, which makes acquiring somatic haplotypes difficult even with ONT reads.
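
The selection rule in point 1 can be sketched in Python as follows. This is a minimal illustration, not ClairS's actual implementation: it assumes that "heterozygous in both samples" means the same SNP is genotyped as heterozygous in both the normal and the tumor calls, and the `Site` key, the `select_phasing_snps` helper, and the toy calls are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical site key: (chromosome, 1-based position, ref allele, alt allele).
Site = Tuple[str, int, str, str]

# Genotype strings treated as heterozygous (unphased or phased notation).
HET_GENOTYPES = {"0/1", "1/0", "0|1", "1|0"}


def select_phasing_snps(normal_calls: Dict[Site, str],
                        tumor_calls: Dict[Site, str]) -> List[Site]:
    """Return SNP sites called heterozygous in both the normal and the tumor sample."""
    shared = []
    for site, normal_gt in normal_calls.items():
        _chrom, _pos, ref, alt = site
        # Keep single-base substitutions only; indels are skipped.
        if len(ref) != 1 or len(alt) != 1:
            continue
        if normal_gt in HET_GENOTYPES and tumor_calls.get(site) in HET_GENOTYPES:
            shared.append(site)
    return sorted(shared)


# Toy genotype calls, invented for illustration.
normal = {
    ("chr17", 1000, "A", "G"): "0/1",  # heterozygous in both samples
    ("chr17", 2000, "C", "T"): "0/0",  # homozygous reference in the normal
    ("chr17", 3000, "G", "A"): "1/1",  # homozygous variant in both samples
}
tumor = {
    ("chr17", 1000, "A", "G"): "0/1",
    ("chr17", 2000, "C", "T"): "0/1",  # heterozygous only in the tumor
    ("chr17", 3000, "G", "A"): "1/1",
}

print(select_phasing_snps(normal, tumor))
```

In this toy input, the site at position 2000 is heterozygous only in the tumor, so the rule leaves it out of the phasing set. That is the kind of candidate somatic site that point 2 says should not be used for phasing.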

Look forward to having a new LongPhase version for somatic variant calling!

Zhenxian

sloth-eat-pudding commented 6 months ago

I apologize for not being clear earlier.

Q1. I searched normal.bam for the confident heterozygous germline variants identified by ClairS, but found that they actually include homozygous variants (as shown in Figure 2).

Q3. This is our development goal. Additionally, I noticed that Clair3's Makefile uses LongPhase v1.3. Our current version is v1.6, and I suggest upgrading to it for improvements in accuracy and processing time.

zhengzhenxian commented 6 months ago

For Q1, there seem to be no homozygous variants in the normal BAM in Figure 2. Are you referring to homozygous reference sites (that is, the same allele as the reference base)? Thanks for reporting this; we will check the details.
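
To make the terminology concrete, here is a small Python sketch that separates homozygous reference, heterozygous, and homozygous variant sites by the fraction of reads supporting the alternate allele. It is only an illustration: the `classify_genotype` helper and its 0.25/0.75 thresholds are assumptions chosen for the example, not values used by ClairS or Clair3.

```python
def classify_genotype(ref_reads: int, alt_reads: int,
                      het_low: float = 0.25, het_high: float = 0.75) -> str:
    """Label a site from read counts; thresholds are illustrative assumptions."""
    total = ref_reads + alt_reads
    if total == 0:
        return "no coverage"
    alt_fraction = alt_reads / total
    if alt_fraction < het_low:
        # Nearly all reads carry the reference base.
        return "homozygous reference"
    if alt_fraction > het_high:
        # Nearly all reads carry the alternate base.
        return "homozygous variant"
    return "heterozygous"


print(classify_genotype(30, 0))   # all reads match the reference
print(classify_genotype(15, 15))  # reads split between both alleles
print(classify_genotype(0, 30))   # all reads carry the alternate allele
```

Under these assumed thresholds, a normal-sample site where every read matches the reference base is labeled homozygous reference rather than homozygous variant.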

For Q3, thanks for the suggestion. We will update LongPhase to v1.6 in our next release.

sloth-eat-pudding commented 6 months ago

Q1. Your understanding is correct. Thank you for your confirmation.