HKU-BAL / ClairS

ClairS - a deep-learning method for long-read somatic small variant calling
BSD 3-Clause "New" or "Revised" License

Enhancing somatic variant calling and execution speed #22

sloth-eat-pudding closed this issue 1 month ago

sloth-eat-pudding commented 5 months ago

Hello,

I am working with HCC1395 data, with the tumor sample at 75x coverage and the normal sample at 45x coverage. I ran Clair3 on normal.bam to generate normal.vcf, used that file to phase and haplotag tumor.bam, and then ran a somatic mutation caller. False positives dropped from 15001 to 11593, while recall stayed nearly unchanged.

| VCF used for phasing and haplotagging | Precision | Recall | F1-score | TP | FP | FN |
|---|---|---|---|---|---|---|
| ClairS germline.vcf | 67.12% | 77.64% | 72.00% | 30626 | 15001 | 8821 |
| Clair3 normal.vcf | 72.50% | 77.46% | 74.90% | 30556 | 11593 | 8891 |
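
For anyone who wants to check the numbers, here is a minimal sketch that recomputes precision, recall, and F1-score from the TP/FP/FN counts in the table above. The function name `metrics` is just for illustration, and I am assuming the usual definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = harmonic mean of the two).

```python
def metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) as fractions in [0, 1]."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (TP, FP, FN) counts copied from the table above.
rows = {
    "ClairS germline.vcf": (30626, 15001, 8821),
    "Clair3 normal.vcf": (30556, 11593, 8891),
}

for name, (tp, fp, fn) in rows.items():
    p, r, f1 = metrics(tp, fp, fn)
    print(f"{name}: precision={p:.2%} recall={r:.2%} F1={f1:.2%}")
```

With these definitions the recomputed percentages should agree with the table to two decimal places.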

In one case where a false positive became a true negative, the variant was heterozygous in the normal sample but homozygous in the tumor sample. This suggests a loss of heterozygosity (LOH) event, so phasing and tagging most of the reads into the same haplotype appears to be the correct behavior. Have you considered this method?
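
To make the genotype pattern concrete, here is a small sketch of the check I have in mind: heterozygous in the normal sample and homozygous in the tumor sample. The helper names and the VCF-style genotype strings are hypothetical, and this is only an illustration of the pattern, not how ClairS or LongPhase detects LOH.

```python
def alleles(gt: str) -> list[str]:
    """Split a VCF-style genotype such as '0/1' or '0|1' into its alleles."""
    return gt.replace("|", "/").split("/")


def is_loh_candidate(normal_gt: str, tumor_gt: str) -> bool:
    """True when the normal genotype is heterozygous and the tumor genotype is homozygous.

    Genotypes with a missing allele ('.') are treated as unknown and never flagged.
    """
    normal, tumor = alleles(normal_gt), alleles(tumor_gt)
    if "." in normal or "." in tumor:
        return False
    normal_het = len(set(normal)) > 1
    tumor_hom = len(set(tumor)) == 1
    return normal_het and tumor_hom


print(is_loh_candidate("0/1", "1/1"))  # het in normal, hom in tumor
print(is_loh_candidate("0/1", "0/1"))  # het in both samples
```

A site flagged this way is only a candidate, since a real call would also have to look at read support and allele fractions.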


Moreover, I noted in the literature that the primary reason for choosing LongPhase for phasing is its speed. We also have a speed advantage in haplotagging. ClairS parallelizes at the chromosome level, and we can add an option to restrict haplotagging to a specified region. Could this reduce the training costs for you? I also ran a haplotagging comparison between WhatsHap and LongPhase, and the results show no significant difference.

| Haplotag tool | Precision | Recall | F1-score | TP | FP | FN |
|---|---|---|---|---|---|---|
| whatshap v1.7 | 67.12% | 77.64% | 72.00% | 30626 | 15001 | 8821 |
| longphase v1.3 | 67.27% | 77.62% | 72.07% | 30617 | 14897 | 8830 |

aquaskyline commented 5 months ago

Hi LongPhase team. Thanks for asking. We spoke in more detail via email. ClairS is ready to make use of additional HP tags beyond the current HP1 and HP2. Basically, there is no limit to the number of HP categories ClairS can take. For parallelization, supporting range processing sounds good. ClairS will most likely use it on a per-chromosome basis.

sloth-eat-pudding commented 4 months ago

We have released version 1.7. Haplotag now includes the --region feature.

The complete list of haplotag parameters

Usage:  haplotag [OPTION] ... READSFILE
      --help                          display this help and exit.

require arguments:
      -s, --snp-file=NAME             input SNP vcf file.
      -b, --bam-file=NAME             input bam file.
      -r, --reference=NAME            reference fasta.
optional arguments:
      --tagSupplementary              tag supplementary alignment. default:false
      --sv-file=NAME                  input phased SV vcf file.
      --mod-file=NAME                 input a modified VCF file (produced by longphase modcall and processed by longphase phase).
      -q, --qualityThreshold=Num      not tag alignment if the mapping quality less than threshold. default:1
      -p, --percentageThreshold=Num   the alignment will be tagged according to the haplotype corresponding to most alleles.
                                      if the alignment has no obvious corresponding haplotype, it will not be tagged. default:0.6
      -t, --threads=Num               number of thread. default:1
      -o, --out-prefix=NAME           prefix of phasing result. default:result
      --region=REGION                 tagging include only reads/variants overlapping those regions. default:(all regions)
      --log                           an additional log file records the result of each read. default:false
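
As a rough sketch of per-chromosome use, the snippet below builds one `longphase haplotag` command per chromosome using only the options listed above (`-s`, `-b`, `-r`, `-t`, `-o`, `--region`). The file names, the output prefix, and the helper `haplotag_commands` are hypothetical, and passing a bare chromosome name as the region value is an assumption on the caller's side. The commands are only printed here, not executed.

```python
import shlex


def haplotag_commands(snp_vcf, bam, reference, chromosomes, threads=4, prefix="tagged"):
    """Build one `longphase haplotag` argument list per chromosome."""
    commands = []
    for chrom in chromosomes:
        commands.append([
            "longphase", "haplotag",
            "-s", snp_vcf,
            "-b", bam,
            "-r", reference,
            "-t", str(threads),
            f"--region={chrom}",          # assumed: a bare chromosome name is accepted
            "-o", f"{prefix}.{chrom}",    # hypothetical per-chromosome output prefix
        ])
    return commands


# Hypothetical input file names, for illustration only.
commands = haplotag_commands("normal.vcf", "tumor.bam", "ref.fa", ["chr1", "chr2"])
for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list could then be passed to `subprocess.run(cmd, check=True)`, for example from a `concurrent.futures.ThreadPoolExecutor`, to run the chromosomes in parallel.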

zhengzhenxian commented 4 months ago

@sloth-eat-pudding Glad to see the new release of LongPhase. Our team will test it and get back to you.

ZX

sloth-eat-pudding commented 4 months ago

In a comment on another issue (https://github.com/HKU-BAL/ClairS/issues/18#issuecomment-1893671596), it was mentioned that a "longer phaseset and an improved haplotagging ratio" were needed. I therefore tried incorporating indels into phasing and haplotagging.

Explanation of data sources

| snp.vcf | indel.vcf | phase | haplotag | Precision | Recall | F1-score | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|---|
| ClairS | - | longphase v1.3 | whatshap v1.7 | 67.12% | 77.64% | 72.00% | 30626 | 15001 | 8821 |
| ClairS | - | longphase v1.3 | longphase v1.3 | 67.27% | 77.62% | 72.07% | 30617 | 14897 | 8830 |
| ClairS | - | longphase v1.7 | whatshap v1.7 | 67.27% | 77.62% | 72.07% | 30619 | 14899 | 8828 |
| ClairS | indel (normal & tumor) | longphase v1.7-indel | whatshap v1.7 | 67.44% | 77.57% | 72.15% | 30599 | 14770 | 8848 |
| ClairS | indel (normal & tumor) | longphase v1.7-indel | longphase v1.3 (no tag indel) | 67.48% | 77.57% | 72.18% | 30601 | 14745 | 8846 |
| ClairS | indel (normal & tumor) | longphase v1.7-indel | longphase v1.7 (tag indel) | 67.75% | 77.52% | 72.31% | 30578 | 14553 | 8869 |
| ClairS | indel (normal all) | longphase v1.7-indel | longphase v1.7 (tag indel) | 68.80% | 77.44% | 72.87% | 30548 | 13853 | 8899 |
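
To put the last row in context, this short sketch computes the relative change in TP and FP between the first row (no indels, longphase v1.3 phasing, whatshap v1.7 haplotagging) and the last row (indel (normal all), longphase v1.7 for both steps). The counts are copied from the table above. The helper name `relative_change` is just for illustration.

```python
def relative_change(new: int, old: int) -> float:
    """Relative change from `old` to `new`, as a signed fraction."""
    return (new - old) / old


# Counts copied from the first and last rows of the table above.
first_row = {"TP": 30626, "FP": 15001}
last_row = {"TP": 30548, "FP": 13853}

fp_change = relative_change(last_row["FP"], first_row["FP"])
tp_change = relative_change(last_row["TP"], first_row["TP"])
print(f"FP change: {fp_change:+.2%}")
print(f"TP change: {tp_change:+.2%}")
```

In other words, the drop in false positives is large relative to the small loss in true positives.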

Would you be interested in trying to incorporate indels as well?

aquaskyline commented 4 months ago

Yes, doing that in the next version.