sloth-eat-pudding closed this issue 4 months ago
Hi LongPhase team. Thanks for asking. We spoke in more detail via email. ClairS is ready to make use of additional HP tags beyond the current HP1 and HP2; in principle, there is no limit to the number of HP categories ClairS can take. For parallelization, support for range processing sounds good. ClairS will most likely use it per chromosome.
We have released version 1.7.
Haplotag now includes the --region feature.
The complete list of haplotag parameters:
Usage: haplotag [OPTION] ... READSFILE
--help display this help and exit.
require arguments:
-s, --snp-file=NAME input SNP vcf file.
-b, --bam-file=NAME input bam file.
-r, --reference=NAME reference fasta.
optional arguments:
--tagSupplementary tag supplementary alignment. default:false
--sv-file=NAME input phased SV vcf file.
--mod-file=NAME input a modified VCF file (produced by longphase modcall and processed by longphase phase).
-q, --qualityThreshold=Num not tag alignment if the mapping quality less than threshold. default:1
-p, --percentageThreshold=Num the alignment will be tagged according to the haplotype corresponding to most alleles.
if the alignment has no obvious corresponding haplotype, it will not be tagged. default:0.6
-t, --threads=Num number of thread. default:1
-o, --out-prefix=NAME prefix of phasing result. default:result
--region=REGION tagging include only reads/variants overlapping those regions. default:(all regions)
--log an additional log file records the result of each read. default:false
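To illustrate the per-chromosome use mentioned above, here is a minimal sketch that builds one haplotag command per region using only flags from the help text. The file names, the `tagged_` output prefix, and the helper name `haplotag_commands` are hypothetical, and passing a bare contig name such as `chr1` to `--region` is an assumption, since the accepted region syntax is not spelled out here.

```python
import shlex


def haplotag_commands(snp_vcf, bam, reference, regions, threads=1):
    """Build one `longphase haplotag` argv list per region (hypothetical helper)."""
    commands = []
    for region in regions:
        commands.append([
            "longphase", "haplotag",
            "-s", snp_vcf,               # input SNP vcf file
            "-b", bam,                   # input bam file
            "-r", reference,             # reference fasta
            "-t", str(threads),          # number of threads
            f"--region={region}",        # assumed: a bare contig name is accepted
            "-o", f"tagged_{region}",    # hypothetical per-region output prefix
        ])
    return commands


# Print the commands instead of running them; file names are placeholders.
for cmd in haplotag_commands("normal.vcf", "tumor.bam", "ref.fa", ["chr1", "chr2"], threads=4):
    print(shlex.join(cmd))
```

Each printed command could then be dispatched by whatever job runner is already in use, so that chromosomes are tagged in parallel.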
@sloth-eat-pudding Glad to have the new release of LongPhase, our team will test it and get back to you.
ZX
In https://github.com/HKU-BAL/ClairS/issues/18#issuecomment-1893671596, it was mentioned that a "longer phaseset and an improved haplotagging ratio" were needed. Therefore, I attempted to incorporate indels for phasing and haplotagging.
Explanation of data sources
normal & tumor: used only if the chromosome, position, and genotype are identical.
normal all: uses all indels from the normal.

snp.vcf | indel.vcf | phase | haplotag | Precision | Recall | F1-score | TP | FP | FN |
---|---|---|---|---|---|---|---|---|---|
ClairS | - | longphase v1.3 | whatshap v1.7 | 67.12% | 77.64% | 72.00% | 30626 | 15001 | 8821 |
ClairS | - | longphase v1.3 | longphase v1.3 | 67.27% | 77.62% | 72.07% | 30617 | 14897 | 8830 |
ClairS | - | longphase v1.7 | whatshap v1.7 | 67.27% | 77.62% | 72.07% | 30619 | 14899 | 8828 |
ClairS | indel (normal & tumor) | longphase v1.7-indel | whatshap v1.7 | 67.44% | 77.57% | 72.15% | 30599 | 14770 | 8848 |
ClairS | indel (normal & tumor) | longphase v1.7-indel | longphase v1.3(no tag indel) | 67.48% | 77.57% | 72.18% | 30601 | 14745 | 8846 |
ClairS | indel (normal & tumor) | longphase v1.7-indel | longphase v1.7(tag indel) | 67.75% | 77.52% | 72.31% | 30578 | 14553 | 8869 |
ClairS | indel (normal all) | longphase v1.7-indel | longphase v1.7(tag indel) | 68.80% | 77.44% | 72.87% | 30548 | 13853 | 8899 |
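
For reference, the Precision, Recall, and F1-score columns follow from the TP, FP, and FN counts by the standard definitions. The short sketch below recomputes them for the last row; the function name is just for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the "indel (normal all)" row of the table.
p, r, f = precision_recall_f1(tp=30548, fp=13853, fn=8899)
print(f"{p:.2%} {r:.2%} {f:.2%}")  # prints "68.80% 77.44% 72.87%"
```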
Would you be interested in trying to incorporate indels as well?
Yes, doing that in the next version.
Hello,
I am working with HCC1395 data, analyzing the tumor sample at 75x coverage and the normal sample at 45x coverage. I ran Clair3 on normal.bam to generate normal.vcf, used that file to phase and haplotag tumor.bam, and then ran a somatic mutation caller. The results showed a notable decrease in false positives.
In one instance where false positives became true negatives, the mutations were heterozygous in the normal sample but homozygous in the tumor sample. This suggests a loss of heterozygosity (LOH) event, in which case phasing and tagging most reads into the same haplotype seems correct. Have you considered this method?
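As a rough illustration of that observation, the check below flags a site as an LOH candidate when the normal genotype is heterozygous and the tumor genotype is homozygous. It assumes genotypes are given as VCF-style GT strings such as `0/1` or `1|1`; the function names are hypothetical, and a real LOH call would need more evidence than a genotype comparison alone.

```python
import re


def _alleles(gt):
    """Split a VCF-style GT string such as '0/1' or '0|1' into its alleles."""
    return re.split(r"[/|]", gt)


def is_heterozygous(gt):
    alleles = _alleles(gt)
    return "." not in alleles and len(set(alleles)) > 1


def is_homozygous(gt):
    alleles = _alleles(gt)
    return "." not in alleles and len(set(alleles)) == 1


def is_loh_candidate(normal_gt, tumor_gt):
    """Heterozygous in normal, homozygous in tumor (simplified criterion)."""
    return is_heterozygous(normal_gt) and is_homozygous(tumor_gt)


print(is_loh_candidate("0/1", "1/1"))  # prints "True"
print(is_loh_candidate("0/1", "0/1"))  # prints "False"
```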
Moreover, I noted in the literature that the primary reason for choosing LongPhase for phasing is its speed. We also have a speed advantage in haplotagging. ClairS parallelizes at the chromosome level, and we can add a feature to restrict haplotagging to a specified range. Could this reduce your training costs? I also ran a haplotag test, and the results do not appear to show any significant differences.