HKU-BAL / ClairS

ClairS - a deep-learning method for long-read somatic small variant calling
BSD 3-Clause "New" or "Revised" License
67 stars 7 forks source link

Nondeterminism in ClairS output #11

Closed xingyaoc closed 1 year ago

xingyaoc commented 1 year ago

This has already been resolved by I am sharing the email thread with @zhengzhenxian about this issue here.

ClairS returns slightly different results depending on the order in which bam files are being merged with samtools merge. While samtools merge does return merged bams with dissimilar ordering, I do not think it should affect variant calling. Could you help me understand a). how these functionally identical bam files are causing differences in ClairS output, and b). how significant are these differences?

Reply: Thank you for providing us with the data. Based on the data you provided, we believe that the issue is due to some of the merged BAMs sharing the same read names(read name and count are shown in the figure). We used "samtools mpileup" to parse all input alignments, and samtools gave different pileup results for different merged ordering, resulting in different variant calling results in Clair3 and ClairS.

Unfortunately, we are unable to fix the issue as it falls under the logic of samtools. However, we have two suggestions that may be helpful. The first is to add a unique prefix for each BAM that needs to be merged. The second is to sort the BAM file names in a specific order, such as alphabetical order, before merging. In our testing, if all read names are unique, the calling gave the same result in multiple runs.

Based on your benchmarking data, nearly 50% of the read names have more than one read support, which has contributed to 34 different calls. We suggest implementing one of the suggestions above to ensure that the merged BAMs have unique read names and therefore produce consistent variant calling results.

Please let us know if you have any further questions or concerns.

Thanks for your quick response to my issue. My team and I are taking your advice to try to force deterministic ClairS outputs in our pipeline. Taking your first suggestion, my team has filtered out non-unique reads and verified with BamUtil diff that two separate bam files are functionally identical -- multiple reads mapped to the same position are still ordered differently. However, calling ClairS on these two bam files again returned different results. Also, I was not able to detect any differences between the pileup files from either the output temp files or my own mpileup outputs. Can you please advise us on the source of the nondeterminism? We do believe that sorting the bam files before merging would work and we are in the process of implementing it in our pipeline, but my team would like to understand the root cause so that we can properly assess the validity of our results.

Regarding the issue of nondeterminism, I would like to clarify that it stems from samtools itself. Even after removing duplicated reads, we still observe differences in the read disorder. To investigate this further, I examined the Clair3 results in ${OUTPUT_DIR}/tmp/clair3_output/clair3_tumor_output/merge_output.vcf.gz and compared them with the samtools mplieup result of the first different record (chr1:143185622). Unfortunately, I found that the output is still different due to the different read order(figure 1 and figure2).

The issue with the read order affects the ClairS VCF result because we encode the alignment into a spatial read-level representation in our full-alignment model. Different read orders lead to different input arrangements and thus giving different outputs for some variants.

To address this issue, I would still recommend fixing the BAM order in the merging step to avoid distinct read orders. Please feel free to let us know if you have any further questions.

zhengzhenxian commented 1 year ago

@xingyaoc,

Thanks for reporting the issue! We confirmed that the non-determinism itself was from samtools merge. The issue happens when the order of input BAMs to samtools merge changes. Please try using the same order of input BAM filenames.

Zhenxian