RabbitBio / RabbitTClust

RabbitTClust: enabling fast clustering analysis of millions bacteria genomes with MinHash sketches

Discrepancies Found between RabbitTClust and NCBI Clustering Result #12

Open amyliufda opened 3 months ago

amyliufda commented 3 months ago

In recent runs against the latest NCBI dataset of Listeria, we've observed large discrepancies between RabbitTClust and NCBI clustering results. Here are a few examples.

  1. With a distance threshold < 0.0003, SRR2051098 is clustered with 34 other isolates in the NCBI result, https://www.ncbi.nlm.nih.gov/pathogens/tree/#Listeria/PDG000000001.3630/PDS000003342.11?accessions=PDT000066179.2, while RabbitTClust does not cluster it with those 34 isolates but with other isolates that are mostly from a different NCBI cluster.
  2. With a distance threshold < 0.0003, SRR4416146 is clustered with 18 other isolates in the NCBI result, https://www.ncbi.nlm.nih.gov/pathogens/tree/#Listeria/PDG000000001.3630/PDS000003335.20?accessions=PDT000151961.2, while in RabbitTClust it forms a singleton cluster, not grouped with any other isolate.
  3. With a distance threshold >= 0.0003, SRR2051098 is in a big cluster with thousands of other isolates, and SRR4416146 is still a singleton.

We understand that different thresholds produce different results. However, we did not expect such big differences between NCBI and RabbitTClust. SRR2051098 and SRR4416146 are just two random examples, and there are others like them. Is there an explanation for why the RabbitTClust results are so far off from the NCBI results? Thank you.

XiaomingXu1995 commented 3 weeks ago

Sorry for the delayed response.

You mentioned that the distance threshold is about 0.0003, which indicates that the genomes being clustered are highly similar (high ANI) across all pairs.

RabbitTClust measures the distance and similarity between genomes using the MinHash sketch strategy. The MinHash algorithm is a type of Locality-Sensitive Hashing algorithm used to estimate distances between genomes. Besides the distance threshold, the k-mer size and sketch size (options -k and -s) also affect the clustering results. When the genomes being clustered are highly similar, the distinction between clusters may be blurred by the estimation errors inherent in the MinHash algorithm. Increasing the sketch size can mitigate this issue but will result in a longer runtime.
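
To give a rough sense of how the sketch size relates to the estimation error near a threshold of 0.0003, here is a minimal Python sketch. It is not taken from RabbitTClust: it assumes the distance is the Mash distance D = -ln(2j / (1 + j)) / k computed from a Jaccard estimate j, that the Jaccard estimate behaves like a binomial proportion over the sketch, and that k = 21. The function names and the k value are illustrative assumptions only.

```python
import math


def mash_distance(jaccard: float, k: int) -> float:
    """Mash distance for a given Jaccard index (assumed distance formula)."""
    if jaccard <= 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * jaccard / (1.0 + jaccard)) / k)


def jaccard_from_distance(distance: float, k: int) -> float:
    """Invert the Mash distance formula to get the corresponding Jaccard index."""
    x = math.exp(-k * distance)
    return x / (2.0 - x)


def distance_std_error(distance: float, k: int, sketch_size: int) -> float:
    """Approximate standard error of the estimated distance.

    Assumption: the Jaccard estimate is treated as a binomial proportion
    over `sketch_size` hashes, and the error is propagated through the
    Mash formula with a first-order (delta method) approximation.
    """
    j = jaccard_from_distance(distance, k)
    se_jaccard = math.sqrt(j * (1.0 - j) / sketch_size)
    # |dD/dj| = 1 / (k * j * (1 + j)) for D = -ln(2j / (1 + j)) / k
    return se_jaccard / (k * j * (1.0 + j))


if __name__ == "__main__":
    K = 21             # assumed k-mer size, for illustration only
    THRESHOLD = 0.0003  # distance threshold discussed in this thread
    for s in (1_000, 10_000, 100_000):
        se = distance_std_error(THRESHOLD, K, s)
        print(f"sketch size {s:>7}: approx. std error {se:.2e} "
              f"({se / THRESHOLD:.0%} of the threshold)")
```

Under these assumptions the error shrinks only with the square root of the sketch size, so a tenfold larger sketch reduces it by a factor of about 3.2. That may help when choosing a value for -s before rerunning the clustering.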

In contrast, FastANI claims to achieve higher accuracy when dealing with genomes that have high similarities across all pairs compared to the MinHash algorithm. If increasing the sketch size does not significantly improve your results, you might want to try FastANI.

Best, Xiaoming

amyliufda commented 3 weeks ago

Thank you, Xiaoming, for your response. We did explore different k-mer sizes and sketch sizes, but the results were still not close. We may explore FastANI next, as you suggested. Thanks!