Open amyliufda opened 8 months ago
Sorry for the delayed response.
You mentioned that the distance threshold is about 0.0003, indicating that the genomes used for clustering have high similarities (ANIs) for all pairs.
RabbitTClust measures the distance and similarity between genomes using the MinHash sketch strategy. MinHash is a Locality-Sensitive Hashing algorithm used to estimate distances between genomes. Besides the distance threshold, the k-mer size and sketch size (options -k and -s) also affect the clustering results. When the genomes being clustered are highly similar, the distinction between clusters may be blurred by the estimation error inherent in the MinHash algorithm. Increasing the sketch size can mitigate this issue but will result in a longer runtime.
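To give a rough sense of how the sketch size limits resolution near a threshold of 0.0003, here is a minimal Python sketch. It is not RabbitTClust code. It assumes a Mash-style distance, D = -(1/k) ln(2j / (1 + j)), computed from a Jaccard estimate j, and it approximates the sampling error of j with a simple binomial standard error. The value k = 21 and the sketch sizes in the loop are illustrative assumptions, not settings taken from this thread, and the function names are hypothetical.

```python
import math


def mash_distance(jaccard: float, k: int) -> float:
    """Mash-style distance from a Jaccard estimate (assumed formula)."""
    if jaccard <= 0.0:
        return 1.0
    return min(1.0, -math.log(2.0 * jaccard / (1.0 + jaccard)) / k)


def jaccard_for_distance(distance: float, k: int) -> float:
    """Invert the formula above: the Jaccard index that yields `distance`."""
    x = math.exp(-k * distance)
    return x / (2.0 - x)


def distance_interval(distance: float, k: int, sketch_size: int, z: float = 1.96):
    """Approximate range of estimated distances around a true distance.

    Uses a binomial standard error for the Jaccard estimate, which is a
    simplification of the real MinHash sampling behaviour.
    """
    j = jaccard_for_distance(distance, k)
    se = math.sqrt(j * (1.0 - j) / sketch_size)
    j_low = max(j - z * se, 1e-12)
    j_high = min(j + z * se, 1.0)
    # A higher Jaccard estimate gives a lower distance, and vice versa.
    return mash_distance(j_high, k), mash_distance(j_low, k)


if __name__ == "__main__":
    for s in (1000, 10000, 100000):  # illustrative sketch sizes
        low, high = distance_interval(0.0003, k=21, sketch_size=s)
        print(f"sketch size {s}: estimated distance roughly in [{low:.6f}, {high:.6f}]")
```

Under these assumptions, the range of plausible estimates around 0.0003 narrows as the sketch size grows, which is why a larger -s value can help separate very similar genomes.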
In contrast, FastANI claims to achieve higher accuracy than the MinHash algorithm when all pairs of genomes are highly similar. If increasing the sketch size does not significantly improve your results, you might want to try FastANI.
Best, Xiaoming
Thank you, Xiaoming, for your response. We did explore different k-mer sizes and sketch sizes, but the results were still not close. We may explore FastANI next, as you suggested. Thanks!
In recent runs against the latest NCBI Listeria dataset, we've observed large discrepancies between the RabbitTClust and NCBI clustering results. Here are a few examples.
We understand that different thresholds produce different results. However, differences this large between NCBI and RabbitTClust are not what we expected. SRR2051098 and SRR4416146 are just two random examples, and there are others like them. Is there an explanation for why the RabbitTClust results differ so much from the NCBI results? Thank you.