Closed jianshu93 closed 1 year ago
Hello @jianshu93: no, Mutation rate using FracMinHash is not a metric.
Reasons: our theory (and code) computes mutation rate from containment index, not Jaccard index. Given two sets A and B, containment index C(A,B) does not necessarily equal to C(B,A). We tried to compute mutation rate from Jaccard in a similar way, but the mathematics simply gets too complicated.
For your information, this code has been implemented in sourmash(https://sourmash.readthedocs.io/), which now can compute ani (or mutation rate) using FracMinHash sketches. The software package also reports that it is not a metric in the latest version.
Another related question is, k-bottom sketch (https://dl.acm.org/doi/abs/10.1145/1281100.1281133) seems to be the reason used in Mash that leads to the set size bias. What if we stick to the original random permutation instead of using k-bottom sketch idea despite slower. The set size problem still exists right?
Thanks,
Jianshu
I am not sure I understand the question completely. Our theory (link) proves that the bias in FracMinHash exists even for a perfect hash function. The bias arises from the probability of ending up with an empty sketch. Does that answer your question?
Yes!this is why the bottom sketch is problematic!
Many thanks!
Jianshu
Dear FracMinHash team,
I am wondering the scaled sketch approach in FracMinHash is a metric operation, meaning, is FracMinHash (A, B) the same with FracMinHash (B, A), or FracMinHash (A, B) + FracMinHash (B, C) >= FracMinHash (A, C), that is the last 2 rules of metric space: https://en.wikipedia.org/wiki/Metric_space. Since MinHash is unbiased estimation of Jaccard index (a metric) if no big difference in set size (as in genomes, we require a high completeness et.al.), it is a metric. But I am not sure whether FracMinHash estimated mutation rate is also a metric distance.
Many Thanks,
Jianshu