Is FracMinHash a metric

KoslickiLab / mutation-rate-ci-calculator

This software calculates a confidence interval for the mutation rate from a set of observed containment indices under a simple nucleotide mutation process.

Other

7 stars 1 forks source link

Is FracMinHash a metric #12

Closed jianshu93 closed 1 year ago

jianshu93 commented 1 year ago

Dear FracMinHash team,

I am wondering the scaled sketch approach in FracMinHash is a metric operation, meaning, is FracMinHash (A, B) the same with FracMinHash (B, A), or FracMinHash (A, B) + FracMinHash (B, C) >= FracMinHash (A, C), that is the last 2 rules of metric space: https://en.wikipedia.org/wiki/Metric_space. Since MinHash is unbiased estimation of Jaccard index (a metric) if no big difference in set size (as in genomes, we require a high completeness et.al.), it is a metric. But I am not sure whether FracMinHash estimated mutation rate is also a metric distance.

Many Thanks,

Jianshu

mahmudhera commented 1 year ago

Hello @jianshu93: no, Mutation rate using FracMinHash is not a metric.

Reasons: our theory (and code) computes mutation rate from containment index, not Jaccard index. Given two sets A and B, containment index C(A,B) does not necessarily equal to C(B,A). We tried to compute mutation rate from Jaccard in a similar way, but the mathematics simply gets too complicated.

For your information, this code has been implemented in sourmash(https://sourmash.readthedocs.io/), which now can compute ani (or mutation rate) using FracMinHash sketches. The software package also reports that it is not a metric in the latest version.

jianshu93 commented 1 year ago

Another related question is, k-bottom sketch (https://dl.acm.org/doi/abs/10.1145/1281100.1281133) seems to be the reason used in Mash that leads to the set size bias. What if we stick to the original random permutation instead of using k-bottom sketch idea despite slower. The set size problem still exists right?

Thanks,

Jianshu

mahmudhera commented 1 year ago

I am not sure I understand the question completely. Our theory (link) proves that the bias in FracMinHash exists even for a perfect hash function. The bias arises from the probability of ending up with an empty sketch. Does that answer your question?

jianshu93 commented 1 year ago

Yes！this is why the bottom sketch is problematic！

Many thanks！

Jianshu