dkoslicki / CMash

Fast and accurate set similarity estimation via containment min hash
BSD 3-Clause "New" or "Revised" License
42 stars 9 forks source link

Is CMash metric? #32

Open jianshu93 opened 1 year ago

jianshu93 commented 1 year ago

Dear CMash team,

I am wondering the CMash is a metric, meaning, is CMash (A, B) the same with CMash (B, A), or CMash (A, B) + CMash (B, C) >= CMash (A, C), that is the last 2 rules of metric space: https://en.wikipedia.org/wiki/Metric_space. Since MinHash is unbiased estimation of Jaccard index (a metric), it is a metric. But I am not sure whether CMash estimated mutation rate is also a metric distance.

Many Thanks,

Jianshu

ShaopengLiu1 commented 1 year ago

Hi Jianshu,

The Jaccard index is symmetric: J(A, B) = J(B, A). However, I don't think the Jaccard index fits the 2nd rule. Here is a counter-example:

  1. let set A = set C, and set B has NO overlap with set A or C
  2. now J(A,B) + J(B,C) = 0 since there is no overlapped element
  3. but J(A, C) = 1

For the containment index we estimated in CMash, it's NOT symmetric by definition: C(A, B) = containment of set A in set B = $\frac{A \cap B}{B}$ C(B, A) = containment of set B in set A = $\frac{A \cap B}{A}$

The Jaccard index can be used for distance measurement. Please note that the Jaccard index J(A, B) and the containment index C(A, B) or C(B, A) are interchangeable if we know the set size of A and B.

jianshu93 commented 1 year ago

Hello Shaopeng,

Yes Jaccard index itself is not metric, sorry for the bad description, Jaccard distance, or 1-Jaccard index, is a proper metric. So my question would be will 1-CMash distance/index a metric? Again still the mentioned 2 rules considered. According to you description, it is not symmetric, therefore not metric. Just to confirm. With the metric property, there many other aspects, e.g., nearest neighbor search algorithms, can be applied/combined with the Minhash like algorithms for even faster search and retrieval of similar information in genomic database.

Thanks,

Jianshu

jianshu93 commented 1 year ago

Hello Shaopeng,

A quick question on how CMash estimate the intersection of kmer in two genomes A and B, especially for large number of genome pairs. Also the number of unique kmers (cardinality) in set A and B is calculated for large number of genomes. Yes I agree that when set size A and B differ by quite a lot, using C(A,B) to calculated J(A,B) is more robust than classic MinHash, but the estimation of intersection of A and B, or cardinality of A and B based on sketching algorithms is never easy for large scale applications. New HyperLogLog estimators from Ertl 2017 (especially the MLE methods), is thought to be the most accurate estimator of cardinality and also intersections or unions. What is the RMSE of JS based on JC, assuming the Flajolet 2007 hyperloglog is used, compare to MinHash, especially for small Jaccard? MinHash is sqrt(J(1-J)/m), while cardinality estimator (Flajolet HLL) itself is already sqrt(1.079/m), assuming using the same registers m when hashing. For small Jaccard like 0.01, MinHash is 100 times more accurate assuming intersection is perfect (which is not and even worse if use cardinality from Flajolet HLL of A and B for intersection). MinHash is very popular because its very small variance, this is important because small jaccard like 0.01 or 0.05 correspond to, 79% to 85% mutation rate (ANI, see Mash distance equation), among which, JS from JC will be significantly less accurate than Minhash. At the same time, we compare genomes only when they are complete, meaning they do not differ by size more than 10% (based on the quality of genomes), otherwise the genome is not trusted.

Thanks,

Jianshu

ShaopengLiu1 commented 1 year ago

Hi Jianshu,

Sorry for my late response.

To your 1st post: I don't think containment index OR 1 - containment index is appropriate to be used for distance matrix. You can try to transfer them to ANI, which might be better to fit this idea. For example:

  1. use Mash distance
  2. use FastANI

To your 2nd post: We use hyperloglog to estimate cardinality.

The way to estimate intersection is similar to Fig 2 in the Mash Screen paper. It's called "streaming method". Here is a brief workflow:

  1. build MinHash sketches for all reference genomes
  2. store those sketches into a ternary search tree (like Fig 2E, you get a union of k-mers in all sketches)
  3. loop through all k-mers from the query data against the ternary search tree (or a union table)
  4. (optional) you can accerlerate step3 by using a bloom filter to exclude k-mers that are NOT in the union table.
  5. after the looping, you get all containment hits together.

Shaopeng