Kingsford-Group / omhismb2019

23 stars 8 forks source link

is omh metric? #4

Open jianshu93 opened 1 year ago

jianshu93 commented 1 year ago

Hello Team,

several questions related to omh and edit distance:

  1. For edit distance in the edlib (https://github.com/Martinsos/edlib), there can be global, infix and prefix method (e.g., gaps opened at beginning or end of either sequences are not penalized or penalized, semi-global or global). For its sensitive hash, I think it should be global (gaps opened at beginning or end of either sequences are penalized) because all kmers of 2 sequences are considered.
  2. weighted jaccard are metric (https://en.wikipedia.org/wiki/Metric_space) so distance approximated by its LSH will also be metric. Is OMH also metric?

Thanks,

Jianshu

gmarcais commented 1 year ago

Hi Jianshu,

thank you for your questions regarding OMH.

  1. That's correct. OMH is a LSH sensitive (with a gap) to the (global) edit distance, not to a local alignment like Smith-Waterman.
  2. There are different LSH at play here. MinHash is an unbiased estimator of the Jaccard, in other words it is a gap-less LSH (LSH according to Charikar's definition). OMH is a gaped LSH (following Indik's definition): it does not directly estimate the edit distance. Rather, it can distinguish with high-probability poorly aligning sequence pairs from well aligning sequence pairs. The question whether OMH is metric is not really valid. It is not a measure at all.

Cheers

jianshu93 commented 1 year ago

Hello Guillaume,

Thanks for the quick response during holiday! For the second question, my question then would be the distance/similarity estimated by OMH (the value calculated by omh_compute between 2 sequences, there must be a value telling whether high-probability poorly aligning sequence pairs and well aligning sequence pairs can be differentiated, like Figure 4 in the paper, correct me If I am understanding the paper in a wrong way) metric? Thanks for pointing out the indik's definition. The reason I am asking is also related to nearest neighbor search, but not based on LSH, but based on tree-like or graph like space partitioning algorithms, which requires that the distance used must be metric. I want to feed the output from omh_compute (a distance) to graph or tree like algorithms like this one here, which requires the distance used must be metric (or at least close).

Thanks,

Jianshu

jianshu93 commented 1 year ago

Question on to what edit dissimilarity between 2 sequences, order Min Hash is still accurate. I can see that for 0.45 edit dissimilarity, it is still accurate and can differentiate those 2 sequences. Another question is, if 2 sequences are not the same length, say one is only 1/5 of another, then will order min hash can still accurately accproxiamte so called, semi-global alignment identity?

Thanks,

Jianshu