chrisjmccormick / MinHash

Example Python code for comparing documents using MinHash
MIT License

Did you create code for calculating all the similarity scores with the Min Hash approach? #4

Open SpaceDustPi opened 5 years ago

SpaceDustPi commented 5 years ago

Chris,

Thank you for the tutorial and code. I found yours to be one of the more understandable explanations. MinHashing seems extremely clever.

In your runMinHashExample, you commented that you used the direct calculation for the similarities, which I have no problem understanding. However, I have been searching for this "MinHash approach" to computing the similarities. I was wondering if you had written this, or if you could point me in the right direction. I have a huge dataset and I would like to use the MinHash approach.

I am assuming that this would be faster. I am also assuming that the storage code (for the triangle matrix) would remain the same.

Thank you, Ben

chrisjmccormick commented 5 years ago

Hey Ben, Thanks, really glad it was helpful!

The code for actually comparing the MinHash signatures is lines 329 to 362. It's written as a simple double for-loop, though, so there are probably faster implementations out there.
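Roughly, the logic looks like this (a simplified sketch rather than the exact code from the repo; the variable names here are placeholders, and the real version stores the results in the triangle matrix):

```python
# Simplified sketch: estimate Jaccard similarity between every pair of
# MinHash signatures. 'signatures[i]' is the list of hash values for doc i.
numDocs = len(signatures)
numHashes = len(signatures[0])

estJSim = {}  # (i, j) -> estimated Jaccard similarity, for j > i

for i in range(numDocs):
    for j in range(i + 1, numDocs):
        # Count how many of the hash components agree between the two signatures.
        matches = sum(1 for k in range(numHashes)
                      if signatures[i][k] == signatures[j][k])
        # The fraction of matching components estimates the Jaccard similarity.
        estJSim[(i, j)] = matches / float(numHashes)
```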

Roughly how big is your dataset? Millions of signatures? 10s of millions?

Thanks, Chris

SpaceDustPi commented 5 years ago

Thanks, Chris, I was actually able to figure that out. Sorry for the bother.

Btw, I was reading about LSH on an engineering blog that discusses using different hashing techniques in place of the random hash functions. I haven't gone through it yet, but from what I understand, this technique does not require a dictionary, and we don't need to iterate. I would love to know what you think: https://medium.com/engineering-brainly/locality-sensitive-hashing-explained-304eb39291e4 (it's under the section called "Hash part of MinHash").

Also, I was blown away by your use of the triangle matrix. I took linear algebra a long time ago and never really applied it. Upon reviewing, I saw that triangle matrices are easy to compute, so I am guessing you leveraged that property to save on computational costs.

Thank you, Ben

SpaceDustPi commented 5 years ago

Forgot to answer your question. My dataset is in the millions, but I was planning to use bits and pieces of it, unless you think this could handle millions. I will probably introduce generator objects to it and whatnot.

SpaceDustPi commented 5 years ago

Sorry for all the messages, but I saw that you are only using a triangle matrix for storage. I'm a little hazy on how that makes it occupy less space. If you know of any resources that would help me understand, please point me in the right direction.

Thank you.

chrisjmccormick commented 5 years ago

Hi Ben, Hey! Sorry, I’ve been wanting to respond to this, but I’ve been busy!

When comparing every signature to every other signature, if you compute the full matrix you're going to calculate everything twice (a vs. b, then b vs. a), plus you'll compare every item to itself. The idea of the triangle matrix is that you only do each comparison once, and you don't compare items to themselves. This cuts your memory consumption and compute time roughly in half.
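If it helps to see the indexing concretely, here's a minimal sketch of the idea (names are illustrative, not necessarily what's in the repo): each pair (i, j) with i < j maps to a slot in a flat array of length numDocs * (numDocs - 1) / 2, so the full square matrix never gets allocated.

```python
# Sketch: index into a flattened upper triangle (diagonal excluded) of an
# implicit numDocs x numDocs matrix, so only numDocs*(numDocs-1)/2 slots
# are needed instead of numDocs^2.
def triangle_index(i, j, num_docs):
    if i > j:
        i, j = j, i
    # Slots used by rows 0..i-1, plus the offset of j within row i.
    return (i * num_docs) - (i * (i + 1)) // 2 + (j - i - 1)

# Example with num_docs = 4:
#   (0,1)->0, (0,2)->1, (0,3)->2, (1,2)->3, (1,3)->4, (2,3)->5
```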

Calculating the MinHash signatures for 1M items shouldn't be too bad, but doing the pairwise comparisons is going to be a lot of compute. It's roughly 1M^2 / 2, which is 500 billion comparisons. Even if one comparison takes 10us, that's still over 50 days of compute!
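Spelling that estimate out (the 10 us per comparison is just an assumed figure):

```python
# Back-of-the-envelope estimate for all-pairs comparison of 1M signatures.
num_docs = 1000000
num_pairs = num_docs * (num_docs - 1) // 2   # ~5e11 comparisons
secs_per_comparison = 10e-6                  # assumed 10 microseconds each
total_days = num_pairs * secs_per_comparison / 86400.0
print("%.2e comparisons, ~%.0f days of compute" % (num_pairs, total_days))
# -> 5.00e+11 comparisons, ~58 days of compute
```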

I wanted to tell you, our company Nearist has hardware designed specifically for accelerating large distance calculations, and it supports Hamming distance. Would you be interested in running your example on our hardware? It's still in beta, so it'd be good to get the feedback.

Thanks! Chris

SpaceDustPi commented 5 years ago

Chris,

No worries. I was in Spain eating tapas. I really enjoyed breaking down your code. What I failed to realize until today was that your code was a tutorial on MinHash, and not Locality Sensitive Hashing. Good lord, I once was lost and now I have been found.

Interesting that the triangle matrix is what ensures each comparison is done once rather than twice; I actually had no idea. Rather, I thought it saved space because a full rectangular matrix would take up more space (generally speaking, as a sparse matrix), which I suppose could still save space with my dataset, but that aspect is not nearly as helpful as doing each comparison only once. I'll have to digest this a little more to see how it facilitates the one-time comparisons.

For about 20K sentences, your code gets hung up and never completes. I would expect it to throw a memory error, but it seems to get stuck somewhere.

Sure, I could give it a go with your beta version. However, the data I am using is sensitive, so I would have to do it offline. I imagine this version uses LSH?

Thanks, Ben