LucaCappelletti94 / minhash-rs

A Rust implementation of MinHash trying to be parsimonious with memory.
MIT License
5 stars 0 forks source link

any plans to continue development? #1

Closed chris-ha458 closed 2 weeks ago

chris-ha458 commented 1 year ago

This crate looked very interesting but

After days of benchmarking, it seems the results are in and indeed MinHash does not seem to be ever preferable to HyperLogLog at a comparable memory requirement. This leads me to believe that you are not planning to continue development. is this true?

LucaCappelletti94 commented 1 year ago

Hi @chris-ha458, as you might have seen in the more recent update, I am now and then including some new features.

Minhash is not superior to hyperloglog for the use cases I was interested in. I implemented this and the other crate to evaluate which of the two approaches might have yielded the best performance when implemented as nicely as I and friends of mine possibly could.

Of course, MinHash still has its applications, such as detecting similar items. I was asked by a friend to add some methods to make that faster, so I did - do you happen to have some feature requests in mind?

chris-ha458 commented 1 year ago

I am interested in text deduplication and If i understand correctly hyperloglog is not particularly suitable for this purpose (It might be useful as a first pass so one could understand exact/minhash scaling requirements but I havent heard it being used as such)

I see that you use xorshift. I am not surprised. Minhash or other datasketches require a lot of independent / universal / perfect hashes (name depends on the paper or implementation you look at) and xorshift seems sound.

Recently however, double hashing caught my attention. This takes two hashes as a primitive (one could be a different keyed version of the other) and multiply, adds.

Pseudocode for atomic.rs

for i in (0..PERMUTATIONS) {
hash1 = hash1 + i * hash2
hash1
}

Some changes would be neccesary such as wrapping_add or making the construction an iterable to match the construction etc but we don't need all the xorshift related code here.

Of course if there is anyreason that you'd like to keep the current xorshift(mathematical distribution, speed, familiarity) that's up for you to decide. but if you'd like this version I might write up some code.

Obviously, my interest in this codebase does not stop at a specific hash choice, but because I'd like to extend it into something useful for text-dedup.

I am already using the python based https://github.com/ChenghaoMou/text-dedup or the rust/python based https://github.com/allenai/dolma and they both lack a rust based minhash implementation. My hope was that maybe this could fill that void with some development.

jianshu93 commented 2 weeks ago

I believe probminhash is what you need, there are traditional MinHash (bottom-K), one permutation MinHash with optimal densification, probminhash, superminhash, setsketch, everything is there.

Best,

Jianshu

LucaCappelletti94 commented 2 weeks ago

I am closing this issue as I am failing to understand the request - the implementation of MinHash is, from what I understand, decently complete. If you have specific requirements, please open an issue detailing what you want to add. Happy to discuss them.

MinHash and HyperLogLog treat documents as a bag of words, so I do not expect MinHash to outperform HLL++, primarily as it does not include the various corrections that HLL++ et similia do. More significantly, MinHash is much more compute-intensive than HLL.