ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://ekzhu.github.io/datasketch
MIT License
2.51k stars 294 forks source link

Merging (Identically Specified) MinHashLSH objects #205

Closed hsicsa closed 6 months ago

hsicsa commented 1 year ago

Background:

When using a dataflow pipeline, the input stream (e.g. of documents) can be split to increase parallelism & throughput across multiple machines. In order to do so, when computing the MinHashLSH of the stream, the MinHashLSH objects in two or more sub-streams must be merged.

Generally speaking, LSH implementations do not support merging two or more LSH states into a single LSH state.

However, MinHashLSH (as implemented in datasketch) does appear to be capable of doing so when all the parameters of the objects to be merged are identical. Beyond the threshold, # permutations of the minHash values, ad weights, when the number of hash tables and the associated set of hashvalue ranges are identical, and it seems reasonable to support the merging of the underlying hash tables.

Proposal:

Create a mergeMinHashLSH function that does the following:

Additionally, a mergeMinHashLSH method could be added to the MinHashLSH class to perform merging in place.

ekzhu commented 1 year ago

Thank! I think this is a good idea. It will be a useful classmethod for MinHashLSH class.

ekzhu commented 1 year ago

Are you interested in submitting a PR for this?

hsicsa commented 1 year ago

I'm interested but won't be able to do so until mid-April at the earliest.

ekzhu commented 1 year ago

Not urgent. I label this issue as "help wanted" in case anyone want to jump in before that.

On Fri, Mar 24, 2023 at 12:35 PM hsicsa @.***> wrote:

I'm interested but won't be able to do so until mid-April at the earliest.

— Reply to this email directly, view it on GitHub https://github.com/ekzhu/datasketch/issues/205#issuecomment-1483315397, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACOGLWW4TQP3T2DKDYD7IDW5XZPTANCNFSM6AAAAAAWCRUSTQ . You are receiving this because you commented.Message ID: @.***>

rupeshkumaar commented 10 months ago

Hi @ekzhu. Let me understand this when @hsicsa suggesting to merge identical MinHashLSH objects, it is combining all the keys from similar objects into one object and if the same key appears in multiple objects then in that case should it be merged?

ekzhu commented 9 months ago

Sorry I missed this... I am not quite following can you elaborate?

I believe the goal is to simply merge the hash tables of a list of MinHashLSH objects. Of course, I would also have a procedure to make sure the sets of keys of the MinHashLSH objects are disjoint, aka., the keys were partitioned when creating the list of MinHashLSH objects in the first place.

rupeshkumaar commented 8 months ago

I have taken a look at it, I wanted to understand what initialization parameters are we going to compare. I have implemented the eq and compared the type, threshold, and num_perms do I need to compare any other init parameters too? @ekzhu

ekzhu commented 8 months ago

You probably want to have an option to check if the keys stored in the hashtables are disjoint before merging them. It will be an expensive operation, so give the caller an option to choose.

rupeshkumaar commented 8 months ago

Okay so let me understand. There are actually two keys. One that is given by the user while inserting MinHash into the MinHashLSH object along with the MinHash object and the other key is Hashed one which is getting created by byteswap. Please can you tell me which one you are referring to? And the key can be found in the hashtables in the DictSetStorage. Now, I have already added a check where it checks for the key present in second MinHashLSH object should not be in the first MinHashLSH. if the key exist it will throw an appropriate exception but do I need to merge the keys if user wants it to be? Is that what you are referring to? I can submit a PR in sometime for you to look into.

ekzhu commented 8 months ago

Okay so let me understand. There are actually two keys. One that is given by the user while inserting MinHash into the MinHashLSH object along with the MinHash object and the other key is Hashed one which is getting created by byteswap. Please can you tell me which one you are referring to?

The user provided key for each MinHash object.

if the key exist it will throw an appropriate exception but do I need to merge the keys if user wants it to be?

Going back a bit. The scenario that a merge could go wrong is when the two MinHashLSH are created with different key --> minhash mappings. But in the subcase when the keys --> minhash mappings are different but the keys are disjoint, the user can still rely on the retrieved keys to decide on whether to filter it out -- it won't affect the recall. But if the keys are not disjoint and the key-->minhash mappings are different, then the user would have no way to tell whether the retrieved keys is what they want.

So, in a typical case we can just go ahead and merge the hash tables if the user does not request for a disjoint check. But in some cases when the user is given a MinHashLSH object they didn't produce, they might want to run a disjoint check to make sure they are not incorrectly merging the keys -- we should raise an exception.

So the disjoint check should only happen when the user is specifically asking for it, and by default it should be turned off.

hsicsa commented 6 months ago

Bravo! On Tuesday, March 12, 2024 at 08:45:03 a.m. PDT, Eric Zhu @.***> wrote:

Closed #205 as completed via a532f06.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>