ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://ekzhu.github.io/datasketch
MIT License

Advice for minhash with sparse dataset #193

Open mathephysicist opened 1 year ago

mathephysicist commented 1 year ago

I have a dataset that is very sparse. That is, it has multiple null fields and multiple variations of the same entity.

Essentially, the schema is FN, LN, field1, field2, ..., fieldk, ..., fieldN, and the records look like:

- Entity 1, record 1: filled, filled, null, null, ..., Value, null, ..., null
- Entity 1, record 2: filled (maybe a typo, or more info than the record above), filled, null, ..., Value (with a typo), (maybe this one is filled), ..., null
- Entity 1, record 3: filled (maybe a typo, or more info than the record above), filled, null, ..., Value2 (different from the value above), (maybe this one is filled), ..., null

Then we have other entities entirely.
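For concreteness, one way to turn a sparse record like the ones above into a token set is to drop the null fields and prefix each value with its column name, so the same value appearing in different columns produces distinct tokens. This is a minimal sketch; the column names and values are made up:

```python
def record_to_tokens(record: dict) -> set:
    """Drop null fields and prefix each value with its column name,
    so the same value in different columns yields distinct tokens."""
    return {f"{col}={val}" for col, val in record.items() if val is not None}

# Two noisy variants of the same entity (made-up values; note the FN typo).
entity1_v1 = {"FN": "Jane", "LN": "Doe", "field1": None, "field2": "Value"}
entity1_v2 = {"FN": "Jnae", "LN": "Doe", "field1": None, "field2": "Value"}

print(record_to_tokens(entity1_v1))  # three tokens; the null field is dropped
```

The two variants then share two of their three tokens, which is the kind of overlap a containment score can pick up on.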

I've been leveraging the MinHash LSH Ensemble code and have tried a few variations: indexing per column to deal with nulls better, concatenating all fields together with the word "null" (or just a space) for empty fields, and evaluating different containment scores. Each variation performs slightly better in some situations and slightly worse in others. Does anyone know of a better way to approach this type of problem, or can anyone recommend a resource for digging deeper into what might work here?

ekzhu commented 1 year ago

Thanks for posting the question. I think it would be great if you could clarify:

  1. What is the input to the Jaccard/containment similarity function, without using MinHash?
  2. What is the intended scenario: e.g., search, one-off similarity estimation, etc.?
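To make the first question concrete, here are the exact (non-MinHash) versions of the two measures on token sets. Computing these on a small labeled sample can show which one actually tracks "same entity"; the example sets are made up:

```python
def jaccard(a: set, b: set) -> float:
    """Symmetric: size of the intersection over size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def containment(query: set, target: set) -> float:
    """Asymmetric: fraction of the query's tokens found in the target."""
    return len(query & target) / len(query) if query else 0.0

a = {"FN=Jane", "LN=Doe", "field2=Value"}
b = {"FN=Jane", "LN=Doe", "field2=Value", "field3=Extra"}

print(jaccard(a, b))      # 0.75: 3 shared tokens out of 4 total
print(containment(a, b))  # 1.0: all of a's tokens appear in b
```

The asymmetry matters for sparse records: a mostly-null record can be fully contained in a richer record of the same entity even when their Jaccard similarity is low.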