-
Hello! I have a few questions and observations regarding the deduplication approach using MinHash in this repository. Specifically, I’m interested in some intuition around handling false positives and…
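For intuition on false positives, here is a minimal sketch of the standard LSH banding formula: with `bands` bands of `rows` hash values each, two documents with Jaccard similarity `s` become candidates with probability `1 - (1 - s**rows)**bands`. The function name and the band/row counts are illustrative assumptions, not this repository's settings.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two sets with Jaccard similarity s share at least one LSH band."""
    # One band matches only if all `rows` MinHash values in it agree: s ** rows.
    # The pair becomes a candidate if any of the `bands` bands matches.
    return 1.0 - (1.0 - s ** rows) ** bands


if __name__ == "__main__":
    # Hypothetical configuration: 20 bands x 5 rows = 100 hash values per signature.
    for s in (0.2, 0.5, 0.8, 0.95):
        print(f"similarity={s:.2f} -> candidate probability={candidate_probability(s, 20, 5):.4f}")
```

Pairs well below the intended threshold that still get a non-trivial probability here are the false positives; using more rows per band steepens the curve at the cost of more false negatives.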
-
Right now we have to do:
```bash
cargo bench -q -p daft-minhash --bench=minhash -- compare target/benchmarks/minhash
cargo bench -q -p daft-minhash --bench=windowed -- compare target/benchmarks/w…
```
-
Documentation here: http://ekzhu.com/datasketch/lsh.html#minhash-lsh-at-scale
- Remove the MinHashCustom class implementation
- Add MinHash LSH at scale: use a Redis database to store objects
-
I'm curious what the performance difference is between c-minhash and r-minhash. Are there any plans to implement the original c-minhash in this package?
-
I've been running some large-scale benchmarking with minhash deduplication on SLURM clusters, loosely following [this example](https://github.com/huggingface/datatrove/blob/main/examples/minhash_dedup…
-
I would like to get a minhash with alternative hash algorithms such as the first four bytes of SHA1 as implemented in https://github.com/bigcode-project/bigcode-dataset/blob/main/near_deduplication/mi…
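To make the request concrete, here is a minimal standard-library sketch of a MinHash signature whose base hash is the first four bytes of SHA1. The names `sha1_hash32`, `minhash_signature`, and `estimate_jaccard`, the little-endian byte order, and the permutation scheme are my own assumptions for illustration, not this package's API.

```python
import hashlib
import random
import struct

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the universal hash family
MAX_HASH = (1 << 32) - 1        # signature values are truncated to 32 bits


def sha1_hash32(data: bytes) -> int:
    """First four bytes of the SHA1 digest, read as a little-endian unsigned 32-bit integer."""
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def minhash_signature(tokens, num_perm=128, seed=1):
    """MinHash signature of a token collection using (a * h + b) mod p permutations."""
    rng = random.Random(seed)
    perms = [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_perm)
    ]
    signature = [MAX_HASH] * num_perm
    for token in set(tokens):
        h = sha1_hash32(token.encode("utf-8"))
        for i, (a, b) in enumerate(perms):
            value = ((a * h + b) % MERSENNE_PRIME) & MAX_HASH
            if value < signature[i]:
                signature[i] = value
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature("the quick brown fox jumps over the lazy dog".split())
    b = minhash_signature("the quick brown fox jumps over the lazy cat".split())
    print(estimate_jaccard(a, b))
```

Both signatures must be built with the same `seed` and `num_perm`, otherwise the positions are not comparable.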
-
This should just involve implementing the Hasher interface with a struct that produces the minhash of []float64.
-
Hi kmer-db team,
I was not able to find the minhash sketch size, only the filter fraction. What does this mean with respect to sketch size? It was very clear in all other minhash implementations suc…
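To illustrate what I am asking, here is a small sketch contrasting a fixed-size bottom-k sketch with a fraction-based filter. It assumes the filter fraction keeps every k-mer whose hash falls below `fraction` of the hash space; I am not certain this matches kmer-db's implementation, and all function names here are hypothetical.

```python
import hashlib

HASH_SPACE = 1 << 64  # hashes are taken from the first 8 bytes of SHA1


def kmers(seq: str, k: int) -> set:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer: str) -> int:
    """64-bit hash of a k-mer (first 8 bytes of SHA1, big-endian)."""
    return int.from_bytes(hashlib.sha1(kmer.encode("ascii")).digest()[:8], "big")


def bottom_k_sketch(kmer_set: set, size: int) -> list:
    """Fixed-size sketch: the `size` smallest hashes, regardless of input length."""
    return sorted(hash64(x) for x in kmer_set)[:size]


def fraction_filter(kmer_set: set, fraction: float) -> list:
    """Fraction-based sketch (assumed behaviour): keep hashes below fraction * HASH_SPACE."""
    threshold = int(fraction * HASH_SPACE)
    return sorted(h for h in (hash64(x) for x in kmer_set) if h < threshold)


if __name__ == "__main__":
    seq = "ACGTTGCAAGCTTAGGCTAACGTTAGCCGATAGCTAGCTAGGATCCGATCGATTACG"
    ks = kmers(seq, 5)
    print(len(ks), len(bottom_k_sketch(ks, 10)), len(fraction_filter(ks, 0.2)))
```

Under this assumption the sketch size is not fixed: it grows with the number of distinct k-mers, with an expected size of roughly `fraction * len(kmer_set)`.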
-
Noticed your comment on [Hacker News](https://news.ycombinator.com/item?id=33123972) about this repo. I worked on something similar at a previous job (so I don't have the code to share), but looked …
-