Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.35k stars 166 forks

Exposing number of bytes to keep & hashing algorithm in the expression minhash() #2958

Open MisterKloudy opened 1 month ago

MisterKloudy commented 1 month ago

I would like to compute a MinHash using alternative hash algorithms, such as the first four bytes of SHA1, as implemented in https://github.com/bigcode-project/bigcode-dataset/blob/main/near_deduplication/minhash_deduplication_spark.py

The deduplication rate is empirically much better with this PySpark implementation, which I am guessing is due to the higher collision rate caused by truncating the hash.
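
For illustration, the truncated-SHA1 idea can be sketched with the Python standard library alone. This is a rough sketch under stated assumptions, not Daft's `minhash()` and not a copy of the linked script: the helper names (`sha1_hash32`, `word_ngrams`, `make_permutations`, `minhash_signature`, `estimated_jaccard`), the little-endian byte order, the whitespace tokenization, and the `(a*h + b) mod p` permutation family are all chosen here for the example.

```python
import hashlib
import random
import struct

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the (a*h + b) permutation family (assumed)
MAX_HASH = (1 << 32) - 1        # keep 32 bits per signature slot (assumed)


def sha1_hash32(data):
    """Return the first four bytes of the SHA1 digest as an unsigned 32-bit int.

    Little-endian byte order is an assumption made for this sketch.
    """
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def word_ngrams(text, n):
    """Split on whitespace and return the set of word n-grams as UTF-8 bytes."""
    tokens = text.split()
    if not tokens:
        return set()
    if len(tokens) < n:
        return {" ".join(tokens).encode("utf-8")}
    return {
        " ".join(tokens[i:i + n]).encode("utf-8")
        for i in range(len(tokens) - n + 1)
    }


def make_permutations(num_perm, seed=42):
    """Draw (a, b) pairs for the hash family h -> (a*h + b) mod MERSENNE_PRIME."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_perm)
    ]


def minhash_signature(text, perms, ngram_size=3):
    """Compute one minimum per permutation over the truncated-SHA1 hashes."""
    hashes = [sha1_hash32(g) for g in word_ngrams(text, ngram_size)]
    if not hashes:
        # Empty input: fill every slot with the largest possible value.
        return [MAX_HASH] * len(perms)
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
        for a, b in perms
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots that agree, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Calling `minhash_signature(text, make_permutations(64))` on two documents and passing the results to `estimated_jaccard` gives a similarity estimate. Swapping `sha1_hash32` for another function, or slicing a different number of digest bytes, corresponds to the two knobs this issue asks to expose.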

jaychia commented 2 weeks ago

Has this been completed @andrewgazelka? I see #3052 has been merged!