Open MisterKloudy opened 1 month ago
I would like to compute MinHash with alternative hash algorithms, such as the first four bytes of SHA1, as implemented in https://github.com/bigcode-project/bigcode-dataset/blob/main/near_deduplication/minhash_deduplication_spark.py
The deduplication rate I observe with that PySpark implementation is much better, which I am guessing is due to the higher rate of collisions caused by truncating the hash.
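To make the request concrete, here is a minimal, standard-library sketch of what I mean by "the first four bytes of SHA1" as the base hash for MinHash. It is an illustration only, not the code from the linked script: the function names, the number of permutations, the seed, the little-endian byte order, and the `(a * h + b) mod p` permutation scheme are all assumptions made for this example.

```python
import hashlib
import random
import struct

# Assumed constants for this sketch: a Mersenne prime for the affine
# permutations and a 32-bit mask matching the truncated hash width.
MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH = (1 << 32) - 1


def sha1_hash32(data: bytes) -> int:
    """Return the first four bytes of the SHA-1 digest as an unsigned 32-bit int.

    Little-endian byte order is an assumption of this sketch.
    """
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def make_permutations(num_perm: int, seed: int = 42):
    """Generate (a, b) pairs for num_perm affine permutations h -> (a*h + b) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_perm)
    ]


def minhash_signature(tokens, perms):
    """Compute a MinHash signature over a collection of string tokens."""
    hashes = {sha1_hash32(t.encode("utf-8")) for t in tokens}
    if not hashes:
        # Empty input: fall back to the maximum value in every slot.
        return [MAX_HASH] * len(perms)
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
        for a, b in perms
    ]


def estimate_jaccard(sig_a, sig_b) -> float:
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    perms = make_permutations(num_perm=128)
    a = minhash_signature("the quick brown fox jumps".split(), perms)
    b = minhash_signature("the quick brown dog jumps".split(), perms)
    print(estimate_jaccard(a, b))
```

Because the base hash is only 32 bits wide, distinct tokens collide more often than with a 64-bit hash, which would push similarity estimates slightly upward. That is the effect I am guessing at above.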
Has this been completed @andrewgazelka? I see #3052 has been merged!