Eventual-Inc/Daft
Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
1.79k stars, 108 forks
Can you provide an example of large-scale text deduplication, such as the following examples?
#2235
Open
simplew2011 opened 3 weeks ago
simplew2011 commented 3 weeks ago
https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py
https://github.com/phdinds-aim/alis/blob/68c7f56a08fa5cfe10638ea45292914620c9f5cf/notebooks/lsh-for-minhash/05_demo_minhash_lsh.ipynb
https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/fuzzy_deduplication/README.md
https://xorbits.io/blogs/text-deduplicate
jaychia commented 3 weeks ago
Great idea! Let me work on something :)