bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Add exact document deduplicate scripts #407

Closed by SaulLu 2 years ago

SaulLu commented 2 years ago

These scripts were used to create seeds_batch_1_2_deduplicate_article_on_clean on the GCP bucket.
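
For readers unfamiliar with exact document deduplication, here is a minimal sketch of the general idea. It is not the code from this PR: the function name `exact_deduplicate` and the sample documents are hypothetical, and comparing SHA-256 hashes of the raw text is an assumption about how "exact" matches are detected.

```python
import hashlib
from typing import Iterable, Iterator


def exact_deduplicate(documents: Iterable[str]) -> Iterator[str]:
    """Yield each document the first time its exact text is seen.

    Hypothetical helper for illustration only; the scripts in this PR
    may hash, normalize, or batch documents differently.
    """
    seen_hashes = set()
    for text in documents:
        # Hash the raw text so only a fixed-size digest is kept in memory.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of an earlier document
        seen_hashes.add(digest)
        yield text


# Made-up sample documents; the third is an exact copy of the first.
docs = ["a b c", "d e f", "a b c", "A b c"]
deduplicated = list(exact_deduplicate(docs))
print(deduplicated)
```

Because the comparison is exact, documents that differ only in case or whitespace are kept as distinct; catching those would need a normalization step before hashing.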