Closed w32zhong closed 1 year ago
Hi, for deduplication, you need to install the deduplication code from https://github.com/google-research/deduplicate-text-datasets, as described in the installation instructions.
I'll look into hosting the processed dataset somewhere for convenience in testing!
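In the meantime, a quick way to confirm the tool is in place before starting a long preprocessing run is to check for the binary directly. This is only a minimal sketch: the relative path is copied from the error message reported in this thread, and the helper name `dedup_tool_available` and the assumption that the path is resolved against the current working directory are illustrative, not part of the cramming codebase.

```python
import os

# Path copied from the "not found" error in this thread; this is an assumption,
# so adjust it if your build places the binary elsewhere.
DEDUP_BINARY = os.path.join("cramming", "dedup", "release", "dedup_dataset")


def dedup_tool_available(base_dir=".", binary=DEDUP_BINARY):
    """Return True if the dedup binary exists under base_dir and is executable.

    Hypothetical helper for illustration only.
    """
    path = os.path.join(base_dir, binary)
    return os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    if dedup_tool_available():
        print("dedup tool found")
    else:
        print("dedup tool missing: build it from "
              "https://github.com/google-research/deduplicate-text-datasets first")
```

If this reports the tool as missing, the deduplication step will most likely fail with the same "not found" error.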
Thank you so much @JonasGeiping. Somehow I missed that part.
For everyone finding this issue later, please check this section: https://github.com/JonasGeiping/cramming#preprocessed-datasets
Hi, I am also trying to replicate the preprocessed c4 dataset. The default config has deduplicate_entries: True, but the "dedup tool" cannot be found; the error is: cramming/dedup/release/dedup_dataset: not found. I am wondering where to get the dedup tool and, if possible, whether the preprocessed c4 dataset can be downloaded somewhere?