JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License

preprocessed c4 dataset? #7

Closed: w32zhong closed this issue 1 year ago

w32zhong commented 1 year ago

Hi, I am also trying to replicate the preprocessed c4 dataset. The default config has deduplicate_entries: True, but the "dedup tool" cannot be found: cramming/dedup/release/dedup_dataset: not found.

Where can I get the dedup tool? And, if possible, could the preprocessed c4 dataset be made available for download somewhere?

JonasGeiping commented 1 year ago

Hi, for deduplication, you need to install the deduplication code from https://github.com/google-research/deduplicate-text-datasets, as described in the installation instructions.

I'll look into hosting the processed dataset somewhere for convenience in testing!
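
To catch a missing dedup tool before preprocessing starts, a pre-flight check along these lines may help. This is only a minimal sketch, not code from the cramming repository: the binary path is copied from the error message in this issue, and the helper names dedup_tool_available and resolve_deduplication are hypothetical.

```python
import os
from pathlib import Path

# Assumption: path copied from the error message in this issue;
# adjust it if your checkout or build output lives elsewhere.
DEDUP_BINARY = Path("cramming/dedup/release/dedup_dataset")


def dedup_tool_available(binary=DEDUP_BINARY):
    """Return True if the dedup binary exists as a file and is executable."""
    binary = Path(binary)
    return binary.is_file() and os.access(binary, os.X_OK)


def resolve_deduplication(deduplicate_entries, binary=DEDUP_BINARY):
    """Hypothetical helper: fail early if deduplication is requested
    but the dedup binary is missing, instead of failing mid-preprocessing."""
    if deduplicate_entries and not dedup_tool_available(binary):
        raise FileNotFoundError(
            f"Dedup tool not found at {binary}. Build it from "
            "https://github.com/google-research/deduplicate-text-datasets "
            "or set deduplicate_entries to False."
        )
    return deduplicate_entries
```

Calling resolve_deduplication(True) before preprocessing raises a FileNotFoundError with a pointer to the deduplication repository if the binary is absent, while resolve_deduplication(False) simply returns False and skips the check.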

w32zhong commented 1 year ago

Thank you so much, @JonasGeiping. Somehow I missed that part.

JonasGeiping commented 1 year ago

For everyone who finds this issue later, please check this section: https://github.com/JonasGeiping/cramming#preprocessed-datasets