bigscience-workshop / catalogue_data

Scripts to prepare catalogue data
Apache License 2.0

add filter for small docs in datasets #14

Closed HugoLaurencon closed 2 years ago

HugoLaurencon commented 2 years ago

This filter is aimed at lm_es_opus100 in particular, but it can be used for other datasets as well.

Num docs removed: 784079/1000000 (78.41%).
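For readers without the diff at hand, a small-document filter that reports in this format could look roughly like the sketch below. This is not the code from the PR: the names `is_long_enough`, `filter_small_docs`, and `removal_report`, the word-count criterion, and the `MIN_WORDS` threshold are all assumptions made for illustration, since the thread does not state the actual cutoff.

```python
from typing import List, Tuple

# Hypothetical threshold: the PR thread does not state the real cutoff.
MIN_WORDS = 5


def is_long_enough(text: str, min_words: int = MIN_WORDS) -> bool:
    """Return True if the document has at least `min_words` whitespace-separated tokens."""
    return len(text.split()) >= min_words


def filter_small_docs(docs: List[str], min_words: int = MIN_WORDS) -> Tuple[List[str], int]:
    """Drop documents below the threshold; return the kept docs and the number removed."""
    kept = [doc for doc in docs if is_long_enough(doc, min_words)]
    return kept, len(docs) - len(kept)


def removal_report(removed: int, total: int) -> str:
    """Format the removal count in the same style as the log line above."""
    pct = 100 * removed / total if total else 0.0
    return f"Num docs removed: {removed}/{total} ({pct:.2f}%)."


if __name__ == "__main__":
    sample = ["hola", "esto es un documento bastante largo", "ok gracias"]
    kept, removed = filter_small_docs(sample)
    print(removal_report(removed, len(sample)))
```

If the data lives in a Hugging Face `datasets.Dataset`, a predicate like `is_long_enough` could be passed to `Dataset.filter`, which accepts a `num_proc` argument for multiprocessing. Whether the PR does it that way is not stated in this thread.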

lvwerra commented 2 years ago

Looks good to me. I guess we can apply that to high-resource languages, but we should be careful with low-resource languages.

HugoLaurencon commented 2 years ago

LGTM! Did you re-run it to double-check that this works? I have a doubt about multiprocessing.

Yes, I tried multiprocessing again and it worked.