Closed by HugoLaurencon 2 years ago
Looks good to me. I guess we can apply that to high-resource languages, but we should be careful with low-resource languages.
LGTM! Did you re-run to double-check that this works? I have a doubt about multiprocessing.
Yes, I tried multiprocessing again and it worked.
Especially for lm_es_opus100, but it can be used for other datasets.
Num docs removed: 784079/1000000 (78.41%).
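For anyone checking the figure, the percentage in that log line follows directly from the two counts. A minimal sketch, assuming the line is just removed/total formatted to two decimals; the function name `format_removal_stats` is hypothetical and not taken from the PR's code:

```python
def format_removal_stats(num_removed: int, num_total: int) -> str:
    """Format a removal summary line from raw document counts.

    Hypothetical helper for illustration only; the PR's actual
    logging code may differ.
    """
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    # Share of documents removed, expressed as a percentage.
    pct = 100.0 * num_removed / num_total
    return f"Num docs removed: {num_removed}/{num_total} ({pct:.2f}%)."


if __name__ == "__main__":
    # Counts taken from the log line above.
    print(format_removal_stats(784079, 1000000))
```

With these counts, roughly 216k of the 1M sampled documents are kept, which is why the caution about low-resource languages matters.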