facebookresearch / stopes

A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.
https://facebookresearch.github.io/stopes/
MIT License

NLLB mined data? #56

Closed by gordicaleksa 1 year ago

gordicaleksa commented 1 year ago

Hi!

Did you ever release the mined data behind the NLLB project? (As mentioned in Section 5.4 of the paper, that's roughly 1.1B sentence pairs.)

Thank you!

gordicaleksa commented 1 year ago

Apologies, I missed the metadata README that you had already released.

I also found that AllenAI replicated the dataset and released it on Hugging Face: https://huggingface.co/datasets/allenai/nllb