Dataset Problem. - Githubissues

facebookresearch / flores

Facebook Low Resource (FLoRes) MT Benchmark


Dataset Problem. #58

Closed ShadowVicky closed 1 year ago

ShadowVicky commented 1 year ago

In the paper, you wrote that for the Assamese language you have 738k mono text and 43.7k bitext. But we are getting only 1912 Assamese-English pairs. Could you please provide the whole dataset, i.e. the 738k mono text and the 43.7k bitext? It would be really helpful for us. Thank you in advance.

gwenzek commented 1 year ago

Hi, which paper are you referring to? Where are you downloading the data from? I think there is some confusion between train and test data. With the NLLB200 paper we shared some training data extracted from a web corpus. You can download it from here: https://huggingface.co/datasets/allenai/nllb

The 1912 bitext is probably the dev + devtest portion of the Flores200 dataset. Those translations are meant for evaluation, not training.

ShadowVicky commented 1 year ago

As mentioned in the abstract of "The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation", Flores-101 provides 3,001 English sentences translated into other languages (including Assamese). On downloading it from (1) https://github.com/facebookresearch/flores/tree/main/flores200 and (2) https://huggingface.co/datasets/gsarti/flores_101, we get two sets, dev and devtest, with 997 and 1012 sentences respectively for various languages. The paper also mentions a 43K bitext (Assamese, Bitext w/ En) and 738K mono text.

Question: How can we get the 43K bitext, 738K monotext and the 3,001 benchmark set?

gwenzek commented 1 year ago

"we get two sets: dev and devtest, each with 997 and 1012 sentences for various languages."

That's expected. The test set is secret and you won't be able to download it. That's why you only have 2000 sentences and not 3000.
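The split arithmetic above can be checked with a minimal sketch. It assumes the dev, devtest and test splits are disjoint and together make up the 3,001 sentences quoted from the abstract; the names `TOTAL_SENTENCES`, `PUBLIC_SPLITS` and `hidden_test_size` are illustrative and not part of the flores repository.

```python
# Sizes as quoted in this thread: the total comes from the paper abstract,
# the per-split counts from the downloaded dev and devtest files.
TOTAL_SENTENCES = 3001
PUBLIC_SPLITS = {"dev": 997, "devtest": 1012}


def hidden_test_size(total: int, public_splits: dict) -> int:
    """Return how many sentences remain for the non-public test split.

    Assumes the splits are disjoint and sum to `total`.
    """
    return total - sum(public_splits.values())


public_total = sum(PUBLIC_SPLITS.values())
print(f"publicly downloadable sentences: {public_total}")
print(f"held-out test sentences: {hidden_test_size(TOTAL_SENTENCES, PUBLIC_SPLITS)}")
```

Under that assumption, the public portion comes to 2009 sentences, which is the "2000 and not 3000" mentioned above, and the remainder is the held-out test set.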

The bitext mentioned in this paper can be found on statmt.org: https://data.statmt.org/cc-matrix/