IndexError processing translate-train vs translate-test datasets

gowtham1997 commented 3 years ago

I downloaded the translate-train and translate-test datasets from the links in the Translate-Train and Translate-Test Data section of the README.

I am trying to train QA models on the translated SQuAD datasets in the translate-train folder.

import transformers
print(transformers.__version__) # outputs '4.0.0'

from transformers.data.processors.squad import SquadV1Processor
processor = SquadV1Processor()

examples = processor.get_train_examples('mlqa-translate-train/', filename='hi_squad-translate-train-train-v1.1.json')

When I run the above code to get train examples from the translate-train JSON files, I get an IndexError (list index out of range), while the same code works for the files in mlqa-translate-test.

Do you happen to know why this is happening?

patrick-s-h-lewis commented 3 years ago

Hi, this looks like a problem with transformers, not with MLQA?

gowtham1997 commented 3 years ago

Yes, not an issue with MLQA :). This is a general question on the provided translated datasets.

Since the datasets are translated from SQuAD and keep the SQuAD format, I tried the standard SQuAD processor. It fails on the Translate-Train datasets but works on Translate-Test. The same code also works for other multilingual datasets such as TyDi QA and XQuAD.

I will check if this is related to the dataset or the library.

patrick-s-h-lewis commented 3 years ago

Feel free to circle back if there is something up with the data that causes HF to break. The automatically translated datasets are a bit noisy, so it's possible there are some things that are hard for systems to parse and use.
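
One way to check whether the data is the culprit is to scan a translate-train file for entries that a SQuAD v1.1 style loader might choke on. The sketch below is only a diagnostic under assumptions: the helper names `find_suspect_qas` and `load_and_check` are hypothetical, the file is assumed to follow the standard SQuAD JSON layout (`data` / `paragraphs` / `qas` / `answers`), and the three conditions it flags are guesses at what could trigger the error, not a cause confirmed in this thread.

```python
import json


def find_suspect_qas(squad_dict):
    """Return (qa_id, reason) pairs for QA entries that look malformed.

    Hypothetical helper: the checks are assumptions about what may break
    a SQuAD v1.1 loader, not a confirmed diagnosis of the IndexError.
    """
    suspects = []
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                if not answers:
                    # A loader that reads answers[0] would fail here.
                    suspects.append((qa["id"], "no answers"))
                    continue
                answer = answers[0]
                start = answer["answer_start"]
                if start < 0 or start >= len(context):
                    # The character offset points outside the context string.
                    suspects.append((qa["id"], "answer_start outside context"))
                elif context[start:start + len(answer["text"])] != answer["text"]:
                    # The offset is in range but the span does not match the text.
                    suspects.append((qa["id"], "answer text not found at answer_start"))
    return suspects


def load_and_check(path):
    """Load a SQuAD-format JSON file and return its suspect entries."""
    with open(path, encoding="utf-8") as f:
        return find_suspect_qas(json.load(f))
```

Calling `load_and_check('mlqa-translate-train/hi_squad-translate-train-train-v1.1.json')` on the file from the original post would list any such entries. An empty result would suggest looking at the library side instead.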

cn-boop commented 3 years ago

Have you solved this problem? If so, how? Thanks.
