How to add other language dataset

icoxfog417 commented 4 years ago

First of all, thank you for the great work on evaluating multilingual QA.

I would like to evaluate Japanese performance in the same setting as MLQA. From Figure 1 of "MLQA: Evaluating Cross-lingual Extractive Question Answering", I think there are two ways to add QA data for another language.

1. Translate only the English questions into the target language (right part of the figure only).
2. Collect parallel Wikipedia sentences/paragraphs (b, c) and formulate questions from c (left to right).
   - To take this approach, I would need access to the 385,396 4-way parallel sentences (Table 9) in order to gather the Japanese Wikipedia articles.

Which is preferable, or is there a better way than the above?

patrick-s-h-lewis commented 4 years ago

Hi icoxfog417,

Thanks for your interest in the project! You can certainly add QA data in Japanese, but it may not result in as many datapoints as our other languages, and it will be quite a lot of work.

The easy way (less work, but probably not many Japanese instances can be created this way): First, you'll need to use LASER to find Japanese sentences b_ja that are parallel with the sentences b_en in the existing MLQA English data. Extracting the paragraphs that contain b_ja will give you c_ja. Once you've done this, you can translate the questions q_en into Japanese to get q_ja. Finally, you'll need to run an annotation campaign to annotate the Japanese answers a_ja. This will result in Japanese being "added on" to the existing MLQA data (making 5-way parallel instances). The issue is that you may not end up with many instances, because the parallel mining step may find only a small number of parallel sentences between English and Japanese.
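As a rough illustration of the mining step, here is a minimal sketch that assumes sentence embeddings for b_en and the candidate Japanese sentences have already been computed (for example with LASER) and are available as plain lists of floats. The function names, the nearest-neighbour search, and the `threshold` value of 0.8 are all assumptions made for illustration; plain cosine similarity with a fixed cutoff is a simplification and not necessarily the scoring used to build MLQA.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mine_parallel(
    en_vecs: Dict[str, Vector],
    ja_vecs: Dict[str, Vector],
    threshold: float = 0.8,  # hypothetical cutoff, tune on held-out data
) -> List[Tuple[str, str, float]]:
    """For each English sentence id, keep its most similar Japanese
    sentence id if the similarity reaches the threshold."""
    pairs = []
    for en_id, en_vec in en_vecs.items():
        best_id, best_score = None, -1.0
        for ja_id, ja_vec in ja_vecs.items():
            score = cosine(en_vec, ja_vec)
            if score > best_score:
                best_id, best_score = ja_id, score
        if best_id is not None and best_score >= threshold:
            pairs.append((en_id, best_id, best_score))
    return pairs
```

The brute-force double loop is only practical for small candidate sets; for whole Wikipedia dumps an approximate nearest-neighbour index would likely be needed.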

The hard way (make as many instances as you want, but lots of work): If you wanted to make Japanese a "full member" of MLQA, you would need to find fresh sentences that are parallel between English, Japanese, and two of the other MLQA languages. You would then need to generate QA data in English with a crowdsourcing campaign, translate the questions from English into all the target languages, and annotate answers in all the target languages. This is the only real way of making "lots" of Japanese instances whilst maintaining the 4-way parallel aspect, but it is A LOT of work.
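To make the 4-way requirement concrete, the following sketch assumes that pairwise alignments from English to each other language have already been mined and stored as dictionaries keyed by an English sentence id. It lists the English sentences that have a Japanese counterpart plus counterparts in two further languages. The data layout, the function name, and the language codes are hypothetical and not taken from the MLQA code.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def four_way_candidates(
    alignments: Dict[str, Dict[str, str]],
    required: str = "ja",
    n_others: int = 2,
) -> List[Tuple[str, Tuple[str, ...]]]:
    """alignments[lang][en_id] is the id of the sentence in `lang` that is
    parallel to English sentence `en_id` (assumed layout).

    Returns (en_id, languages) tuples where `languages` is the required
    language followed by `n_others` additional languages that are all
    aligned to the same English sentence."""
    others = sorted(lang for lang in alignments if lang != required)
    candidates = []
    for en_id in sorted(alignments.get(required, {})):
        available = [lang for lang in others if en_id in alignments[lang]]
        for combo in combinations(available, n_others):
            candidates.append((en_id, (required,) + combo))
    return candidates
```

Each returned tuple is only a candidate context for the crowdsourcing step; the question writing, translation, and answer annotation described above still have to follow.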

If there is a lot of appetite for adding languages to MLQA, then we may look into other mechanisms of adding languages more easily.

icoxfog417 commented 4 years ago

Thank you for your polite reply!

I think the "hard way" is equivalent to re-creating the MLQA dataset with Japanese included. For that reason, I'm going to take the "easy way", which adds Japanese QA as a complement, via the following steps.

1. Find Japanese sentences b_ja that are parallel with the MLQA English sentences b_en using LASER (vector-based similarity).
2. Extract the paragraphs (contexts) c_ja that include b_ja.
3. Translate q_en to q_ja, and confirm that the answer to q_ja (a_ja) exists in c_ja.
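The context extraction and the answer check might look roughly like the sketch below. It assumes a Japanese article is available as plain text with paragraphs separated by blank lines, and that a_ja appears verbatim in c_ja; the function names and the output record layout are hypothetical.

```python
from typing import Dict, Optional, Union


def find_context(article_text: str, b_ja: str) -> Optional[str]:
    """Return the first paragraph of the article that contains b_ja,
    or None if no paragraph does. Paragraphs are assumed to be
    separated by blank lines."""
    for paragraph in article_text.split("\n\n"):
        paragraph = paragraph.strip()
        if paragraph and b_ja in paragraph:
            return paragraph
    return None


def build_instance(
    c_ja: str, q_ja: str, a_ja: str
) -> Optional[Dict[str, Union[str, int]]]:
    """Build a simple QA record if a_ja occurs verbatim in c_ja;
    return None so the caller can flag the item for manual review."""
    start = c_ja.find(a_ja)
    if start < 0:
        return None
    return {
        "context": c_ja,
        "question": q_ja,
        "answer_text": a_ja,
        "answer_start": start,  # character offset into the context
    }
```

An exact substring match will miss answers whose surface form differs after translation, so items for which `build_instance` returns None would still need a human annotator.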

patrick-s-h-lewis commented 4 years ago

Sounds great! Let me know how you get on.

facebookresearch / MLQA