icoxfog417 closed 4 years ago
Hi icoxfog417,
Thanks for your interest in the project! You can certainly add QA data in Japanese, but it may not yield as many data points as our other languages, and it will be quite a lot of work.
The easy way (less work, but probably not many Japanese instances can be created this way): First, you'll need to use LASER to find Japanese sentences b_ja that are parallel with the sentences b_en in the existing MLQA English data. Extracting the paragraphs that contain b_ja will give you c_ja. Once you've done this, you can translate the questions q_en into Japanese to get q_ja. Finally, you'll need to run an annotation campaign to annotate the Japanese answers a_ja. This will result in Japanese being "added on" to the existing MLQA data (making 5-way parallel instances). The issue is that you may not end up with many instances, because the parallel mining step may find only a small number of parallel sentences between English and Japanese.
The hard way (make as many instances as you want, but lots of work): If you wanted to make Japanese a "full member" of MLQA, you would need to find fresh sentences that are parallel between English, Japanese, and two of the other MLQA languages. You would then need to generate QA data in English with a crowdsourcing campaign, translate the questions from English into all the target languages, and annotate answers in all the target languages. This is the only real way of making "lots" of Japanese instances whilst maintaining the 4-way parallel aspect, but it is A LOT of work.
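To make the mining step concrete, here is a minimal sketch of matching English sentences to Japanese ones by vector similarity. It assumes the sentence embeddings have already been computed (for example with LASER) and are available as plain lists of floats. The function names, the toy vectors, and the 0.8 threshold are all hypothetical, and plain thresholded cosine similarity is a simplification chosen for illustration rather than the scoring the MLQA authors used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mine_parallel(en_vecs, ja_vecs, threshold=0.8):
    """For each English sentence, keep its best Japanese match
    if the similarity reaches the (assumed) threshold.

    Returns a list of (en_index, ja_index, score) tuples.
    """
    pairs = []
    for i, u in enumerate(en_vecs):
        best_j, best_score = -1, -1.0
        for j, v in enumerate(ja_vecs):
            score = cosine(u, v)
            if score > best_score:
                best_j, best_score = j, score
        if best_j >= 0 and best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


# Toy 3-dimensional vectors standing in for real sentence embeddings.
en_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
ja_vecs = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
pairs = mine_parallel(en_vecs, ja_vecs)
```

With these toy vectors only the first English sentence finds a Japanese match above the threshold, which mirrors the concern that mining may yield few b_en/b_ja pairs.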
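As a rough illustration of why the multi-way constraint shrinks the data, the sketch below keeps only English sentences that have a mined counterpart in every required language. The alignment dictionaries, language codes, and function name are hypothetical stand-ins for the output of a mining step, not part of the MLQA tooling.

```python
def n_way_parallel(alignments, required_langs):
    """Keep English sentences aligned in every required language.

    alignments maps a language code to {english_index: target_index}.
    Returns {english_index: {lang: target_index}} for the survivors.
    """
    common = None
    for lang in required_langs:
        keys = set(alignments.get(lang, {}))
        common = keys if common is None else common & keys
    return {
        i: {lang: alignments[lang][i] for lang in required_langs}
        for i in sorted(common or ())
    }


# Toy alignments: English sentence index -> sentence index in that language.
alignments = {
    "ja": {0: 5, 2: 7},
    "de": {0: 1, 1: 3, 2: 4},
    "es": {2: 9},
}
four_way = n_way_parallel(alignments, ["ja", "de", "es"])
```

Here only English sentence 2 survives, because it is the only one with a match in all three target languages.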
If there is a lot of appetite for adding languages to MLQA, then we may look into other mechanisms of adding languages more easily.
Thank you for your polite reply!
I think the "hard way" is equivalent to re-creating the MLQA dataset so that it includes Japanese. For that reason, I'm going to take the "easy way" and add Japanese QA as a complement, following these steps.
1. Find b_ja parallel with MLQA English sentences b_en by LASER (vector-based similarity).
2. Extract c_ja that includes b_ja.
3. Translate q_en to q_ja and confirm that the answer of q_ja (a_ja) exists in c_ja.

Sounds great! Let me know how you get on.
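For illustration, the paragraph-extraction and answer-check steps of the easy way could be sketched as below. The helper names and the toy Japanese text are hypothetical. The `answer_start` key is an assumption modelled on SQuAD-style annotations, where it is a character offset into the context.

```python
def find_context(paragraphs, sentence):
    """Return the first paragraph that contains the sentence, else None."""
    for paragraph in paragraphs:
        if sentence in paragraph:
            return paragraph
    return None


def locate_answer(context, answer):
    """Return a SQuAD-style answer dict if the answer text occurs
    verbatim in the context, else None."""
    start = context.find(answer)
    if start == -1:
        return None
    return {"text": answer, "answer_start": start}


# Toy data: two Japanese paragraphs and one mined sentence b_ja.
paragraphs = [
    "東京は日本の首都である。人口は約1400万人である。",
    "大阪は西日本の都市である。",
]
b_ja = "東京は日本の首都である。"
c_ja = find_context(paragraphs, b_ja)

a_ja = "日本の首都"
span = locate_answer(c_ja, a_ja)
missing = locate_answer(c_ja, "京都")
```

An exact substring check like this only confirms that a candidate a_ja appears in c_ja. It does not replace human annotation of the answer.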
First of all, thank you for the great work on evaluating multilingual QA.
I want to evaluate Japanese performance in the same context as MLQA. From Figure 1 in "MLQA: Evaluating Cross-lingual Extractive Question Answering", I think there are 2 ways to add QA data for other languages.
(b, c) and formulate c to the question (Left to Right).
Which is preferable, or is there a better way than the above?