bigscience-workshop / xmtf

Crosslingual Generalization through Multitask Finetuning
https://arxiv.org/abs/2211.01786
Apache License 2.0

Why does the number of templates differ between languages? #9

Closed MattYoon closed 1 year ago

MattYoon commented 1 year ago

Hello, thank you for your inspiring work!

I assumed that for xP3mt, all languages would have the same number of templates within a dataset, since they are all machine-translated from the English templates. However, while taking a look at xP3mt, I noticed that the number of templates differs between languages within the same dataset. For example, XCOPA has 12, 10 and 5 templates for English, Chinese and Italian, respectively.

It seems that:

  1. Not all English templates were MT'd to other languages. For XCOPA, only 5 out of 12 were MT'd from English to other languages.
  2. Some languages have paraphrased duplicates. For XCOPA-zh, each template has a paraphrased version, resulting in 10 (5x2) templates.

I checked that XQuAD is also like this.
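For reference, this is roughly how I counted the templates. It is only a minimal sketch: it assumes the bigscience promptsource fork that ships the xP3 templates is installed, and that the XCOPA templates are keyed as `DatasetTemplates("xcopa", "<lang>")`; the exact subset keys (especially for English) may differ in that fork.

```python
# Rough count of XCOPA templates per language via the promptsource API.
# Assumes the bigscience promptsource fork with the xP3(mt) templates is
# installed; the subset keys below (especially "en") are my assumption.
from promptsource.templates import DatasetTemplates

for lang in ["en", "zh", "it"]:
    names = DatasetTemplates("xcopa", lang).all_template_names
    print(lang, len(names), sorted(names))
```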

Your paper explains the experiments in great detail; however, I don't believe this particular point is mentioned. Could you please provide some additional information about this decision? Thank you in advance.

Muennighoff commented 1 year ago

Thanks for your detailed investigation!

For the evaluation datasets (XCOPA, XNLI, XWinograd, XStoryCloze), we selected 5 English templates at random. We then machine-translated them for all language splits, and additionally human-translated them for the language splits that are part of the 46 xP3 languages. The two variants are indicated by an "mt" (machine-translated) or "ht" (human-translated) suffix in the template's prompt name.

E.g. XCOPA-ZH is an xP3 / BLOOM language, hence it has both machine- and human-translated variants, i.e. 5 × 2 = 10 templates. XCOPA-IT is not an xP3 language, hence it only has machine-translated variants, i.e. 5 × 1 = 5 templates.
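A quick way to see the split is to group the prompt names by suffix. This is just a rough sketch; it assumes the xP3 promptsource fork is installed and that the translated prompt names end in "mt" / "ht" as described above.

```python
# Split XCOPA-zh prompt names into machine- and human-translated variants
# based on the "mt"/"ht" suffix convention described above (sketch only;
# assumes the xP3 promptsource fork is installed).
from promptsource.templates import DatasetTemplates

names = DatasetTemplates("xcopa", "zh").all_template_names
mt = [n for n in names if n.endswith("mt")]
ht = [n for n in names if n.endswith("ht")]
print(f"machine-translated: {len(mt)}, human-translated: {len(ht)}")
```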

I agree that it could be clearer - I've rephrased part of the beginning of Section 4.3 from: " ...To investigate performance on non-English prompts, we additionally human- and machine-translated the English prompts used for evaluation. In Table 1, we report performance when prompting in non-English languages. BLOOMZ performs... "

to " ...To investigate performance on non-English prompts, we additionally human- and machine-translated the English evaluation prompts from Figure~\ref{fig:taskgen}. In Table~\ref{tab:promptlangl1}, we report performance on these. Results on machine-translated prompts in languages that are not part of the fine-tuning corpus, such as those in Figure \ref{fig:langgen}, are in Appendix \S\ref{sec:fullresults}. Table~\ref{tab:promptlangl1} shows that BLOOMZ performs... "

Let me know if this doesn't make it clearer!

MattYoon commented 1 year ago

Thank you so much for your clear explanation! Now I fully understand why the validation sets have different numbers of prompts across languages.

However, I was a bit confused because XQuAD, which is a fine-tuning dataset, has 14, 7 and 9 prompts for Arabic, Chinese and the other languages, respectively.

[Screenshot: per-language prompt counts for XQuAD in xP3mt]

May I ask if this was unintentional? Thank you.

Muennighoff commented 1 year ago

Yeah, it seems like xp3longcontext_zhmt & xp3longchar_zhmt were unintentionally skipped. The folder here https://huggingface.co/datasets/bigscience/xP3mt/tree/main/ar should contain all files that were used for Arabic.
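If you want to cross-check programmatically, something like this lists the Arabic files in the dataset repo (a small sketch using huggingface_hub; the "ar/" prefix matches the folder linked above):

```python
# List the files under the ar/ folder of the xP3mt dataset repo on the Hub.
# Sketch only; requires huggingface_hub.
from huggingface_hub import list_repo_files

files = list_repo_files("bigscience/xP3mt", repo_type="dataset")
for f in sorted(f for f in files if f.startswith("ar/")):
    print(f)
```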

MattYoon commented 1 year ago

I see! Thank you so much for the detailed explanation!