MattYoon closed this issue 1 year ago
Thanks for your detailed investigation!
For the evaluation datasets (XCOPA, XNLI, XWinograd, XStoryCloze), we selected 5 English templates at random. We then machine-translated them for all language splits, and additionally human-translated them for the language splits that are among the 46 xP3 languages. The variant is indicated by an "mt" (= machine-translated) or "ht" (= human-translated) suffix in the template's prompt name.
E.g. XCOPA-ZH is an xP3 / BLOOM language, so it has both machine- and human-translated variants, i.e. 5 × 2 = 10 prompts. XCOPA-IT is not an xP3 language, so it has only machine-translated variants, i.e. 5 × 1 = 5 prompts.
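The counting rule above can be sketched as follows. This is an illustrative sketch only: the language-code set and constants here are assumptions for the example, not the actual 46-language xP3 list.

```python
# Illustrative subset standing in for the 46 xP3 languages (NOT the real list).
XP3_LANGUAGES = {"zh", "ar", "fr"}
N_ENGLISH_TEMPLATES = 5  # 5 English templates selected at random

def n_translated_prompts(lang: str) -> int:
    """Every language gets 5 machine-translated (mt) prompts; xP3 languages
    additionally get 5 human-translated (ht) ones."""
    n_variants = 2 if lang in XP3_LANGUAGES else 1
    return N_ENGLISH_TEMPLATES * n_variants

print(n_translated_prompts("zh"))  # 10 (5 mt + 5 ht)
print(n_translated_prompts("it"))  # 5 (mt only)
```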
I agree that it could be clearer - I've rephrased part of the beginning of Section 4.3 from: " ...To investigate performance on non-English prompts, we additionally human- and machine-translated the English prompts used for evaluation. In Table 1, we report performance when prompting in non-English languages. BLOOMZ performs... "
to " ...To investigate performance on non-English prompts, we additionally human- and machine-translated the English evaluation prompts from Figure~\ref{fig:taskgen}. In Table~\ref{tab:promptlangl1}, we report performance on these. Results on machine-translated prompts in languages that are not part of the fine-tuning corpus, such as those in Figure \ref{fig:langgen}, are in Appendix \S\ref{sec:fullresults}. Table~\ref{tab:promptlangl1} shows that BLOOMZ performs... "
Let me know if this doesn't make it clearer!
Thank you so much for your clear explanation! Now I fully understand why the validation sets have different numbers of prompts across languages.
However, I was a bit confused because XQuAD, which is a fine-tuning set, has 14, 7 and 9 prompts for Arabic, Chinese and others, respectively.
May I ask if this was unintentional? Thank you.
Yeah, it seems like xp3longcontext_zhmt & xp3longchar_zhmt were unintentionally skipped. The folder at https://huggingface.co/datasets/bigscience/xP3mt/tree/main/ar should contain all files that were used for Arabic.
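A discrepancy like this can be surfaced by diffing the per-language template lists after normalizing the language code out of the names. The file names below are illustrative stand-ins for the actual xP3mt files, and the helper names are hypothetical:

```python
# Hypothetical sketch: find templates present for a reference language but
# missing for another, after normalizing out the language code in the name.
def strip_lang(name: str, lang: str) -> str:
    # e.g. "xp3longcontext_zhmt" -> "xp3longcontext_mt"
    return name.replace(f"_{lang}mt", "_mt")

def missing_templates(ref: set, ref_lang: str, other: set, other_lang: str) -> set:
    ref_norm = {strip_lang(n, ref_lang) for n in ref}
    other_norm = {strip_lang(n, other_lang) for n in other}
    return ref_norm - other_norm

# Illustrative file stems, not the actual repo contents:
ar = {"xp3longcontext_armt", "xp3longchar_armt", "xp3short_armt"}
zh = {"xp3short_zhmt"}

print(missing_templates(ar, "ar", zh, "zh"))
# {'xp3longcontext_mt', 'xp3longchar_mt'}
```

The same comparison could be run against the real file lists (e.g. fetched with `huggingface_hub.list_repo_files` on the `bigscience/xP3mt` dataset repo).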
I see! Thank you so much for the detailed explanation!
Hello, thank you for your inspiring work!
I assumed that for xP3mt, all languages would have the same number of templates within a dataset, since they are all machine-translated from the English templates. However, while taking a look at xP3mt, I noticed that the number of templates differs between languages within the same dataset. For example, XCOPA has 12, 10 and 5 templates for English, Chinese and Italian, respectively.
I checked that XQuAD also shows this pattern.
Your paper explains the experiments in great detail; however, I believe this particular detail was not mentioned. Could you please provide some additional information about this decision? Thank you in advance.