bigscience-workshop / xmtf

Crosslingual Generalization through Multitask Finetuning
https://arxiv.org/abs/2211.01786
Apache License 2.0
513 stars, 37 forks

Questions about data #3

Closed lbourdois closed 1 year ago

lbourdois commented 1 year ago

Hi 😀

First of all, thank you for your very interesting work 🚀

I have two questions that I couldn't answer on my own (maybe I didn't search well enough), and I would appreciate your help.

1) For a given task and a given language, I would like to know which prompts were used for finetuning. Take French summarization as an example: I searched for the prompts that were used, but I couldn't find a list summarizing this information. PromptSource provides 2085 prompts in English, but nothing about translations into other languages. Does such a list exist? 🤔

2) As a workaround for the previous point, I thought I could download the xP3mt dataset and read directly which prompts were used. The problem is that you can download all the data for a selected language, but you can't additionally filter by task or (sub)dataset. Could this be added? Even better would be to release individual multilingual datasets of the translations you have made. For example, being able to load an "mSamSum", a multilingual version of SamSum, which is originally English-only. Such datasets could probably be reused in other work, especially monolingual work. Taking French summarization as an example again, little data is currently available: OrangeSum, XLSum and WikiLingua. Having easy access to translations of CNN/Daily Mail, Gigaword, MultiNews, SamSum and XSum would make very interesting things possible 🤯
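To make the filter asked for in point 2 concrete, here is a minimal sketch of what I have in mind. It assumes that each language folder contains one JSON Lines file per (sub)dataset and that the file name mentions the source dataset. The file names below are hypothetical and only illustrate that assumed pattern, and `select_files` is a helper name I made up.

```python
def select_files(file_names, language, dataset_keyword):
    """Keep files under `language/` whose base name mentions `dataset_keyword`.

    Assumption: files are laid out as "<language>/<file>.jsonl" and the file
    name contains the name of the source dataset.
    """
    keyword = dataset_keyword.lower()
    prefix = f"{language}/"
    return [
        name
        for name in file_names
        if name.startswith(prefix)
        and keyword in name.rsplit("/", 1)[-1].lower()
    ]


# Hypothetical listing, for illustration only (not the real repository contents).
example_files = [
    "fr/xp3_GEM_wiki_lingua_fr_train.jsonl",
    "fr/xp3_xquad_fr_train.jsonl",
    "en/xp3_GEM_wiki_lingua_en_train.jsonl",
]

selected = select_files(example_files, "fr", "wiki_lingua")
print(selected)
```

If the repository really follows such a layout, the listing could probably be obtained with `huggingface_hub.list_repo_files(..., repo_type="dataset")`, and the selected files loaded one at a time instead of downloading a whole language. I have not verified this against the actual files.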

lbourdois commented 1 year ago

By opening all the datasets and referring to https://github.com/bigscience-workshop/promptsource/issues/838, it turns out that you did not translate all the datasets from English to French, as I had understood. Instead, you added the French part of 8 multilingual datasets (available at https://huggingface.co/datasets/bigscience/xP3/viewer/fr/train) and translated the prompts into French for 3 of these 8 datasets. So my questions are not relevant. My bad, I'm closing this.