Could you please release the processed pretraining data?

RUCAIBox / MVP

This repository is the official implementation of our paper MVP: Multi-task Supervised Pre-training for Natural Language Generation.

Apache License 2.0


Could you please release the processed pretraining data? #8

Closed phellonchen closed 1 year ago

StevenTang1998 commented 1 year ago

You can download them from https://huggingface.co/RUCAIBox. Since some datasets have license limitations, we cannot merge them into a single dataset. You can merge them on your own.

phellonchen commented 1 year ago

Thanks. One more question: where can I find the code for the temperature-scaled mixing strategy (Raffel et al., 2020) with a rate of T = 2, which is used to mitigate the disparity in tasks and datasets? I have not found it in https://github.com/RUCAIBox/TextBox.

StevenTang1998 commented 1 year ago

The general pre-training code is still under development. For pre-training MVP, we simply implemented the temperature-scaled mixing strategy by copying instances. You can use the same approach as a simple alternative. For example, suppose dataset A has 2 instances and dataset B has 8 instances. We merge them into a unified dataset with the temperature-scaled mixing strategy by doubling the instances in dataset A.
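
For readers who want to try the copying approach, here is a minimal Python sketch. It is not the authors' code. It assumes that temperature-scaled mixing gives each dataset a weight proportional to n^(1/T), where n is its number of instances, that the largest dataset is kept as-is, and that smaller datasets are replicated a whole number of times (rounded to the nearest integer) to approximate that weight. The function names `temperature_scaled_copies` and `merge_by_copying` are hypothetical. With T = 2, the weights for 2 and 8 instances are √2 : √8 = 1 : 2, so A should contribute 4 instances against B's 8, which matches the doubling described above.

```python
def temperature_scaled_copies(sizes, temperature=2.0):
    """Return how many times each dataset should be copied.

    sizes: dict mapping dataset name -> number of instances.
    Assumption: the mixing weight of a dataset is n ** (1 / T), and the
    largest dataset is used as the unreplicated reference.
    """
    weights = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    largest = max(sizes, key=sizes.get)
    factors = {}
    for name, n in sizes.items():
        # Number of instances this dataset should contribute so that its
        # share relative to the largest dataset matches the weight ratio.
        target = sizes[largest] * weights[name] / weights[largest]
        # Assumption: round to whole copies, and never drop a dataset.
        factors[name] = max(1, round(target / n))
    return factors


def merge_by_copying(datasets, temperature=2.0):
    """Merge datasets into one list, replicating the smaller ones."""
    sizes = {name: len(items) for name, items in datasets.items()}
    factors = temperature_scaled_copies(sizes, temperature)
    merged = []
    for name, items in datasets.items():
        merged.extend(list(items) * factors[name])
    return merged


# The example from the thread: A has 2 instances, B has 8.
datasets = {
    "A": ["a1", "a2"],
    "B": ["b1", "b2", "b3", "b4", "b5", "b6", "b7", "b8"],
}
factors = temperature_scaled_copies({k: len(v) for k, v in datasets.items()})
merged = merge_by_copying(datasets)
```

Here `factors` gives A a replication factor of 2 and B a factor of 1, so the merged list holds 12 instances. Whole-number copying only approximates the target ratio when the sizes do not divide as cleanly as in this example; shuffling the merged list before training is left to the caller.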