Closed: yushengsu-thu closed this issue 7 months ago
Hi, the Hugging Face dataset at EleutherAI/proof-pile-2 does not have the upsampling ratio applied to it.
Upsampling during our training was performed by first tokenizing each mixture component separately, then setting the mixture ratios with the GPT-NeoX library. The exact ratios can be found here: https://github.com/EleutherAI/gpt-neox/blob/5dd366539803dbf1fd725cc057013fd002a4cfd4/configs/data_mixture.yml
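To make the weighted-sampling idea concrete, here is a minimal Python sketch of how per-component mixture weights translate into sampling probabilities. The 2:4:1 figure is the one quoted in the question below, not confirmed from the config; the authoritative values live in the linked data_mixture.yml, and GPT-NeoX implements this over tokenized shards rather than the simple per-document draw shown here.

```python
import random

# Hypothetical weights in the style of GPT-NeoX's train-data-weights.
# 2:4:1 is an assumption taken from the question; check data_mixture.yml
# for the values actually used in training.
weights = {"arxiv": 2, "open-web-math": 4, "algebraic-stack": 1}

total = sum(weights.values())
probabilities = {name: w / total for name, w in weights.items()}

def sample_component(rng: random.Random) -> str:
    """Pick which component the next training document is drawn from."""
    r = rng.random()
    cumulative = 0.0
    for name, p in probabilities.items():
        cumulative += p
        if r < cumulative:
            return name
    return name  # guard against floating-point rounding at the boundary

rng = random.Random(0)
draws = [sample_component(rng) for _ in range(70_000)]
for name in weights:
    share = draws.count(name) / len(draws)
    print(f"{name}: empirical {share:.3f} vs target {probabilities[name]:.3f}")
```

With enough draws, the empirical shares converge to 2/7, 4/7, and 1/7, which is what "upsampling" means operationally: over-represented components are simply sampled more often than their raw size would imply.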
If I download "EleutherAI/proof-pile-2" with the "default" config, will I get the mixture data (arxiv, open-web-math, algebraic-stack) at a 2:4:1 ratio?