Closed: yushengsu-thu closed this issue 7 months ago
Hi, the Hugging Face dataset at EleutherAI/proof-pile-2 does not have the upsampling ratio applied to it.
Upsampling during our training was performed by first tokenizing each mixture component separately, then setting the mixture ratios with the GPT-NeoX library. The exact ratios can be found here: https://github.com/EleutherAI/gpt-neox/blob/5dd366539803dbf1fd725cc057013fd002a4cfd4/configs/data_mixture.yml
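To make the weighted-sampling idea concrete, here is a minimal Python sketch of how per-component mixture weights translate into sampling probabilities. The 2:4:1 figure is the one quoted in the question below, not confirmed from the config; the authoritative values live in the linked data_mixture.yml, and GPT-NeoX implements this over tokenized shards rather than the simple per-document draw shown here.

```python
import random

# Hypothetical weights in the style of GPT-NeoX's train-data-weights.
# 2:4:1 is an assumption taken from the question; check data_mixture.yml
# for the values actually used in training.
weights = {"arxiv": 2, "open-web-math": 4, "algebraic-stack": 1}

total = sum(weights.values())
probabilities = {name: w / total for name, w in weights.items()}

def sample_component(rng: random.Random) -> str:
    """Pick which component the next training document is drawn from."""
    r = rng.random()
    cumulative = 0.0
    for name, p in probabilities.items():
        cumulative += p
        if r < cumulative:
            return name
    return name  # guard against floating-point rounding at the boundary

rng = random.Random(0)
draws = [sample_component(rng) for _ in range(70_000)]
for name in weights:
    share = draws.count(name) / len(draws)
    print(f"{name}: empirical {share:.3f} vs target {probabilities[name]:.3f}")
```

With enough draws, the empirical shares converge to 2/7, 4/7, and 1/7, which is what "upsampling" means operationally: over-represented components are simply sampled more often than their raw size would imply.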
If I download "EleutherAI/proof-pile-2" with the "default" config, will I get the mixture data (arxiv, open-web-math, algebraic-stack) at a 2:4:1 ratio?