EleutherAI / the-pile

MIT License

"Github" code data download only #101

Open HangXue-lab opened 1 year ago

HangXue-lab commented 1 year ago

The Pile is too large for me to download in full. I just want the "Github" code data, but the Pile train set is split into 30 files. I would like to know exactly which file contains the "Github" code data.

igorbrigadir commented 1 year ago

The data in the train files has already been processed by that stage, so it may not be what you want. You probably want github.tar from the preliminary components, https://the-eye.eu/public/AI/pile_preliminary_components/github.tar, which you can then process yourself.
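
If the processed train files are still what you need, a minimal sketch of filtering out one component might look like the following. It assumes each decompressed line is a JSON object with a `text` field and a `meta.pile_set_name` label, and that records from all components are mixed across the 30 files, so every file has to be scanned; neither detail is confirmed in this thread. The function name `iter_subset` is hypothetical, and decompressing the files (for example with the third-party `zstandard` package) is left out.

```python
import json
from typing import Iterable, Iterator


def iter_subset(lines: Iterable[str], subset: str = "Github") -> Iterator[str]:
    """Yield the text of every record labelled with the given component name.

    Assumes one JSON object per line, shaped like
    {"text": "...", "meta": {"pile_set_name": "..."}}.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("meta", {}).get("pile_set_name") == subset:
            yield record["text"]


if __name__ == "__main__":
    # In-memory stand-in for the lines of one decompressed train file.
    sample = [
        '{"text": "def f(): pass", "meta": {"pile_set_name": "Github"}}',
        '{"text": "Some article.", "meta": {"pile_set_name": "Wikipedia (en)"}}',
    ]
    for text in iter_subset(sample):
        print(text)
```

Because the function takes any iterable of lines, it can be fed an open file handle or a streaming decompressor without loading a whole file into memory.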

osainz59 commented 1 year ago

The link no longer works. Is there another link to obtain the data?