EleutherAI / the-pile

MIT License

Code generation #87

Closed 6r1d closed 3 years ago

6r1d commented 3 years ago

Hello. As far as I understand (correct me if I'm wrong), it's possible to add more code-related sample texts to new versions of The Pile.

This is a list I can think of right now. It is probably wrong in many respects, as I have no experience preparing such datasets. If scraping some of these sites is considered useful, I can try to help.

I'm sure I'm missing quite a few good ideas here. There are many algorithm implementations inside programming language source code (Python's "batteries", for example), and there are many LibC implementations to look at as well.

UPD: I'm reading the paper "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" and I've noticed that GitHub and StackExchange have already been scraped, but I'll leave the issue open to discuss the other sites. It's not much, but I think those would be nice to have.

thoppe commented 3 years ago

At the moment, I don't think new additions are being accepted (@StellaAthena would know more). What helped us when we were designing The Pile, though, was determining the size and quality of each dataset before we started scraping. For the sources you listed, getting a rough estimate of the usable text size (in GB) would be a great first step for evaluation.
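One rough way to get such an estimate without scraping a whole site is to fetch a small sample of pages, measure how much usable text each one yields, and extrapolate to the total page count. The snippet below is only a sketch of that idea, not how The Pile itself was measured: the function name `estimate_corpus_size_gb`, the sample documents, and the page count are all hypothetical, and it assumes the sample is representative of the site.

```python
def estimate_corpus_size_gb(sample_texts, total_documents):
    """Extrapolate the usable text size (in GB) from a small sample.

    sample_texts:    already-cleaned text of a few sampled documents
    total_documents: estimated number of documents on the whole site
    """
    if not sample_texts:
        raise ValueError("need at least one sample document")
    # Measure size in UTF-8 bytes, since dataset sizes are reported in bytes.
    sample_bytes = [len(text.encode("utf-8")) for text in sample_texts]
    mean_bytes = sum(sample_bytes) / len(sample_bytes)
    # 1 GB is taken as 10**9 bytes here.
    return mean_bytes * total_documents / 1e9


# Hypothetical example: two sampled pages, one million pages on the site.
samples = ["a" * 1000, "b" * 3000]
estimate = estimate_corpus_size_gb(samples, total_documents=1_000_000)
print(f"{estimate:.2f} GB")
```

A larger sample gives a steadier estimate, and the text should be measured after boilerplate such as navigation and markup is stripped, since only the usable part counts.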