EleutherAI / the-pile

MIT License

Move processing code to this repo #41

Closed StellaAthena closed 3 years ago

StellaAthena commented 3 years ago

Having a whole bunch of repositories scattered across GitHub for processing code is no bueno. We should really make a directory in this repo for housing them. If people want to keep theirs off-repo that's fine, but I really don't see why we shouldn't house them here.

I've assigned people who have been loud about this in the past to this issue.

thoppe commented 3 years ago

A sub-directory for each project would be great. I don't think we should pollute the main EleutherAI GitHub namespace with a repo for each data pull, especially since some of the pull scripts are rather small. However, it might be nice to have a repo just for data processing -- this way the development of the Pile itself can proceed independent of processing and adding a contributor for commits makes more sense than here.

StellaAthena commented 3 years ago

@thoppe I think there’s something called “subrepositories” on GitHub. To be clear, you just mean making a directory and putting the code in it? I would actually recommend a two-layer system:

the-pile/
    data processing/
        Wikipedia/
            main.py
        arXiv/
            main.py

Some of the processing code spans more than one file, which is why I’m recommending this. We can look at consolidating each script into a single file though, if people dislike the added layer (I know @leogao2 has strong opinions about the number of clicks to get to things).
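For illustration, the two-layer layout can be sketched in Python. This is a minimal sketch, not code from the repository: it assumes each dataset subdirectory under "data processing/" exposes its entry point as `main.py`, as in the tree above. The helper names `find_entry_points` and `build_demo_layout` and the extra `helpers.py` file are hypothetical.

```python
from pathlib import Path
from typing import Dict
import tempfile


def find_entry_points(root: Path, entry_name: str = "main.py") -> Dict[str, Path]:
    """Map each dataset subdirectory of `root` to its entry script.

    Assumes the two-layer layout: root/<dataset>/<entry_name>.
    Subdirectories without an entry script are skipped.
    """
    entry_points = {}
    for dataset_dir in sorted(root.iterdir()):
        script = dataset_dir / entry_name
        if dataset_dir.is_dir() and script.is_file():
            entry_points[dataset_dir.name] = script
    return entry_points


def build_demo_layout(base: Path) -> Path:
    """Create a throwaway copy of the proposed layout (hypothetical contents)."""
    root = base / "data processing"
    for dataset in ("Wikipedia", "arXiv"):
        dataset_dir = root / dataset
        dataset_dir.mkdir(parents=True)
        (dataset_dir / "main.py").write_text("# placeholder entry point\n")
        # A second file per dataset: the reason for the extra directory layer.
        (dataset_dir / "helpers.py").write_text("# placeholder helper module\n")
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = build_demo_layout(Path(tmp))
        for name, script in find_entry_points(root).items():
            print(name, "->", script.relative_to(root))
```

With one directory per dataset, multi-file processing code stays together, and the cost is one extra click to reach `main.py`.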

StellaAthena commented 3 years ago

> However, it might be nice to have a repo just for data processing -- this way the development of the Pile itself can proceed independent of processing and adding a contributor for commits makes more sense than here.

This is interesting, and something I hadn’t considered. I’m not sure how much sense it makes though... the-pile proper is closely tied in with the data processing code. What would it look like for the-pile and the data processing to diverge? Isn’t this the data processing for the pile?

thoppe commented 3 years ago

Sorry that wasn't clear; your "two-layer system" is what I had envisioned. :+1:

Furthermore, I suggest it be moved to its own repo: 1) for consolidation, since right now there are scripts all over the place; 2) if data processing had its own repo, we, the data collectors, could push to it without regard to the final stage of the pipeline (which is what The-Pile looks like). Permissions for the data collection repo can be more permissive than this one. Since @bmk is managing the final stages of the data, it might be useful, and less cognitive load, to split them up.

StellaAthena commented 3 years ago

We have decided that we will copy processing code into the EleutherAI GitHub but not into this directory specifically. In the future, we may make a “data processing” directory that contains each data processing codebase as a subdirectory.