EleutherAI / the-pile

Adding TFDS Implementation for the_pile #74

Closed trisongz closed 3 years ago

trisongz commented 3 years ago

Notes are left within the file.

leogao2 commented 3 years ago

I'm going to merge this now even though I haven't tested it, since the Pile codebase is still unstable. It would be great if someone could test it and report whether it works, though.

Eventually, my goal is for the Pile repo, as installed through PyPI for instance, to mostly expose ways to pull the Pile for different types of data consumption (e.g., PyTorch dataset, TFDS, raw iterator), with the replication stuff out of the spotlight, so this is a great first step.

The only minor thing is that once we get the Pile to its final location, the URL will need to be updated. That's not a big deal; we just need to not forget when it happens.
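
For anyone picking this up, here is a rough sketch of what the "raw iterator" style of consumption mentioned above could look like. It is not the code in this PR or the repo's actual API: the function names (`iter_documents`, `iter_text`, `write_demo_shard`) are hypothetical, and it assumes the data sits in local JSON Lines shards with one object per line and a `text` field, which is an assumption made only for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Iterator


def iter_documents(paths: Iterable[Path]) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line, shard by shard.

    Assumption: each shard is a UTF-8 JSON Lines file (one object per line).
    """
    for path in paths:
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines instead of failing on them
                    yield json.loads(line)


def iter_text(paths: Iterable[Path], field: str = "text") -> Iterator[str]:
    """Yield only the text payload of each record.

    Assumption: the payload lives under a "text" key (hypothetical default).
    """
    for record in iter_documents(paths):
        yield record[field]


def write_demo_shard(directory: Path, name: str, texts: Iterable[str]) -> Path:
    """Write a tiny JSON Lines shard so the iterators can be tried locally."""
    path = directory / name
    with open(path, "w", encoding="utf-8") as handle:
        for text in texts:
            handle.write(json.dumps({"text": text}) + "\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shards = [
            write_demo_shard(Path(tmp), "shard0.jsonl", ["first doc", "second doc"]),
            write_demo_shard(Path(tmp), "shard1.jsonl", ["third doc"]),
        ]
        for text in iter_text(shards):
            print(text)
```

A plain generator like this stays lazy, so a PyTorch or TFDS wrapper could sit on top of it without loading every shard into memory first.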