Closed — trisongz closed this 3 years ago
I'm going to merge this now even though I haven't tested it, since the Pile codebase is still unstable. It would be great if someone could test it and tell us whether it works, though.
Eventually my goal is for the Pile repo, as installed through PyPI for instance, to mostly expose ways to pull the Pile for different types of data consumption (i.e. PyTorch dataset, tfds, raw iterator, etc.), with the replication stuff out of the spotlight, so this is a great first step.
The only minor thing is that once we get the Pile to its final location, the URL will need to be updated. That's not a big deal; we just need to not forget when it happens.
Notes are left within the file.
Allows users to cache tfrecords/downloaded files in their own GCS bucket to reduce bandwidth usage
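The caching idea above can be sketched roughly as follows. This is not the PR's actual code: `fetch_with_cache` and its `download` callback are hypothetical names, and a plain local directory stands in for the user's GCS bucket (in the PR this path would be a `gs://` location accessed via TensorFlow's file utilities) so the sketch runs anywhere.

```python
import shutil
from pathlib import Path

def fetch_with_cache(name: str, cache_dir: str, download) -> Path:
    """Return a cached copy of `name`, invoking `download` only on a cache miss.

    `cache_dir` stands in for the user's own GCS bucket; `download` is a
    user-supplied function that fetches the file and returns a local path.
    """
    cached = Path(cache_dir) / name
    if cached.exists():
        return cached  # cache hit: no bandwidth used
    cached.parent.mkdir(parents=True, exist_ok=True)
    tmp = download(name)        # fetch from the original source
    shutil.copy(tmp, cached)    # populate the cache for next time
    return cached
```

On the second request for the same shard, the downloader is never called, which is the bandwidth saving the note describes.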
Maps the dataset to the TensorFlow standard tf.data.Dataset using tfds, ensuring the dataset is ready to use in TF1/TF2 training pipelines
Can compile the entire dataset through Colab without an external VM (sort of hacky)
Stays consistent with the current reader implementation, with the minor change of using the simdjson library for faster JSON parsing
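The simdjson change is essentially a drop-in swap, since pysimdjson exposes a `loads()` compatible with the stdlib's. A minimal sketch of a jsonl reader in that style (the `read_jsonl` helper and the `"text"` field assumption are illustrative; the fallback to stdlib `json` keeps it runnable when pysimdjson isn't installed):

```python
import io

try:
    import simdjson as json  # pip install pysimdjson; drop-in loads()
except ImportError:
    import json              # stdlib fallback, same interface

def read_jsonl(fp):
    """Yield the 'text' field of each JSON line (illustrative reader)."""
    for line in fp:
        if line.strip():
            yield json.loads(line)["text"]

sample = io.StringIO('{"text": "hello"}\n{"text": "world"}\n')
print(list(read_jsonl(sample)))  # ['hello', 'world']
```

Because the interface is identical, no other reader code needs to change; only the parse speed does.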