Closed — trisongz closed this 3 years ago
I'm going to merge this now even though I haven't tested it, since the Pile codebase is still unstable. It would be great if someone could test it and tell us whether it works, though.
Eventually my goal is for the Pile repo, as installed through PyPI for instance, to mostly expose ways to pull the Pile for different types of data consumption (i.e. PyTorch dataset, tfds, raw iterator, etc.), with the replication stuff out of the spotlight, so this is a great first step.
The only minor thing is that once we get the Pile to its final location, the URL will need to be updated. That's not a big deal; we just need to not forget when it happens.
Notes are left within the file.
Allows users to cache tfrecords/downloaded files in their own GCS bucket to reduce bandwidth usage
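The caching idea above can be sketched roughly as follows. This is not the PR's actual code: `fetch_with_cache` and its `download` callback are hypothetical names, and a plain local directory stands in for the user's GCS bucket (in the PR this path would be a `gs://` location accessed via TensorFlow's file utilities) so the sketch runs anywhere.

```python
import shutil
from pathlib import Path

def fetch_with_cache(name: str, cache_dir: str, download) -> Path:
    """Return a cached copy of `name`, invoking `download` only on a cache miss.

    `cache_dir` stands in for the user's own GCS bucket; `download` is a
    user-supplied function that fetches the file and returns a local path.
    """
    cached = Path(cache_dir) / name
    if cached.exists():
        return cached  # cache hit: no bandwidth used
    cached.parent.mkdir(parents=True, exist_ok=True)
    tmp = download(name)        # fetch from the original source
    shutil.copy(tmp, cached)    # populate the cache for next time
    return cached
```

On the second request for the same shard, the downloader is never called, which is the bandwidth saving the note describes.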
Maps the dataset to the TensorFlow standard tf.data.Dataset using tfds, ensuring the dataset is ready to use in TF1/TF2 training pipelines
Can compile the entire dataset through Colab without an external VM (sort of hacky)
Stays consistent with the current reader implementation, with the minor change of using the simdjson library for faster JSON parsing
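The simdjson change is essentially a drop-in swap, since pysimdjson exposes a `loads()` compatible with the stdlib's. A minimal sketch of a jsonl reader in that style (the `read_jsonl` helper and the `"text"` field assumption are illustrative; the fallback to stdlib `json` keeps it runnable when pysimdjson isn't installed):

```python
import io

try:
    import simdjson as json  # pip install pysimdjson; drop-in loads()
except ImportError:
    import json              # stdlib fallback, same interface

def read_jsonl(fp):
    """Yield the 'text' field of each JSON line (illustrative reader)."""
    for line in fp:
        if line.strip():
            yield json.loads(line)["text"]

sample = io.StringIO('{"text": "hello"}\n{"text": "world"}\n')
print(list(read_jsonl(sample)))  # ['hello', 'world']
```

Because the interface is identical, no other reader code needs to change; only the parse speed does.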