allenai / satlas

Apache License 2.0
184 stars 19 forks

Consider hosting datasets in this GitHub repo, using our XetData extension #25

Closed: srinify closed 8 months ago

srinify commented 8 months ago

Hey folks! I work at XetHub and we scale Git to handle large files. We recently launched a GitHub integration that brings this into GitHub repos too and it's free forever for public repos.

As an example, we brought 100+ GB of ONNX model files into this repo: https://github.com/xetdata/onnx-models

After people have installed our tiny Git extension, whenever they clone from this repo all large files are downloaded. They can optionally ignore the large files as well.

If that sounds interesting, I'd love to collaborate to make this happen.

favyen2 commented 8 months ago

We still intend to upload the data to Hugging Face. However, we will use separate repositories for the code and the data, since even with Git LFS, keeping the data in the same repo makes a full copy of the repo unwieldy (especially with our 40 TB dataset).