JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License

Preprocessed files on S3/Google Drive #13

Closed by tals 1 year ago

tals commented 1 year ago

Hey there, and thank you for this wonderful work!

I'm trying to grab the preprocessed dataset files from Dropbox, but downloading them to a remote machine is a bit of a pain because Dropbox puts restrictions on the links :\

Would it be possible for you to mirror it on Google Drive (so gdown would work) or on S3 (via Requester Pays)?

JonasGeiping commented 1 year ago

Hi, sorry to hear that. I didn't expect demand for the preprocessed data to be more than sporadic, let alone enough to overwhelm the Dropbox limits.

I don't have a gdrive account large enough to host these files, and I've never worked with AWS in any serious capacity, so I couldn't easily set this up (and requester pays would not be my favorite model).
What would make the most sense for me is a Hugging Face datasets upload, but I don't have time to set that up at the moment.

JonasGeiping commented 1 year ago

This is now very easy with https://github.com/JonasGeiping/cramming/releases/tag/Torch2.1. You can just stream the data directly from Hugging Face.
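
For anyone landing here later, this is a minimal sketch of what streaming from the Hugging Face Hub can look like with the `datasets` library. It is not the repository's own loading code: the helper name `peek_streamed_dataset` and the repo id in the usage comment are placeholders, and the only library call assumed is `datasets.load_dataset` with `streaming=True`. Check the release notes linked above for the actual dataset names and config options.

```python
from itertools import islice


def peek_streamed_dataset(repo_id, n=3, split="train"):
    """Return the first n examples of a Hub dataset without downloading it fully.

    repo_id is a placeholder for whichever preprocessed dataset you want;
    this helper is hypothetical and not part of the cramming codebase.
    """
    # Imported lazily so the module can be loaded without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields an iterable dataset that fetches examples on demand
    # instead of materializing every file on local disk first.
    stream = load_dataset(repo_id, split=split, streaming=True)

    # islice stops after n examples, so only a small amount of data is pulled.
    return list(islice(stream, n))


# Usage (placeholder repo id, replace with the real dataset name):
# examples = peek_streamed_dataset("<user>/<preprocessed-dataset>", n=2)
```

Because the data is fetched lazily, this sidesteps the link restrictions and bulk downloads discussed above, at the cost of needing network access during training.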