Iron-Bound closed this issue 6 months ago.
Sorry, that's frustrating. I've seen this happen occasionally in the past, but not as consistently as it's happening for you. I don't know of a way to authenticate with Hugging Face (HF), but if you find one, let me know and I'd be happy to add support for it to gptcore.
The streaming solution is meant to let developers and researchers get to testing quickly without a lot of setup, especially on rented instances where there isn't much persistent storage, or where you don't want to pay for the expensive persistent storage on offer.
Downloading your dataset is always an option, but as you mention, it may be quite large. You can download just part of it, though: this dataset comes split into many files, so you can fetch a single one and change the code in pile.py to refer only to that file. You could also switch to a smaller HF dataset and download it, such as teven/enwiki10k or teven/enwiki100k.
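If you have downloaded one shard and decompressed it to plain JSON Lines (The Pile ships its shards as `.jsonl.zst`), a loader only needs to read that single file. The sketch below is a minimal stdlib version under stated assumptions: each line is a JSON object with a `text` field, and `iter_texts` is a hypothetical helper, not code from pile.py. With Hugging Face `datasets` installed, `load_dataset("json", data_files=path, split="train")` is the library route to the same result.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_texts(path: str, field: str = "text") -> Iterator[str]:
    """Yield the `field` value of each JSON record in a local .jsonl file.

    Hypothetical helper: assumes one JSON object per line, as in a
    decompressed Pile shard. Blank lines are skipped.
    """
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank or trailing empty lines
            yield json.loads(line)[field]
```

Because it is a generator, it reads one line at a time, so even a multi-gigabyte shard never has to fit in memory; for example, `for text in iter_texts("00.jsonl"): ...` walks the shard once.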
Another option, which I've used before with mixed success (the CDN can sometimes be a little flaky), is to download your dataset and host it yourself, for example behind a CDN. This can be quite cost-effective and gives you full control over the data.
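When the source is occasionally flaky, a common mitigation is to retry each download with exponential backoff instead of failing on the first error. This is a generic sketch, not something gptcore does today; `with_retries` is a hypothetical name, and it assumes transient failures surface as `OSError` (which covers urllib's `URLError` and socket timeouts).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch` until it succeeds, doubling the wait after each failure.

    Only OSError is retried; the last error is re-raised once the
    attempts run out so the caller still sees the real cause.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default
    raise ValueError("max_attempts must be at least 1")
```

Usage might look like `data = with_retries(lambda: urllib.request.urlopen(url, timeout=30).read())`, where `url` points at your hosted file. The `sleep` parameter is injectable only so the backoff schedule can be tested without actually waiting.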
Hey, thanks for the response.
As you said, the streaming feature is really neat, so I'll look into how file transfers are handled by the official huggingface-cli; maybe I'll dig up some clues.
If all else fails, I'll learn how the datasets factory works and use the minipile dataset for testing instead.
So I don't think the issue is the network, since I'm able to curl the files it reports as failing. Something in my ROCm container may be interfering with the data workers; give me some time to investigate.
Thanks for looking into it!
(Email reply sent Thu, Jan 25, 2024, 8:07 AM, quoting Andrew's comment above.)
Add auth to Hugging Face?
I'm not entirely sure what you mean, but couldn't you use:

```python
from huggingface_hub import login

login(token="API-KEY")  # replace with your Hugging Face access token
```
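For the direct-download and self-hosted routes discussed earlier, where the `datasets` library is bypassed, the token can also be attached by hand. This sketch assumes the Hub accepts a standard `Authorization: Bearer <token>` header and uses the `datasets/<repo>/resolve/<revision>/<path>` URL layout for dataset repos; `hf_file_request` and any repo or file names you pass are hypothetical. If you would rather not build URLs yourself, `huggingface_hub.hf_hub_download(repo_id, filename, repo_type="dataset")` handles this for you.

```python
import os
import urllib.parse
import urllib.request
from typing import Optional


def hf_file_request(repo_id: str, filename: str, token: Optional[str] = None,
                    revision: str = "main") -> urllib.request.Request:
    """Build an (optionally authenticated) request for one file in an HF dataset repo.

    If no token is passed, falls back to the HF_TOKEN environment variable.
    An empty token means "send no Authorization header".
    """
    if token is None:
        token = os.environ.get("HF_TOKEN")
    path = urllib.parse.quote(filename)  # quote() keeps "/" so nested paths survive
    url = f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{path}"
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return urllib.request.Request(url, headers=headers)
```

A download would then be `urllib.request.urlopen(hf_file_request("some-user/some-dataset", "data/part-0.jsonl"), timeout=30)`, optionally wrapped in a retry loop.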
Good and bad news: I'm no longer getting the disconnects with the container rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2, but unfortunately I was unable to find the cause 😿
Sounds like it was something unrelated, which is good! Glad you got it sorted out.
I can get about 10 minutes into training before it crashes. This is consistent across multiple attempts, and my internet connection is 600 Mbit/s.