SmerkyG / gptcore

Fast modular code to create and train cutting edge LLMs
Apache License 2.0

Dataset issue crashes training #2

Closed: Iron-Bound closed this issue 6 months ago

Iron-Bound commented 7 months ago

So I can get about 10 minutes into training before it crashes; it's consistent across multiple attempts, and my internet connection is 600 Mbit.

  1. Is there a way to cache the files in advance or add auth to huggingface?
  2. Would I need to download the 800 GB Pile dataset as a fix?
'(ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')), '(Request ID: fd98ddf2-4690-4322-b018-8226d93d2e7d)')' thrown while requesting GET https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/14.jsonl.zst
SmerkyG commented 7 months ago

Sorry, that's frustrating. I've seen this happen once in a while in the past but not consistently the way it's happening for you. I don't know of a way to auth HF but if you find one let me know and I'd be happy to add support for it to gptcore.

The streaming solution is meant as an easy way for developers/researchers to quickly get to testing without doing a ton of setup, especially on rented instances where there's not a lot of persistent storage, or where you don't want to pay for the expensive persistent storage offered.
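
For reference, the streaming path boils down to something like this (a minimal sketch using the datasets library; the actual gptcore code differs in detail):

from datasets import load_dataset

# Shards are fetched lazily over HTTP as training consumes them,
# which is why a dropped connection surfaces mid-training
ds = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
for example in ds.take(3):
    print(example["text"][:80])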

Downloading your dataset is always an option, but as you mention it may be quite large. You can download a single part of it, though: this dataset is split into many files, so you can fetch just one and change the code in pile.py to refer to it. You could also switch to a smaller HF dataset and download it, like teven/enwiki10k or teven/enwiki100k.
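
For example, assuming the usual Hub layout for this dataset, grabbing one shard and loading it locally might look like this (the shard filename here is hypothetical; check the repo for the actual names):

from huggingface_hub import hf_hub_download
from datasets import load_dataset

# Fetch a single shard instead of the full ~800 GB dataset
path = hf_hub_download(
    repo_id="monology/pile-uncopyrighted",
    filename="train/00.jsonl.zst",  # hypothetical shard name
    repo_type="dataset",
)
# The json loader decompresses .jsonl.zst transparently (needs zstandard installed)
ds = load_dataset("json", data_files=path, split="train")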

Another option, which I've used before with mixed success (sometimes the CDN can be a little flaky), is to download your dataset and host it somewhere like a CDN. This can be quite cost-effective and gives you full control over the data.

Iron-Bound commented 7 months ago

Hey thanks for the response,

As you said, the streaming thing is really neat, so I'll take a look at how file transfers are done in the official huggingface-cli library; maybe I'll dig up some clues.

If all else fails, I'll learn how the datasets factory works and solve it with the minipile dataset for testing.
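
For reference, assuming the JeanKaddour/minipile mirror on the Hub, it should load as roughly:

from datasets import load_dataset

# MiniPile: a much smaller subset of the Pile (about 1M documents)
ds = load_dataset("JeanKaddour/minipile", split="train")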

Iron-Bound commented 7 months ago

So I feel the issue isn't the network, as I'm able to curl the files it lists as having an issue. Something in my ROCm container may be messing with the data workers; give me some time to investigate.
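
For anyone hitting something similar: a generic way to rule the data workers in or out (standard PyTorch debugging, not gptcore-specific) is to force loading into the main process:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100).unsqueeze(1))  # stand-in dataset
# num_workers=0 keeps loading in the main process, bypassing worker
# forking and shared-memory (/dev/shm) limits that containers often restrict
loader = DataLoader(dataset, batch_size=8, num_workers=0)
for (batch,) in loader:
    pass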

SmerkyG commented 7 months ago

Thanks for looking into it!

VatsaDev commented 7 months ago

> add auth to huggingface?

I'm not fully sure what you mean, but could you not use

from huggingface_hub import login
login("API-KEY")
Iron-Bound commented 6 months ago

Good and bad news: I'm no longer getting the disconnect with the container rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2, but I was unable to find the cause, unfortunately 😿

SmerkyG commented 6 months ago

Sounds like it was something unrelated, which is good! Glad you got it sorted out.