PrimeIntellect-ai / prime

prime (previously called ZeroBand) is a framework for efficient, globally distributed training of AI models over the internet.
Apache License 2.0

Data subset pull script is problematic. #114

Open Nottlespike opened 1 week ago

Nottlespike commented 1 week ago

Related to PR #111. The script assumes the user has Git LFS installed, but the documentation does not explain how to install, set up, and initialize Git LFS. For most users the script will very likely break when they try to run it. A much better implementation would likely be to integrate this GitHub gist, https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f#file-hfd-sh, into the project, since its only dependencies are git and aria2c/wget. Integrating it has other advantages too: you would no longer need SSH keys, which add complexity to the project and can cause port issues, since the port is likely to be busy while I SSH into my cloud instance to set up prime.
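To make the dependency problem concrete, here is a minimal sketch of a pre-flight check that reports missing tools before any download starts. It is not taken from the repository or from the gist; the function names `missing_tools` and `pick_downloader` are hypothetical, and the only assumption is that the tools are discoverable on `PATH` under the names `git`, `git-lfs`, `aria2c`, and `wget`.

```python
import shutil


def missing_tools(required):
    """Return the names in `required` that cannot be found on PATH."""
    return [name for name in required if shutil.which(name) is None]


def pick_downloader(candidates=("aria2c", "wget")):
    """Return the first downloader found on PATH, or None if none is installed."""
    for name in candidates:
        if shutil.which(name) is not None:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical check for the Git LFS based approach discussed in this issue.
    missing = missing_tools(["git", "git-lfs"])
    if missing:
        print("Missing required tools: " + ", ".join(missing))
    print("Downloader:", pick_downloader() or "none found")
```

A check like this would at least turn a confusing mid-download failure into an explicit message about what needs to be installed.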

Jackmin801 commented 1 week ago

Thanks for the suggestion! It would definitely be easier if we could pull Hugging Face shards through their HTTP server. My experience is that the HTTP server eventually rate limits you, which is not the case for Git LFS pulls. This was true last year and could be different now, though.

Nottlespike commented 1 week ago

@Jackmin801 So I used this script to help a LOT of people download L3.1 405B when Git LFS timed out for them. You can also pass the HF read token from the env file to prevent rate limiting, but I have never seen rate limiting with this script, even when pulling anonymously.
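As a rough illustration of passing a read token over HTTP, the sketch below builds (but does not send) an authenticated request for a single dataset file. It is not how the gist does it. The environment variable name `HF_TOKEN`, the helper name `build_shard_request`, and the repository and file names in the usage example are all assumptions made for illustration; the URL follows the Hugging Face `resolve` pattern for dataset files.

```python
import os
import urllib.request


def build_shard_request(repo_id, filename, revision="main", token=None):
    """Build an HTTP request for one dataset file, adding a bearer token if available.

    If `token` is None, the token is read from the HF_TOKEN environment
    variable (an assumed name). With no token, the request is anonymous.
    """
    url = f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{filename}"
    request = urllib.request.Request(url)
    if token is None:
        token = os.environ.get("HF_TOKEN")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    return request


if __name__ == "__main__":
    # Hypothetical repository and shard names, for illustration only.
    req = build_shard_request("some-org/some-dataset", "data/shard-00000.parquet")
    print(req.full_url)
    print("authenticated:", req.get_header("Authorization") is not None)
```

The request object can then be handed to `urllib.request.urlopen`, or the same header can be passed to aria2c or wget; whether a token is needed at all to avoid rate limiting is exactly the open question in this thread.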

Jackmin801 commented 1 week ago

Ah ok. Let's change it then. Do you think you could write the PR for this? Otherwise I can do it tomorrow. Thanks again for the suggestion :)

Nottlespike commented 1 week ago

I'm going to take a shot at the PR as I have integrated it before and use it extensively personally.

Nottlespike commented 1 week ago

@Jackmin801 I have submitted a PR and fixed a possible "security" issue introduced by the current ~/10B/H100.toml. It is best to verify that the subset logic is working as intended.