llama : make vocabs LFS objects?

ggerganov commented 5 months ago

It's nice to have a collection of vocabs using different pre-tokenizers in order to test tokenization more widely. However, the number of vocab files controlled in the repo will keep growing:

https://github.com/ggerganov/llama.cpp/tree/master/models

These files are typically a few MB, so the repo size is significantly affected by them.

One option is to make these files LFS objects. Another option is to not source control them and either remove the tests, or generate them on the fly. But the latter might be flaky because we will depend on many 3rd party repositories to provide the tokenizers.

Are there any better alternatives?

Update: git lfs is not an option. I think for the short term we will commit vocabs only for new types of pre-tokenizers. The vocab data compresses relatively well (by a factor of ~3x), so hopefully the repo size will not be affected too badly.
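For anyone who wants to sanity-check the ~3x figure on a particular file, here is a minimal sketch. Git compresses stored objects with zlib, so `zlib.compress` gives a rough estimate of the on-disk cost; it ignores pack deltas, so treat the number as approximate. The `sample` data is a made-up stand-in, not a real vocab file.

```python
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return original_size / compressed_size using zlib (DEFLATE)."""
    if not data:
        raise ValueError("data must be non-empty")
    compressed = zlib.compress(data, level)
    return len(data) / len(compressed)


# Hypothetical stand-in for a vocab file: many short, similar token strings.
sample = b"\n".join(b"token_%d" % i for i in range(10000))
print(f"ratio: {compression_ratio(sample):.2f}")
```

To try it on a real file, read the file's bytes with `open(path, "rb").read()` and pass them to `compression_ratio`.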

turian commented 5 months ago

LFS seems like the best choice, since it gives the project good experimental control in its tests.

I do think the project should control the vocabularies used for testing, and not 3rd parties. Hosting them off GitHub is fine, but I don't see a reason for that besides size. An alternative would be to create a second repo, e.g. llama.cpp-full-test-suite, that contains the bigger assets, and perhaps use that as a submodule. Submodules are a bit gross, but you could just pull llama.cpp-full-test-suite into the current directory when doing 'make test' or CI/CD, if you don't think every user should wait for large repo pulls.
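A rough sketch of the "pull only when testing" idea, assuming the test runner calls a helper before the tokenizer tests start. Everything named here (`ensure_asset`, the file name, the `fetch` callback) is hypothetical; the actual download mechanism is left to the caller on purpose.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_asset(name: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return the local path for `name`, calling `fetch` only if it is not cached."""
    target = cache_dir / name
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        data = fetch(name)
        # Write to a temporary name first so an interrupted run
        # never leaves a truncated file under the final name.
        tmp = target.with_name(target.name + ".part")
        tmp.write_bytes(data)
        tmp.replace(target)
    return target


# Demo with a fake fetcher; a real one would download from the assets repo.
def demo_fetch(name: str) -> bytes:
    return b"fake vocab bytes for " + name.encode()


with tempfile.TemporaryDirectory() as d:
    path = ensure_asset("ggml-vocab-example.gguf", Path(d), demo_fetch)
    print(path.name, path.stat().st_size)
```

Regular users who never run the tests never trigger `fetch`, so only CI and contributors pay for the larger download.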

teleprint-me commented 5 months ago

Syncing will become painful as things progress over time if we add another repository. Perhaps IPFS is an option? I have little experience with it, but as centralization continues to overtake everything, I see this as a possible alternative. The issue is that IPFS is a protocol, not an implementation, so this will not be easy. This could probably be its own repository though, and could support hosting necessary models, vocabs, tools, etc. in a reliably distributed fashion. You would still need a server to speed things up because it is incredibly slow on its own, and dealing with growing file sizes over time will put pressure on the nodes supporting it. Hard disk space is cheap, but bandwidth is pricey, so that's something to keep in mind. Taking a look at how the Linux repo is managed might help because it is massive and takes forever to clone and open up locally. Linus now requires a Threadripper just to compile it in a reasonable amount of time.

I see a few options:

1. Accept 3rd party support and reliance, and attempt to be prepared for any issues that arise due to that reliance.
2. Accept the need for LFS even if it isn't wholly reliable, and expect issues with it. I had to create a custom tool for managing my huggingface repos because of this. LFS is just not reliable with model files. They're too big.
3. Opt for a distributed implementation. There are multiple options, and the community support and traction are there, so there may be a window of opportunity here.

These are just some off-the-cuff suggestions and should be taken with a grain of salt. Hopefully it sparks some discussion and inspires some ideas.

turian commented 5 months ago

@teleprint-me what sort of bad experience have you had with LFS? What does your custom tool do, what sort of issues does it fix?

I haven't had issues with LFS for MB files, but yeah I can imagine there might be issues for GB-scale files.

teleprint-me commented 5 months ago

It's very slow. It's an extremely common issue that's only obvious afterwards. I've experienced dropped packets, disconnections from the server, partial clones even with LFS properly installed and configured, etc. I had a nice reminder of why I avoid LFS once I started using HF. There's an article that explains in depth why git and git-lfs have issues with large files and repos, but it'll take some time to find it.

It depends on the host too. Some are better than others. I've never had an issue with cloning and syncing with the Linux repo, but it takes forever. It's only gotten larger as well because it is a monorepo. I stopped tinkering with the source code itself because of life stuff, and that was a while ago.

I find that simply requesting the file(s) itself is much more straightforward. What's both fun and interesting is that IPFS is actually how Llama 1 is hosted at the moment.

Sauce: ipfs://bafybeif6nnhgmnxmatyniewkmrzsjyfqampiwg2hcatjahggvgo3rey46a/

Note that this link will not work unless you already have an IPFS node set up. I use a gateway because it's easier to access. GitHub doesn't seem to like the link either. The links are an eyesore, but they're mappable to colloquial domain names. Note: Be sure to validate the hash.
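For the "validate the hash" part, a minimal sketch, assuming the publisher gives you a SHA-256 hex digest for the file through some trusted channel. That is an assumption on my part: an IPFS CID is generally not the plain SHA-256 of the file's bytes, so this does not check a file against a CID. The function names are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_sha256(path: Path, expected_hex: str) -> bool:
    """Return True if the file's SHA-256 matches the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a throwaway file; the "expected" digest is computed locally here,
# whereas in practice it would come from the publisher.
with tempfile.TemporaryDirectory() as d:
    demo_file = Path(d) / "demo.bin"
    demo_file.write_bytes(b"hello")
    expected = hashlib.sha256(b"hello").hexdigest()
    print(verify_sha256(demo_file, expected))
```

Checking the digest right after download, before the file is used by any test, keeps a corrupted or swapped file from silently changing test results.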

ggerganov commented 5 months ago

Unfortunately git lfs is not an option - too expensive:

teleprint-me commented 5 months ago

@ggerganov That is expensive.

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

llama : make vocabs LFS objects? #7128