abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

_LlamaModel.metadata() does not return `tokenizer.ggml.tokens` #1495

Open · ddh0 opened this issue 1 month ago

ddh0 commented 1 month ago


Expected Behavior

The dictionary returned by _LlamaModel.metadata() should include tokenizer.ggml.tokens as a key, so the vocabulary of the model can be accessed from the high-level API.

Current Behavior

The dictionary returned by _LlamaModel.metadata() does not include tokenizer.ggml.tokens as a key, so the vocabulary of the model cannot be accessed from the high-level API.

Environment and Context

Running the latest llama-cpp-python built from source (package version 0.2.76 at the time of writing).

Steps to Reproduce
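A minimal sketch of a reproduction (not the original report's script). It assumes the high-level wrapper exposes the parsed metadata as `Llama.metadata`, which is populated from `_LlamaModel.metadata()`:

```python
from llama_cpp import Llama

# Load any GGUF model; path is a placeholder.
llm = Llama(model_path="./model.gguf", verbose=False)

meta = llm.metadata
print("tokenizer.ggml.model" in meta)   # True  - scalar values are present
print("tokenizer.ggml.tokens" in meta)  # False - array values are missing
```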

ddh0 commented 1 month ago

For reference, here is the code I am currently using to read GGUF metadata from a file, including tokenizer.ggml.tokens.

This code is not as robust as the code currently in llama-cpp-python, but I have been using it for a long time and it works as expected with various models; you can cross-reference its output against the llama.cpp backend output to verify this.

Hopefully this code can be useful as a reference while looking into this issue.
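A minimal sketch of that approach (not the exact script referenced above): it reads the key/value metadata straight out of the GGUF header with `struct`, without loading any tensor data, and assumes a little-endian GGUF v2/v3 layout.

```python
import struct

# GGUF scalar metadata value types -> (struct format, size in bytes)
GGUF_TYPES = {
    0: ("B", 1),   # UINT8
    1: ("b", 1),   # INT8
    2: ("H", 2),   # UINT16
    3: ("h", 2),   # INT16
    4: ("I", 4),   # UINT32
    5: ("i", 4),   # INT32
    6: ("f", 4),   # FLOAT32
    7: ("?", 1),   # BOOL
    10: ("Q", 8),  # UINT64
    11: ("q", 8),  # INT64
    12: ("d", 8),  # FLOAT64
}
GGUF_STRING = 8
GGUF_ARRAY = 9

def _read_string(f):
    # GGUF string: uint64 length followed by UTF-8 bytes
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8", errors="replace")

def _read_value(f, vtype):
    if vtype == GGUF_STRING:
        return _read_string(f)
    if vtype == GGUF_ARRAY:
        # array: uint32 element type, uint64 count, then the elements
        etype, count = struct.unpack("<IQ", f.read(12))
        return [_read_value(f, etype) for _ in range(count)]
    fmt, size = GGUF_TYPES[vtype]
    (value,) = struct.unpack("<" + fmt, f.read(size))
    return value

def read_gguf_metadata(path):
    """Return all key/value metadata from a GGUF file's header."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        metadata = {}
        for _ in range(n_kv):
            key = _read_string(f)
            (vtype,) = struct.unpack("<I", f.read(4))
            metadata[key] = _read_value(f, vtype)
        return metadata

# meta = read_gguf_metadata("model.gguf")
# print(len(meta["tokenizer.ggml.tokens"]))  # vocabulary is available here
```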

Thanks, @abetlen!

CISC commented 1 month ago

The problem is that the llama.cpp API does not return metadata arrays.

Using gguf.py to read metadata in the same fashion you do would solve that, but it would technically mean loading the model twice, which is not a good solution. :(
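For context, the metadata that `_LlamaModel.metadata()` can see comes from llama.cpp's per-key string API, sketched roughly below (treat the exact wrapper signatures in the `llama_cpp` low-level bindings as an assumption). Array-valued keys such as `tokenizer.ggml.tokens` never show up at this layer.

```python
import ctypes
import llama_cpp

def dump_metadata(model):
    """Enumerate model metadata through the llama.cpp C API.

    `model` is a loaded llama_model pointer; only scalar (stringified)
    values are returned, array-valued keys are not exposed here.
    """
    meta = {}
    buf = ctypes.create_string_buffer(4096)
    for i in range(llama_cpp.llama_model_meta_count(model)):
        if llama_cpp.llama_model_meta_key_by_index(model, i, buf, len(buf)) >= 0:
            key = buf.value.decode("utf-8")
            if llama_cpp.llama_model_meta_val_str_by_index(model, i, buf, len(buf)) >= 0:
                meta[key] = buf.value.decode("utf-8")
    return meta
```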

ddh0 commented 1 month ago

The code I shared doesn't load the model, it just reads bytes from the header of the GGUF file, unless I'm woefully misunderstanding something. Could you please clarify what you mean by loading the model twice?

CISC commented 1 month ago

I meant that if we go the route of side-loading the metadata, it would probably be best to use the official gguf.py. Even though it doesn't load the whole model into memory, it technically means opening the file twice and duplicating at least some of that data.
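A rough sketch of that side-loading route using the official `gguf` package (gguf-py from the llama.cpp repo); the field/attribute names below reflect the `GGUFReader` API as I understand it and may need adjusting for the installed version:

```python
from gguf import GGUFReader

reader = GGUFReader("model.gguf")  # placeholder path; header-only read, no tensor data needed

field = reader.fields.get("tokenizer.ggml.tokens")
if field is not None:
    # For a string-array field, `data` holds indices into `parts`, and each
    # referenced part is the raw UTF-8 bytes of one token.
    tokens = [bytes(field.parts[i]).decode("utf-8") for i in field.data]
    print(len(tokens), tokens[:5])
```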

The better approach would be to add support for metadata arrays in the llama.cpp API, but I guess arrays are a little cumbersome to expose cleanly through a C ABI.

ddh0 commented 1 month ago

Oh, I see. I'll take a look at gguf.py and see if I can make a PR that would help sometime soon. Thanks.

CISC commented 1 month ago

Do note that the currently published gguf.py is quite outdated compared to the llama.cpp master branch, but hopefully it will get bumped soon.

ddh0 commented 1 month ago

I am working on a proper fix for this: #1497 #1525