RoyLai-InfoCorp opened this issue 3 weeks ago
It's not utf-8.
What is done here is the tokenization of the prompt (to get a sense of how many tokens the prompt consumes in the LLM context window).
To perform this tokenization, the library dynamically creates a codec and registers it in the Python codec registry (here: https://github.com/advanced-stack/py-llm-core/blob/main/src/llm_core/token_codecs.py).
If you encounter a codec encode error, please share the complete traceback.
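For context, the registration mechanism described above can be sketched as follows. This is a minimal toy illustration of Python's codec registry, not py-llm-core's actual code; the word-level "tokenizer" here is a hypothetical stand-in for the model's real tokenizer:

```python
import codecs

# Toy word-level vocabulary standing in for the real model tokenizer
# (hypothetical; the library builds its codec from the model's tokenizer).
VOCAB = {}

def toy_encode(text, errors="strict"):
    ids = [VOCAB.setdefault(word, len(VOCAB)) for word in text.split()]
    # A codec encoder must return (encoded_object, length_consumed)
    return bytes(ids), len(text)

def toy_decode(data, errors="strict"):
    inverse = {v: k for k, v in VOCAB.items()}
    return " ".join(inverse[b] for b in data), len(data)

def lookup(name):
    # The registry normalizes names before calling registered search functions
    if name == "toy_tokenizer":
        return codecs.CodecInfo(encode=toy_encode, decode=toy_decode, name=name)
    return None

codecs.register(lookup)

# The token count of a prompt is the length of the encoded result
tokens = codecs.encode("hello world hello", "toy_tokenizer")
print(len(tokens))  # 3
```

Counting tokens this way lets the library check a prompt against the context window without calling the model itself.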
I wrote a snippet to test.
import os

model = "~/models/Mistral-7B-Instruct-v0.3-GGUF/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf"
model = os.path.expanduser(model)

from llm_core.parsers import LLaMACPPParser

with LLaMACPPParser(Book, model=model) as parser:
    book = parser.parse(text)
It failed at parser.parse(). Here's the traceback:
Traceback (most recent call last):
File "/tmp/tmp/temp.py", line 29, in
Please note I am using the model's absolute path instead of the model name. Could this be the cause?
Ah yes
The model argument should be just the filename, Mistral-7B-Instruct-v0.3-Q4_K_M.gguf.
The directory where the model is located should be configured via the MODELS_CACHE_DIR
environment variable (see https://github.com/advanced-stack/py-llm-core/blob/main/src/llm_core/settings.py).
When loading the model, the full path is built by joining that directory with the model name:
https://github.com/advanced-stack/py-llm-core/blob/main/src/llm_core/llm/base.py#L30
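Concretely, the resolution works roughly like this sketch (the fallback default directory shown is an assumption for illustration; check settings.py for the library's actual default):

```python
import os

# Sketch of the model path resolution: MODELS_CACHE_DIR supplies the
# directory, and `model` must be just the GGUF filename.
# The "~/models" fallback is an assumption, not the library's actual default.
models_cache_dir = os.path.expanduser(
    os.environ.get("MODELS_CACHE_DIR", "~/models")
)
model = "Mistral-7B-Instruct-v0.3-Q4_K_M.gguf"
model_path = os.path.join(models_cache_dir, model)
print(model_path)
```

So with MODELS_CACHE_DIR pointing at the directory that contains the GGUF file, passing the bare filename should resolve to the same file the absolute path pointed at.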
I get a codecs.encode error. Should the encoding be "utf-8"? I can submit a PR if you like.