abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

CausalLM ERROR: byte not found in vocab #840

Open ArtyomZemlyak opened 11 months ago

ArtyomZemlyak commented 11 months ago

Prerequisites


Expected Behavior

Run the CausalLM model without errors: https://huggingface.co/TheBloke/CausalLM-14B-GGUF/tree/main

Current Behavior

Error when loading the model (llama-cpp-python installed through pip, not from source):

...
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q5_1:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
ERROR: byte not found in vocab: '
'
fish: Job 1, 'python server.py --api --listen…' terminated by signal SIGSEGV (Address boundary error)

https://github.com/ggerganov/llama.cpp/issues/3732

Environment and Context

Docker container with the latest image

Failure Information (for bugs)


Steps to Reproduce

Failure Logs

Gincioks commented 11 months ago

Same

alienatorZ commented 11 months ago

I am glad to know it's not just me.

thekitchenscientist commented 11 months ago

I am facing the same issue using the Q5_0.gguf. The missing byte in the vocab varies. Sometimes all the spaces between words are replaced with !, other times there are no spaces between words in the output.

jorgerance commented 11 months ago

This has already been discussed in llama.cpp. The team behind CausalLM and TheBloke are aware of the issue, which is caused by the "non-standard" vocabulary the model uses. The last time I tried, CPU inference with the GGUF was already working, and according to the latest comments on one of the related llama.cpp issues, inference seems to be running fine on GPU too: https://github.com/ggerganov/llama.cpp/issues/3740

jorgerance commented 11 months ago

Got it working. It's quite straightforward; just follow the steps below.

Try reinstalling llama-cpp-python as follows (I would advise using a Python virtual environment but that's a different topic):

CMAKE_ARGS='-DLLAMA_CUBLAS=on -DLLAMA_CUDA_MMV_Y=8 -DCMAKE_CUDA_ARCHITECTURES=native' CUDACXX=/usr/local/cuda-11.8/bin/nvcc FORCE_CMAKE=1 pip install git+https://github.com/abetlen/llama-cpp-python.git --force-reinstall --no-cache-dir --verbose --upgrade

Set CUDACXX to the path of the nvcc version you intend to use. I have versions 11.8 and 12 installed and succeeded with 11.8 (I didn't try to build with version 12).
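
A quick sanity check that the rebuilt wheel is actually the one being imported (assuming a reasonably recent release, which exposes __version__):

python -c 'import llama_cpp; print(llama_cpp.__version__)'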

Afterwards, generation with CausalLM 14B works smoothly.

python llama_cpp_server_v2.py causallm
Model causallm is not in use, starting...
Starting model causallm

······················ Settings for causallm  ······················
> model:                            /root/code/localai/models/causallm_14b.Q5_1.gguf
> model_alias:                                                              causallm
> seed:                                                                   4294967295
> n_ctx:                                                                        8192
> n_batch:                                                                       128
> n_gpu_layers:                                                                   45
> main_gpu:                                                                        0
> rope_freq_base:                                                                0.0
> rope_freq_scale:                                                               1.0
> mul_mat_q:                                                                       1
> f16_kv:                                                                          1
> logits_all:                                                                      1
> vocab_only:                                                                      0
> use_mmap:                                                                        1
> use_mlock:                                                                       1
> embedding:                                                                       1
> n_threads:                                                                       4
> last_n_tokens_size:                                                            128
> numa:                                                                            0
> chat_format:                                                                chatml
> cache:                                                                           0
> cache_type:                                                                    ram
> cache_size:                                                             2147483648
> verbose:                                                                         1
> host:                                                                      0.0.0.0
> port:                                                                         8040
> interrupt_requests:                                                              1
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from /root/code/localai/models/causallm_14b.Q5_1.gguf (version unknown)
llama_model_loader: - tensor    0:                token_embd.weight q5_1     [  5120, 152064,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
[...]
llm_load_print_meta: model ftype      = mostly Q5_1
llm_load_print_meta: model params     = 14.17 B
llm_load_print_meta: model size       = 9.95 GiB (6.03 BPW) 
llm_load_print_meta: general.name   = causallm_14b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token  = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  557.00 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 9629.41 MB
...........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 6400.00 MB
llama_new_context_with_model: kv self size  = 6400.00 MB
llama_new_context_with_model: compute buffer total size = 177.63 MB
llama_new_context_with_model: VRAM scratch buffer: 171.50 MB
llama_new_context_with_model: total VRAM used: 16200.92 MB (model: 9629.41 MB, context: 6571.50 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
INFO:     Started server process [735011]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8040 (Press CTRL+C to quit)
INFO:     127.0.0.1:52746 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:47452 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:47456 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:47468 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:47484 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:47500 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:52300 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:52302 - "OPTIONS / HTTP/1.0" 200 OK
·································· Prompt ChatML ··································

<|im_start|>system
You are a helpful assistant. You can help me by answering my questions. You can also ask me questions.<|im_end|>
<|im_start|>user
Is this chat working?<|im_end|>
<|im_start|>assistant

INFO:     127.0.0.1:52314 - "POST /v1/chat/completions HTTP/1.1" 200 OK

llama_print_timings:        load time =     173.04 ms
llama_print_timings:      sample time =      12.94 ms /    29 runs   (    0.45 ms per token,  2240.77 tokens per second)
llama_print_timings: prompt eval time =     172.94 ms /    64 tokens (    2.70 ms per token,   370.07 tokens per second)
llama_print_timings:        eval time =     434.08 ms /    28 runs   (   15.50 ms per token,    64.50 tokens per second)
llama_print_timings:       total time =    1158.61 ms
INFO:     127.0.0.1:52328 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:52334 - "OPTIONS / HTTP/1.0" 200 OK
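
For reference, the POST /v1/chat/completions request shown above can be reproduced with plain curl against the OpenAI-compatible route; the alias causallm and port 8040 come from the settings printed earlier, so adjust them to your setup:

curl http://127.0.0.1:8040/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "causallm",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Is this chat working?"}
        ]
      }'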

Feel free to ping me if you don't succeed.

javaarchive commented 11 months ago

Odd, I'm still getting the crash with the error. I'm trying to apply this fix to the text-generation-webui Docker container, running the following in Portainer on the container (a quick ls of /usr/local/ shows CUDA 12.1 is present in that directory):

source venv/bin/activate
CMAKE_ARGS='-DLLAMA_CUBLAS=on -DLLAMA_CUDA_MMV_Y=8 -DCMAKE_CUDA_ARCHITECTURES=native' CUDACXX=/usr/local/cuda-12.1/bin/nvcc FORCE_CMAKE=1 pip install git+https://github.com/abetlen/llama-cpp-python.git --force-reinstall --no-cache-dir --verbose --upgrade

Edit: never mind, nvcc was missing from that directory.

NotSpooky commented 10 months ago

I'm assuming for AMD GPUs the command would be different (due to the lack of nvcc). I'm currently getting the error for llama-cpp-python 0.2.18 with a 6800 XT on Manjaro Linux (CPU works fine).
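
For AMD, a hedged sketch only (untested here): llama.cpp builds against ROCm via hipBLAS rather than cuBLAS, so the reinstall should look roughly like the line below, with no CUDACXX needed; the exact CMake flag and compiler paths depend on your ROCm install:

CMAKE_ARGS='-DLLAMA_HIPBLAS=on' FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --no-cache-dir --verbose --upgrade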