abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

LLama cpp problem (GPU support) #509

Open xajanix opened 1 year ago

xajanix commented 1 year ago

Hello, I am a complete newbie when it comes to the subject of LLMs. I installed a GGML model in the oobabooga webui and tried to use it. It works fine, but only from RAM; it only uses 0.5 GB of VRAM, and I don't have any way to change that (offload some layers to the GPU). Even pasting the line "--n-gpu-layers 10" into the webui doesn't work. So I started searching, and one of the answers is this command:

pip uninstall -y llama-cpp-python
set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir

But that doesn't work for me. After pasting it I got:

 [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

And it completely broke the llama folder: it uninstalled it and did nothing more. I had to update the webui and download llama.cpp again to fix it, because I don't have any other way to get it.

I also tried the compile-from-source method, but that didn't work either. When I paste CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python into CMD / the oobabooga CMD window, I always get this message:

'CMAKE_ARGS' is not recognized as an internal or external command,
operable program or batch file.

or

'FORCE_CMAKE' is not recognized as an internal or external command,
operable program or batch file.

The same goes for the "make" command: it is not recognized even though I have make and CMake installed.

Also, when I launch the webui and choose a GGML model, I get something like this in the console:

llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.14 MB
llama_model_load_internal: mem required = 19712.68 MB (+ 3124.00 MB per state)
llama_new_context_with_model: kv self size = 3120.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
2023.07.19 23:05:22 INFO:Loaded the model in 8.17 seconds.

I am using Windows and an Nvidia card.

Is there an easy solution to enable offloading layers to the GPU that doesn't require installing a ton of stuff?

rajivmehtaflex commented 1 year ago

The same thing happened to me. I'm using Google Colab; it works perfectly on CPU, but the side effect is slow responses, even though I've applied the following commands:

    CMAKE_ARGS="-DLLAMA_CLBLAST=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
    pip install "llama-cpp-python[server]" --no-cache-dir
    python -m llama_cpp.server --model /content/model/llama-2-13b-chat.ggmlv3.q2_K.bin --host 127.0.0.1 --port 8889 --n_gpu_layers 70 --n_ctx 4096

There is no GPU offloading.

kdubba commented 1 year ago

The CPU is working fine for the latest version of the server, but the moment I offload layers to GPU, I get gibberish. Something is messing up in the GPU offloading of layers.

glaudiston commented 1 year ago

I was able to make it work using LLAMA_CPP_LIB pointing to a libllama.so file compiled with GGML_USE_CUBLAS.

rajivmehtaflex commented 1 year ago

GGML_USE_CUBLAS

Can you give me a bit more detail, i.e. where and how to use LLAMA_CPP_LIB and GGML_USE_CUBLAS? I only know Python, and I'm not able to find any reference to LLAMA_CPP_LIB. Could you kindly confirm the full command, like the following? make clean && GGML_USE_CUBLAS=1 make libllama.so

glaudiston commented 1 year ago

To build libllama.so with GPU support you need to have the CUDA SDK installed, then:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
export CUDA_HOME=/your/cuda/home/path/here
export PATH=${CUDA_HOME}/bin:$PATH
export LLAMA_CUBLAS=on
make clean
make libllama.so

Note that the g++ compiler will add the -DGGML_USE_CUBLAS compiler flag, and the build will create a file called libllama.so in the current directory. Check it with:

ls -l libllama.so

After that you can force llama-cpp-python to use that lib with:

export LLAMA_CPP_LIB=/path/to/your/libllama.so

After that, it worked with GPU support here. Of course, you have to initialize your model with something like:

llm = Llama(
    ...
    n_gpu_layers=20,
    ...
)

Hope it helps.
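
For reference, a minimal end-to-end sketch of this approach in Python (the paths are placeholders; LLAMA_CPP_LIB has to be set before llama_cpp is imported, because the shared library is loaded at import time):

import os

# Must be set before importing llama_cpp; placeholder path, point it at your own build.
os.environ["LLAMA_CPP_LIB"] = "/path/to/your/libllama.so"

from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/your/model.bin",  # placeholder model path
    n_gpu_layers=20,                       # number of layers to offload to the GPU
)

If the CUDA build was picked up, the load log should show "using CUDA for GPU acceleration" and a non-zero "offloaded X/Y layers to GPU" line.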

rajivmehtaflex commented 1 year ago

To build libllama.so with GPU support you need to have the CUDA SDK installed, then:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
export CUDA_HOME=/your/cuda/home/path/here
export LLAMA_CUBLAS=on
make libllama.so

Note that the g++ compiler will add the -DGGML_USE_CUBLAS compiler flag, and the build will create a file called libllama.so in the current directory. Check it with:

ls -l libllama.so

After that you can force llama-cpp-python to use that lib with:

export LLAMA_CPP_LIB=/path/to/your/libllama.so

After that, it worked with GPU support here. Of course, you have to initialize your model with something like:

llm = Llama(
    ...
    n_gpu_layers=20,
    ...
)

Hope it helps.

Thanks @glaudiston, it's working perfectly.

kdubba commented 1 year ago

@glaudiston, thanks for the response. I was under the impression that if I use "LLAMA_CUBLAS=1 pip install llama-cpp-python", it would take care of finding the .so library.

moseshu commented 1 year ago
make libllama.so

It gives me the error "LLAMA_ASSERT: llama.cpp:1800: !!kv_self.ctx". How can I solve it? The command is python -m llama_cpp.server --model model/ggml-model-f16-daogou.bin --port 7777 --host 127.0.0.1 --n_gpu_layers 32 --n_ctx 2048. The layers are not offloaded to the GPU.

di-rse commented 1 year ago
make libllama.so

It gives me the error "LLAMA_ASSERT: llama.cpp:1800: !!kv_self.ctx". How can I solve it? The command is python -m llama_cpp.server --model model/ggml-model-f16-daogou.bin --port 7777 --host 127.0.0.1 --n_gpu_layers 32 --n_ctx 2048. The layers are not offloaded to the GPU.

I'm getting the same error. Did you find a solution?

glaudiston commented 1 year ago

@kdubba, I understand that it does and even compiles it (I think). But it failed to build with the GPU flags described on the project page for some reason. What I did was manually set it to one I built by hand. I expect not to need to do that when the devs fix that issue (not even sure if they still need to fix it).

@moseshu and @di-rse, the llama-cpp-python project is a binding to the https://github.com/ggerganov/llama.cpp project. You have a better chance of getting help by posting your issue there. You can use the solution I've provided here once you get that lib working on your GPU.

di-rse commented 1 year ago

Thanks @glaudiston . The llama.cpp lib works absolutely fine with my GPU, so it's odd that the python binding is failing.

eusoubrasileiro commented 1 year ago

Thanks @glaudiston !!!

Well, I just wanted to run llama-cpp-python from the miniconda3 env of https://github.com/oobabooga/text-generation-webui.

In that case you just need to export LLAMA_CPP_LIB=/yourminicondapath/miniconda3/lib/python3.10/site-packages/llama_cpp_cuda/libllama.so before running your jupyter-notebook, ipython, python or whatever. In my case I added it to my .bashrc.

Voilà!!!!

On importing (from llama_cpp import Llama) I get:

ggml_init_cublas: found 1 CUDA devices: Device 0: NVIDIA GeForce GTX 1060, compute capability 6.1

And on

llm = Llama(model_path="/mnt/LxData/llama.cpp/models/meta-llama2/llama-2-7b-chat/ggml-model-q4_0.bin", 
            n_gpu_layers=28, n_threads=6, n_ctx=3584, n_batch=521, verbose=True)

...
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2381.32 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 28 repeating layers to GPU
llama_model_load_internal: offloaded 28/35 layers to GPU
llama_model_load_internal: total VRAM used: 3521 MB
...

YerongLi commented 1 year ago

Replacing libllama.so does not work for me. My llama-cpp-python hangs forever:

llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   70.41 MB (+   50.00 MB per state)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 3628 MB
..................................................................................................
llama_new_context_with_model: kv self size  =   50.00 MB
llama_new_context_with_model: compute buffer total size =   15.24 MB
llama_new_context_with_model: VRAM scratch buffer: 13.77 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 

To build libllama.so with GPU support you need to have the CUDA SDK installed, then:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
export CUDA_HOME=/your/cuda/home/path/here
export LLAMA_CUBLAS=on
make libllama.so

Note that the g++ compiler will add the -DGGML_USE_CUBLAS compiler flag, and the build will create a file called libllama.so in the current directory. Check it with:

ls -l libllama.so

After that you can force llama-cpp-python to use that lib with:

export LLAMA_CPP_LIB=/path/to/your/libllama.so

After that, it worked with GPU support here. Of course, you have to initialize your model with something like:

llm = Llama(
    ...
    n_gpu_layers=20,
    ...
)

Hope it helps.

Thanks @glaudiston, it's working perfectly.

cheburakshu commented 1 year ago

I have the same issue: llama.cpp works with the GPU but llama-cpp-python doesn't. One thing I found is that the params printed on the console are different when using the llama.cpp CLI vs llama-cpp-python. Note the BLAS param in the output.

Download model from HF

from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="TheBloke/OpenBuddy-Llama2-13B-v11.1-GGUF", filename="openbuddy-llama2-13b-v11.1.Q3_K_M.gguf")

llama.cpp

system_info: n_threads = 2 / 2 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 

llama-cpp-python

| FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
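
As a quick sanity check from Python, the same capability string can be printed directly; this is a sketch using llama_print_system_info, the low-level ctypes binding exposed by llama_cpp (it returns bytes in the 0.1.x releases):

import llama_cpp

# Prints the capability string shown above. BLAS = 1 means the loaded libllama
# was built with a BLAS/cuBLAS backend; BLAS = 0 means a CPU-only build was loaded.
print(llama_cpp.llama_print_system_info().decode("utf-8"))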

llama-cpp-python compilation and invocation:

!cd /content/llama.cpp && export LLAMA_CUBLAS=1 && make clean && make libllama.so

!pip uninstall llama-cpp-python -y
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

import os
os.environ['LLAMA_CPP_LIB']='/content/llama.cpp/libllama.so'
os.environ['LLAMA_CUBLAS']='on'

from llama_cpp import Llama
model_path = '/root/.cache/huggingface/hub/models--TheBloke--OpenBuddy-Llama2-13B-v11.1-GGUF/snapshots/ba7231efe4cdfc024950da959c83827ee303296f/openbuddy-llama2-13b-v11.1.Q3_K_M.gguf'
llm = Llama(model_path=model_path, n_gpu_layers=100)

Output:

I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native 
I CXXFLAGS: -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

rm -vrf *.o tests/*.o *.so *.dll benchmark-matmult build-info.h *.dot *.gcno tests/*.gcno *.gcda tests/*.gcda *.gcov tests/*.gcov lcov-report gcovr-report main quantize quantize-stats perplexity embedding vdot train-text-from-scratch convert-llama2c-to-ggml simple save-load-state server embd-input-test gguf llama-bench baby-llama beam-search tests/test-c.o metal tests/test-llama-grammar tests/test-grammar-parser tests/test-double-float tests/test-grad0 tests/test-opt tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0-llama tests/test-tokenizer-0-falcon tests/test-tokenizer-1
removed 'common.o'
removed 'console.o'
removed 'ggml-alloc.o'
removed 'ggml-cuda.o'
removed 'ggml.o'
removed 'grammar-parser.o'
removed 'k_quants.o'
removed 'llama.o'
removed 'tests/test-c.o'
removed 'libembdinput.so'
removed 'build-info.h'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'
removed 'vdot'
removed 'train-text-from-scratch'
removed 'convert-llama2c-to-ggml'
removed 'simple'
removed 'save-load-state'
removed 'server'
removed 'embd-input-test'
removed 'gguf'
removed 'llama-bench'
removed 'baby-llama'
removed 'beam-search'
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native 
I CXXFLAGS: -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -c llama.cpp -o llama.o
cc  -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native    -c ggml.c -o ggml.o
cc -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native  -c k_quants.c -o k_quants.o
nvcc --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu: In function ‘void ggml_cuda_op_alibi(const ggml_tensor*, const ggml_tensor*, ggml_tensor*, char*, float*, float*, float*, int64_t, int64_t, int64_t, int, CUstream_st*&)’:
ggml-cuda.cu:5711:58: warning: unused parameter ‘i02’ [-Wunused-parameter]
 5711 |     float * src0_ddf_i, float * src1_ddf_i, float * dst_ddf_i, int64_t i02, int64_t i01_low, int64_t i01_high, int i1,
      |                                                  ~~~~~~~~^~~
cc  -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native    -c ggml-alloc.c -o ggml-alloc.o
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -shared -fPIC -o libllama.so llama.o ggml.o k_quants.o ggml-cuda.o ggml-alloc.o -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
Found existing installation: llama-cpp-python 0.1.83
Uninstalling llama-cpp-python-0.1.83:
  Would remove:
    /usr/local/lib/python3.10/dist-packages/llama_cpp/*
    /usr/local/lib/python3.10/dist-packages/llama_cpp_python-0.1.83.dist-info/*
Proceed (Y/n)? y
  Successfully uninstalled llama-cpp-python-0.1.83
Collecting llama-cpp-python
  Using cached llama_cpp_python-0.1.83-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: typing-extensions>=4.5.0 in /usr/local/lib/python3.10/dist-packages (from llama-cpp-python) (4.7.1)
Requirement already satisfied: numpy>=1.20.0 in /usr/local/lib/python3.10/dist-packages (from llama-cpp-python) (1.23.5)
Requirement already satisfied: diskcache>=5.6.1 in /usr/local/lib/python3.10/dist-packages (from llama-cpp-python) (5.6.3)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.83
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 

llama.cpp compilation and invocation:

!cd /content/llama.cpp && export LLAMA_CUBLAS=1 && make clean && make

!/content/llama.cpp/main -m /root/.cache/huggingface/hub/models--TheBloke--OpenBuddy-Llama2-13B-v11.1-GGUF/snapshots/ba7231efe4cdfc024950da959c83827ee303296f/openbuddy-llama2-13b-v11.1.Q3_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 100

Output:

I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native 
I CXXFLAGS: -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

rm -vrf *.o tests/*.o *.so *.dll benchmark-matmult build-info.h *.dot *.gcno tests/*.gcno *.gcda tests/*.gcda *.gcov tests/*.gcov lcov-report gcovr-report main quantize quantize-stats perplexity embedding vdot train-text-from-scratch convert-llama2c-to-ggml simple save-load-state server embd-input-test gguf llama-bench baby-llama beam-search tests/test-c.o metal tests/test-llama-grammar tests/test-grammar-parser tests/test-double-float tests/test-grad0 tests/test-opt tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0-llama tests/test-tokenizer-0-falcon tests/test-tokenizer-1
removed 'ggml-alloc.o'
removed 'ggml-cuda.o'
removed 'ggml.o'
removed 'k_quants.o'
removed 'llama.o'
removed 'libllama.so'
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native 
I CXXFLAGS: -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

cc  -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native    -c ggml.c -o ggml.o
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -c llama.cpp -o llama.o
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -c common/common.cpp -o common.o
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -c common/console.cpp -o console.o
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -c common/grammar-parser.cpp -o grammar-parser.o
cc -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native  -c k_quants.c -o k_quants.o
nvcc --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu: In function ‘void ggml_cuda_op_alibi(const ggml_tensor*, const ggml_tensor*, ggml_tensor*, char*, float*, float*, float*, int64_t, int64_t, int64_t, int, CUstream_st*&)’:
ggml-cuda.cu:5711:58: warning: unused parameter ‘i02’ [-Wunused-parameter]
 5711 |     float * src0_ddf_i, float * src1_ddf_i, float * dst_ddf_i, int64_t i02, int64_t i01_low, int64_t i01_high, int i1,
      |                                                  ~~~~~~~~^~~
cc  -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native    -c ggml-alloc.c -o ggml-alloc.o
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/main/main.cpp ggml.o llama.o common.o console.o grammar-parser.o k_quants.o ggml-cuda.o ggml-alloc.o -o main -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 

====  Run ./main -h for help.  ====

g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/quantize/quantize.cpp ggml.o llama.o k_quants.o ggml-cuda.o ggml-alloc.o -o quantize -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/quantize-stats/quantize-stats.cpp ggml.o llama.o k_quants.o ggml-cuda.o ggml-alloc.o -o quantize-stats -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/perplexity/perplexity.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o perplexity -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/embedding/embedding.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o embedding -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation pocs/vdot/vdot.cpp ggml.o k_quants.o ggml-cuda.o ggml-alloc.o -o vdot -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o train-text-from-scratch -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
examples/train-text-from-scratch/train-text-from-scratch.cpp: In function ‘ggml_tensor* llama_build_train_graphs(my_llama_model*, ggml_allocr*, ggml_context*, ggml_cgraph*, ggml_cgraph*, ggml_cgraph*, ggml_tensor**, ggml_tensor*, ggml_tensor*, int, int, bool, bool)’:
examples/train-text-from-scratch/train-text-from-scratch.cpp:739:68: warning: ‘kv_scale’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  739 |             struct ggml_tensor * t16_1 = ggml_scale_inplace        (ctx, t16_0, kv_scale);          set_name(t16_1, "t16_1"); assert_shape_4d(t16_1, N, N, n_head, n_batch);
      |                                          ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp ggml.o llama.o k_quants.o ggml-cuda.o ggml-alloc.o -o convert-llama2c-to-ggml -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/simple/simple.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o simple -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/save-load-state/save-load-state.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o save-load-state -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -Iexamples/server examples/server/server.cpp ggml.o llama.o common.o grammar-parser.o k_quants.o ggml-cuda.o ggml-alloc.o -o server -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib  
g++ --shared -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/embd-input/embd-input-lib.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o libembdinput.so -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/embd-input/embd-input-test.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o embd-input-test -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib  -L. -lembdinput
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/gguf/gguf.cpp ggml.o llama.o k_quants.o ggml-cuda.o ggml-alloc.o -o gguf -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/llama-bench/llama-bench.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o llama-bench -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/baby-llama/baby-llama.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o baby-llama -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/beam-search/beam-search.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o beam-search -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
cc -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -O3 -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native  -c tests/test-c.c -o tests/test-c.o
Log start
main: build = 1170 (47068e5)
main: seed  = 1693756172
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /root/.cache/huggingface/hub/models--TheBloke--OpenBuddy-Llama2-13B-v11.1-GGUF/snapshots/ba7231efe4cdfc024950da959c83827ee303296f/openbuddy-llama2-13b-v11.1.Q3_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q3_K     [  5120, 37632,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q5_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q5_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor    8:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   10:              blk.1.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   11:              blk.1.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.attn_v.weight q5_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   13:         blk.1.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   16:            blk.1.ffn_down.weight q5_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   17:           blk.1.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   19:              blk.2.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   20:              blk.2.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   21:              blk.2.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   22:         blk.2.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   23:            blk.2.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   24:              blk.2.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   25:            blk.2.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   26:           blk.2.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   28:              blk.3.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   29:              blk.3.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   30:              blk.3.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   31:         blk.3.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   32:            blk.3.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   33:              blk.3.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   34:            blk.3.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   35:           blk.3.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   37:              blk.4.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   38:              blk.4.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   39:              blk.4.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   40:         blk.4.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   41:            blk.4.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   42:              blk.4.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   43:            blk.4.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   44:           blk.4.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   46:              blk.5.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   47:              blk.5.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   48:              blk.5.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   49:         blk.5.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   50:            blk.5.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   51:              blk.5.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   52:            blk.5.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   53:           blk.5.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   54:            blk.5.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   55:              blk.6.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   56:              blk.6.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   57:              blk.6.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   58:         blk.6.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   59:            blk.6.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   60:              blk.6.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   61:            blk.6.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   62:           blk.6.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   63:            blk.6.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   64:              blk.7.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   65:              blk.7.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   66:              blk.7.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   67:         blk.7.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   68:            blk.7.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   69:              blk.7.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   70:            blk.7.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   71:           blk.7.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   72:            blk.7.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   73:              blk.8.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   74:              blk.8.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   75:              blk.8.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   76:         blk.8.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   77:            blk.8.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   78:              blk.8.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   79:            blk.8.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   80:           blk.8.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   81:            blk.8.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   82:              blk.9.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   83:              blk.9.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   84:              blk.9.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   85:         blk.9.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   86:            blk.9.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   87:              blk.9.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   88:            blk.9.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   89:           blk.9.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   90:            blk.9.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   91:             blk.10.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   92:             blk.10.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   93:             blk.10.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   94:        blk.10.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   95:           blk.10.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   96:             blk.10.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   97:           blk.10.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   98:          blk.10.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.10.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  100:             blk.11.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  101:             blk.11.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  102:             blk.11.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  103:        blk.11.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  104:           blk.11.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  105:             blk.11.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  106:           blk.11.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  107:          blk.11.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  108:           blk.11.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  109:             blk.12.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  110:             blk.12.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  111:             blk.12.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  112:        blk.12.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  113:           blk.12.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  114:             blk.12.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  115:           blk.12.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  116:          blk.12.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  117:           blk.12.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  118:             blk.13.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  119:             blk.13.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  120:             blk.13.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  121:        blk.13.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  122:           blk.13.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  123:             blk.13.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  124:           blk.13.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  125:          blk.13.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  126:           blk.13.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  127:             blk.14.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  128:             blk.14.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  129:             blk.14.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  130:        blk.14.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  131:           blk.14.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  132:             blk.14.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  133:           blk.14.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  134:          blk.14.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  135:           blk.14.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  136:             blk.15.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  137:             blk.15.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  138:             blk.15.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  139:        blk.15.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  141:             blk.15.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  142:           blk.15.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  143:          blk.15.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  144:           blk.15.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  145:             blk.16.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  146:             blk.16.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  147:             blk.16.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  148:        blk.16.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  149:           blk.16.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  150:             blk.16.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  151:           blk.16.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  152:          blk.16.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  153:           blk.16.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  154:             blk.17.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  155:             blk.17.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  156:             blk.17.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  157:        blk.17.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  158:           blk.17.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  159:             blk.17.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  160:           blk.17.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  161:          blk.17.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  162:           blk.17.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  163:             blk.18.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  164:             blk.18.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  165:             blk.18.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  166:        blk.18.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  167:           blk.18.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  168:             blk.18.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  169:           blk.18.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  170:          blk.18.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  171:           blk.18.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  172:             blk.19.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  173:             blk.19.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  174:             blk.19.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  175:        blk.19.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  176:           blk.19.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  177:             blk.19.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  178:           blk.19.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  179:          blk.19.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  180:           blk.19.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  181:             blk.20.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  182:             blk.20.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  183:             blk.20.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  184:        blk.20.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  185:           blk.20.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  186:             blk.20.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  187:           blk.20.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  188:          blk.20.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  189:           blk.20.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  190:             blk.21.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  191:             blk.21.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  192:             blk.21.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  193:        blk.21.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  194:           blk.21.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  195:             blk.21.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  196:           blk.21.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  197:          blk.21.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  198:           blk.21.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  199:             blk.22.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  200:             blk.22.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  201:             blk.22.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  202:        blk.22.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  203:           blk.22.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  204:             blk.22.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  205:           blk.22.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  206:          blk.22.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  207:           blk.22.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  208:             blk.23.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  209:             blk.23.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  210:             blk.23.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  211:        blk.23.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  212:           blk.23.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  213:             blk.23.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  214:           blk.23.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  215:          blk.23.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  216:           blk.23.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  217:             blk.24.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  218:             blk.24.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  219:             blk.24.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  220:        blk.24.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  221:           blk.24.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  222:             blk.24.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  223:           blk.24.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  224:          blk.24.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  225:           blk.24.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  226:             blk.25.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  227:             blk.25.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  228:             blk.25.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  229:        blk.25.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  230:           blk.25.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  231:             blk.25.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  232:           blk.25.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  233:          blk.25.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  234:           blk.25.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  235:             blk.26.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  236:             blk.26.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  237:             blk.26.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  238:        blk.26.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  239:           blk.26.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  240:             blk.26.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  241:           blk.26.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  242:          blk.26.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  243:           blk.26.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  244:             blk.27.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  245:             blk.27.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  246:             blk.27.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  247:        blk.27.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  248:           blk.27.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  249:             blk.27.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  250:           blk.27.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  251:          blk.27.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  252:           blk.27.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  253:             blk.28.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  254:             blk.28.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  255:             blk.28.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  256:        blk.28.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  258:             blk.28.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  259:           blk.28.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  260:          blk.28.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  261:           blk.28.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  262:             blk.29.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  263:             blk.29.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  264:             blk.29.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  265:        blk.29.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  267:             blk.29.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  268:           blk.29.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  269:          blk.29.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  270:           blk.29.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  271:             blk.30.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  272:             blk.30.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  274:        blk.30.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  275:           blk.30.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  276:             blk.30.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  277:           blk.30.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  279:           blk.30.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  280:             blk.31.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  281:             blk.31.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  282:             blk.31.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  283:        blk.31.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  284:           blk.31.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  285:             blk.31.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  286:           blk.31.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  287:          blk.31.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  288:           blk.31.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  289:             blk.32.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  290:             blk.32.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  291:             blk.32.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  292:        blk.32.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  293:           blk.32.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  294:             blk.32.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  295:           blk.32.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  296:          blk.32.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  297:           blk.32.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  298:             blk.33.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  299:             blk.33.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  300:             blk.33.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  301:        blk.33.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  302:           blk.33.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  303:             blk.33.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  304:           blk.33.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  305:          blk.33.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  306:           blk.33.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  307:             blk.34.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  308:             blk.34.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  309:             blk.34.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  310:        blk.34.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  311:           blk.34.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  312:             blk.34.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  313:           blk.34.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  314:          blk.34.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  315:           blk.34.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  316:             blk.35.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  317:             blk.35.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  318:             blk.35.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  319:        blk.35.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  320:           blk.35.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  321:             blk.35.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  322:           blk.35.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  323:          blk.35.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  324:           blk.35.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  325:             blk.36.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  326:             blk.36.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  327:             blk.36.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  328:        blk.36.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  329:           blk.36.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  330:             blk.36.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  331:           blk.36.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  332:          blk.36.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  333:           blk.36.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  334:             blk.37.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  335:             blk.37.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  336:             blk.37.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  337:        blk.37.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  338:           blk.37.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  339:             blk.37.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  340:           blk.37.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  341:          blk.37.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  342:           blk.37.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  343:             blk.38.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  344:             blk.38.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  345:             blk.38.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  346:        blk.38.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  347:           blk.38.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  348:             blk.38.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  349:           blk.38.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  350:          blk.38.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  351:           blk.38.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  352:             blk.39.attn_q.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  353:             blk.39.attn_k.weight q3_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  354:             blk.39.attn_v.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  355:        blk.39.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  356:           blk.39.ffn_gate.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  357:             blk.39.ffn_up.weight q3_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  358:           blk.39.ffn_down.weight q4_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  359:          blk.39.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  360:           blk.39.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  361:               output_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  362:                    output.weight q6_K     [  5120, 37632,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  18:               general.quantization_version u32     
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q3_K:  161 tensors
llama_model_loader: - type q4_K:  116 tensors
llama_model_loader: - type q5_K:    4 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 37632
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 13824
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = mostly Q3_K - Medium
llm_load_print_meta: model size     = 13.07 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   79.07 MB (+  400.00 MB per state)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 6399 MB
..................................................................................................
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size =   84.97 MB
llama_new_context_with_model: VRAM scratch buffer: 83.50 MB

system_info: n_threads = 2 / 2 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

 Building a website can be done in 10 simple steps:
Step 1: Determine the purpose of your website. Decide what you want to achieve with your website, whether it’s for business or personal use. This will help guide the design and content of your website.
Step 2: Choose a domain name. Your domain name should be easy to remember, relevant to your purpose, and available as a web address.
Step 3: Select a hosting provider. You need a reliable hosting provider to store your website files and make them accessible to the public.
Step 4: Create the structure of your website. This includes deciding on the pages you’ll need (homepage, about us, services/products, etc.) and how they will be linked together.
Step 5: Write the content. Your website’s content should inform, engage, or persuade your audience. Use clear language and make sure your content is easy to read.
Step 6: Design the layout and visual elements. Choose a color scheme, fonts, images, and other design elements that align with your purpose and brand identity.
Step 7: Test your website. Before launching your website, test it on different devices and browsers to make sure it’s user-friendly and accessible.
Step 8: Launch your website. Once you’re satisfied with the design and content of your website, publish it online for everyone to see.
Step 9: Maintain and update your website. Regularly update your website with fresh content, new features, or changes in your business. This will keep your audience engaged and interested in what you have to offer.
Step 10: Promote your website. Use various marketing strategies such as social media, email marketing, and SEO to attract visitors to your website.
 [end of text]

llama_print_timings:        load time =  2413.69 ms
llama_print_timings:      sample time =   485.07 ms /   373 runs   (    1.30 ms per token,   768.96 tokens per second)
llama_print_timings: prompt eval time =   437.00 ms /    19 tokens (   23.00 ms per token,    43.48 tokens per second)
llama_print_timings:        eval time = 25200.12 ms /   372 runs   (   67.74 ms per token,    14.76 tokens per second)
llama_print_timings:       total time = 26347.50 ms
Log end
vasant1712 commented 1 year ago

To build libllama.so with GPU support you need to have the CUDA SDK installed, then:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
export CUDA_HOME=/your/cuda/home/path/here
export LLAMA_CUBLAS=on
make libllama.so

Then note that the g++ compiler will add the -DGGML_USE_CUBLAS compiler flag, and it will create a file called libllama.so in the current directory. Check it with:

ls -l libllama.so

After that you can force llama-cpp-python to use that lib with:

export LLAMA_CPP_LIB=/path/to/your/libllama.so

After that, it worked with GPU support here. Of course you have to init your model with something like

llm = Llama(
    ...
    n_gpu_layers=20,
    ...
)

Hope it helps.

Thanks @glaudiston, it's working perfectly.

I followed all these steps, but I am facing this issue. I am using llama-cpp-python from LangChain:

export LLAMA_CPP_LIB=/path/to/your/libllama.so
RuntimeError: Failed to load shared library '/home/vasant/pythonV/stream/final/final_bot/llama.cpp/libllama.so': /home/vasant/pythonV/stream/final/final_bot/llama.cpp/libllama.so: undefined symbol: ggml_cuda_assign_buffers_force_inplace

I am using Ubuntu 22.04.

Is anyone else facing the same issue?

glaudiston commented 1 year ago

This is probably due to a dirty build. That symbol is generated only when building with GPU support. Try a

make clean

Also make sure nvcc is on your PATH (for example by adding ${CUDA_HOME}/bin to your PATH environment variable) and try again.
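For reference, here is a minimal sketch of the same setup driven from Python rather than the shell. The paths below are placeholders, and it assumes (as described above) that llama-cpp-python reads LLAMA_CPP_LIB when the module is imported:

import os

# The override must be in place before llama_cpp is imported, because the
# shared library is located and loaded at import time.
os.environ["LLAMA_CPP_LIB"] = "/path/to/llama.cpp/libllama.so"  # placeholder path

from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder model path
    n_gpu_layers=20,                   # same layer count as the example above
)
print(llm("Q: What is the capital of France? A:", max_tokens=16)["choices"][0]["text"])

Running this exercises the rebuilt library directly, which makes it easier to tell whether an undefined-symbol error comes from the build itself or from a stale install.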

vasant1712 commented 1 year ago

Thanks for your kind response. I used your advice and got it working by reinstalling llama-cpp-python with these variables:

CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCUDA_PATH=/usr/local/cuda-12.2 -DCUDAToolkit_ROOT=/usr/local/cuda-12.2 -DCUDAToolkit_INCLUDE_DIR=/usr/local/cuda-12/include -DCUDAToolkit_LIBRARY_DIR=/usr/local/cuda-12.2/lib64 -DCMAKE_CUDA_COMPILER:PATH=/usr/local/cuda/bin/nvcc" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir --verbose

JeisonJimenezA commented 1 year ago

Sorry, if I am using Windows, what procedure should I follow to be able to use the GPU with llama.cpp? I would think the procedure varies. Thank you very much for any help.

glaudiston commented 1 year ago

You can use WSL2 on Windows, and it should work as if you were using Linux.

rigvedrs commented 9 months ago

This method worked for me.

First, install using:

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
!git clone https://github.com/ggerganov/llama.cpp.git

Then install the NVIDIA CUDA toolkit again if it shows errors related to CUDA:

!sudo apt install nvidia-cuda-toolkit

Ali619 commented 6 months ago

To build libllama.so with GPU support you need to have the CUDA SDK installed, then:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
export CUDA_HOME=/your/cuda/home/path/here
export PATH=${CUDA_HOME}/bin:$PATH
export LLAMA_CUBLAS=on
make clean
make libllama.so

Then note that the g++ compiler will add the -DGGML_USE_CUBLAS compiler flag, and it will create a file called libllama.so in the current directory. Check it with:

ls -l libllama.so

After that you can force llama-cpp-python to use that lib with:

export LLAMA_CPP_LIB=/path/to/your/libllama.so

After that, it worked with GPU support here. Of course you have to init your model with something like

llm = Llama(
    ...
    n_gpu_layers=20,
    ...
)

Hope it helps.

After two days of trying a lot of things, your solution fixed my problem. I want to mention that this works on WSL-2 with Ubuntu 24.04 LTS.

Thank you 🙌✌️

sarthak247 commented 4 months ago

Tried and tested as of 16th July, 2024.

The previous method mentioned by others should have worked; however, when I tried it I was met with an error that LLAMA_CUBLAS was deprecated and was being replaced by GGML_CUDA.

Also, I found out from here that one can pass build parameters to pip itself instead of setting them explicitly and then building from source. Here's what I did:

  • Verify the CUDA installation
nvcc --version
  • Set CUDA_HOME to the install location. In my case it was /usr/lib/cuda, so I used that, but for you it might be different
export CUDA_HOME=/usr/lib/cuda
  • Install with pip (note that instead of LLAMA_CUBLAS as mentioned in the answer here, I replaced it with GGML_CUDA, as per llama-cpp-python 0.2.82)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

The build will take some time, but after this I got GPU support with llama.cpp. Also, note that there is nothing wrong with the answer others mentioned (except for the GGML_CUDA part, which needs to be changed for new versions). Both methods are essentially the same thing; however, I found this one to be easier ❤️

Also, don't forget to export GGML_CUDA=on if you're building from source instead of LLAMA_CUBLAS=on.

z520520115 commented 2 months ago

This worked for me in v0.2.90, but CUDA reported an error:

CUDA error: the provided PTX was compiled with an unsupported toolchain.

So my solution was:

CUDACXX=/usr/local/cuda-12.4/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=all-major" FORCE_CMAKE=1 pip install llama-cpp-python[server] --upgrade --force-reinstall --no-cache-dir
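Whichever install route ends up working, one way to sanity-check the result is to load any GGUF model with all layers offloaded and verbose output, then look for the same loader lines shown earlier in this thread. A sketch (the model path is a placeholder):

from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder path to any GGUF model
    n_gpu_layers=-1,                   # -1 asks llama.cpp to offload every layer it can
    verbose=True,                      # loader output is printed to stderr
)
# With a CUDA-enabled build the log should contain lines like:
#   llm_load_tensors: using CUDA for GPU acceleration
#   llm_load_tensors: offloaded 43/43 layers to GPU
# and nvidia-smi should show this process holding several GB of VRAM.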

davidgilbertson commented 1 month ago

I installed (Linux) with the CUDA wheels:

pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

This installs but fails to use the GPU.

I then tried the above steps, building libllama.so. The build ran fine, but still it wouldn't use the GPU.

I also tried setting LLAMA_CPP_LIB to site-packages/lib/libllama.so, a file that I assume the install created. Again, no GPU use.

I then uninstalled and reinstalled with CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python. This worked out of the box, without needing to set LLAMA_CPP_LIB or build libllama.so.

It would be great if this package produced some feedback like "hey, you've set n_gpu_layers so it seems like you want to use the GPU, but ____, so you'll need to fix that for the GPU to be used" (ideally, not buried in a 200-line dump of text that's output when the model runs).
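As a rough sketch of the kind of feedback being asked for here: recent llama-cpp-python releases expose the low-level llama_supports_gpu_offload() binding, which reports whether the installed build was compiled with GPU support (treat the exact symbol as an assumption to verify against your installed version):

import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
if llama_cpp.llama_supports_gpu_offload():
    print("This build can offload layers to the GPU.")
else:
    print("This build is CPU-only, so n_gpu_layers will have no effect; "
          'reinstall with CMAKE_ARGS="-DGGML_CUDA=on" to enable offloading.')

A check like this could run whenever n_gpu_layers is non-zero and warn early, instead of the information being buried in the loader log.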