xajanix opened this issue 1 year ago
The same thing happens to me. I'm using Google Colab; it works fine on CPU, but the side effect is slow responses. Despite applying the following commands:
CMAKE_ARGS="-DLLAMA_CLBLAST=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
pip install "llama-cpp-python[server]" --no-cache-dir
python -m llama_cpp.server --model /content/model/llama-2-13b-chat.ggmlv3.q2_K.bin --host 127.0.0.1 --port 8889 --n_gpu_layers 70 --n_ctx 4096
There is still no GPU offloading.
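A quick way to check whether the wheel you installed was actually built with GPU/BLAS support (a minimal sketch, not from this thread; it assumes your version exposes the low-level binding llama_cpp.llama_print_system_info()):
import llama_cpp

# Print the compile-time feature flags of the loaded libllama shared library.
# A CPU-only build reports "BLAS = 0", in which case --n_gpu_layers is ignored.
info = llama_cpp.llama_print_system_info()
if isinstance(info, bytes):
    info = info.decode("utf-8", errors="replace")
print(info)
print("built with BLAS/GPU support:", "BLAS = 1" in info)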
CPU inference works fine with the latest version of the server, but the moment I offload layers to the GPU I get gibberish. Something is going wrong in the GPU offloading of layers.
I was able to make it work by setting LLAMA_CPP_LIB to point to a libllama.so file compiled with GGML_USE_CUBLAS.
Can you give some more detail, i.e. where and how to use LLAMA_CPP_LIB and GGML_USE_CUBLAS? I only know Python, and I'm not able to find any reference to LLAMA_CPP_LIB. Could you kindly confirm the full command, like the following?
make clean && GGML_USE_CUBLAS=1 make libllama.so
To build libllama.so with GPU support you need to have the CUDA SDK installed, then:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
export CUDA_HOME=/your/cuda/home/path/here
export PATH=${CUDA_HOME}/bin:$PATH
export LLAMA_CUBLAS=on
make clean
make libllama.so
Then note that the g++ compiler will add the -DGGML_USE_CUBLAS compiler flag, and it will create a file called libllama.so in the current directory. Check it with:
ls -l libllama.so
After that you can force llama-cpp-python to use that lib with:
export LLAMA_CPP_LIB=/path/to/your/libllama.so
After that, it worked with GPU support here. Of course you have to init your model with something like
llm = Llama(
    ...
    n_gpu_layers=20,
    ...
)
Hope it helps.
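For reference, the same thing can be done from inside Python instead of the shell (a minimal sketch based on the steps above; the model path and layer count are placeholders). The variable just has to be set before llama_cpp is imported:
import os

# LLAMA_CPP_LIB must be set before importing llama_cpp, otherwise the
# library bundled with the pip package is loaded instead of the cuBLAS build.
os.environ["LLAMA_CPP_LIB"] = "/path/to/your/libllama.so"

from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/your/model.bin",  # placeholder model path
    n_gpu_layers=20,                       # offload 20 layers to the GPU
)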
Thanks @glaudiston, it's working perfectly.
@glaudiston, thanks for the response. I was under the impression that if I use "LLAMA_CUBLAS=1 pip install llama-cpp-python", it would take care of finding the .so library.
After make libllama.so it gives me the error "LLAMA_ASSERT: llama.cpp:1800: !!kv_self.ctx". How do I solve it? The command is:
python -m llama_cpp.server --model model/ggml-model-f16-daogou.bin --port 7777 --host 127.0.0.1 --n_gpu_layers 32 --n_ctx 2048
The layers are not offloaded to the GPU.
I'm getting the same error. Did you find a solution?
@kdubba, I understand that it does and even compiles it (I think). But it failed to build with the GPU flags described on the project page for some reason. What I did was manually set it to one I built by hand. I expect not to need to do that when the devs fix that issue (not even sure if they still need to fix it).
@moseshu and @di-rse, the llama-cpp-python project is a binding to the https://github.com/ggerganov/llama.cpp project. You have a better chance of getting help by posting your issue there. You can use the solution I've provided here once you get that lib working on your GPU.
Thanks @glaudiston . The llama.cpp lib works absolutely fine with my GPU, so it's odd that the python binding is failing.
Thanks @glaudiston !!!
Well, I just wanted to run llama-cpp-python from the miniconda3 env of https://github.com/oobabooga/text-generation-webui.
In that case you can simply use:
export LLAMA_CPP_LIB=/yourminicondapath/miniconda3/lib/python3.10/site-packages/llama_cpp_cuda/libllama.so
before running your jupyter-notebook, ipython, python or whatever. In my case I added it to my .bashrc.
Voilà!!!!
On importing:
from llama_cpp import Llama
I get:
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1060, compute capability 6.1
And on:
llm = Llama(model_path="/mnt/LxData/llama.cpp/models/meta-llama2/llama-2-7b-chat/ggml-model-q4_0.bin",
            n_gpu_layers=28, n_threads=6, n_ctx=3584, n_batch=521, verbose=True)
...
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2381.32 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 28 repeating layers to GPU
llama_model_load_internal: offloaded 28/35 layers to GPU
llama_model_load_internal: total VRAM used: 3521 MB
...
Replacing libllama.so does not work for me. My llama-cpp-python hangs forever:
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 70.41 MB (+ 50.00 MB per state)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 3628 MB
..................................................................................................
llama_new_context_with_model: kv self size = 50.00 MB
llama_new_context_with_model: compute buffer total size = 15.24 MB
llama_new_context_with_model: VRAM scratch buffer: 13.77 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
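When it hangs like this, one thing worth ruling out (a hypothetical debugging step, not something suggested in the thread; the path below is a placeholder) is whether the replacement libllama.so and its CUDA dependencies even load cleanly on their own, outside of llama-cpp-python:
import ctypes

lib_path = "/path/to/your/libllama.so"  # placeholder path to the custom build

# Try to dlopen the library directly. A missing or mismatched CUDA runtime /
# cuBLAS dependency shows up here as an immediate OSError rather than a hang.
try:
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
    print("loaded OK:", lib_path)
except OSError as err:
    print("failed to load:", err)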
I have the same issue: llama.cpp works with the GPU but llama-cpp-python doesn't. One thing I found is that the params printed on the console are different when using the llama.cpp CLI vs llama-cpp-python. Note the BLAS param in the output.
Download model from HF
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="TheBloke/OpenBuddy-Llama2-13B-v11.1-GGUF", filename="openbuddy-llama2-13b-v11.1.Q3_K_M.gguf")
llama.cpp
system_info: n_threads = 2 / 2 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
llama-cpp-python
| FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
llama-cpp-python compilation and invocation:
!cd /content/llama.cpp && export LLAMA_CUBLAS=1 && make clean && make libllama.so
!pip uninstall llama-cpp-python -y
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
import os
os.environ['LLAMA_CPP_LIB']='/content/llama.cpp/libllama.so'
os.environ['LLAMA_CUBLAS']='on'
from llama_cpp import Llama
model_path = '/root/.cache/huggingface/hub/models--TheBloke--OpenBuddy-Llama2-13B-v11.1-GGUF/snapshots/ba7231efe4cdfc024950da959c83827ee303296f/openbuddy-llama2-13b-v11.1.Q3_K_M.gguf'
llm = Llama(model_path=model_path, n_gpu_layers=100)
Output:
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native
I CXXFLAGS: -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation
I LDFLAGS: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
rm -vrf *.o tests/*.o *.so *.dll benchmark-matmult build-info.h *.dot *.gcno tests/*.gcno *.gcda tests/*.gcda *.gcov tests/*.gcov lcov-report gcovr-report main quantize quantize-stats perplexity embedding vdot train-text-from-scratch convert-llama2c-to-ggml simple save-load-state server embd-input-test gguf llama-bench baby-llama beam-search tests/test-c.o metal tests/test-llama-grammar tests/test-grammar-parser tests/test-double-float tests/test-grad0 tests/test-opt tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0-llama tests/test-tokenizer-0-falcon tests/test-tokenizer-1
removed 'common.o'
removed 'console.o'
removed 'ggml-alloc.o'
removed 'ggml-cuda.o'
removed 'ggml.o'
removed 'grammar-parser.o'
removed 'k_quants.o'
removed 'llama.o'
removed 'tests/test-c.o'
removed 'libembdinput.so'
removed 'build-info.h'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'
removed 'vdot'
removed 'train-text-from-scratch'
removed 'convert-llama2c-to-ggml'
removed 'simple'
removed 'save-load-state'
removed 'server'
removed 'embd-input-test'
removed 'gguf'
removed 'llama-bench'
removed 'baby-llama'
removed 'beam-search'
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native
I CXXFLAGS: -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation
I LDFLAGS: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -c llama.cpp -o llama.o
cc -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native -c ggml.c -o ggml.o
cc -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native -c k_quants.c -o k_quants.o
nvcc --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu: In function ‘void ggml_cuda_op_alibi(const ggml_tensor*, const ggml_tensor*, ggml_tensor*, char*, float*, float*, float*, int64_t, int64_t, int64_t, int, CUstream_st*&)’:
ggml-cuda.cu:5711:58: warning: unused parameter ‘i02’ [-Wunused-parameter]
5711 | float * src0_ddf_i, float * src1_ddf_i, float * dst_ddf_i, int64_t i02, int64_t i01_low, int64_t i01_high, int i1,
| ~~~~~~~~^~~
cc -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native -c ggml-alloc.c -o ggml-alloc.o
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -shared -fPIC -o libllama.so llama.o ggml.o k_quants.o ggml-cuda.o ggml-alloc.o -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
Found existing installation: llama-cpp-python 0.1.83
Uninstalling llama-cpp-python-0.1.83:
Would remove:
/usr/local/lib/python3.10/dist-packages/llama_cpp/*
/usr/local/lib/python3.10/dist-packages/llama_cpp_python-0.1.83.dist-info/*
Proceed (Y/n)? y
Successfully uninstalled llama-cpp-python-0.1.83
Collecting llama-cpp-python
Using cached llama_cpp_python-0.1.83-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: typing-extensions>=4.5.0 in /usr/local/lib/python3.10/dist-packages (from llama-cpp-python) (4.7.1)
Requirement already satisfied: numpy>=1.20.0 in /usr/local/lib/python3.10/dist-packages (from llama-cpp-python) (1.23.5)
Requirement already satisfied: diskcache>=5.6.1 in /usr/local/lib/python3.10/dist-packages (from llama-cpp-python) (5.6.3)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.83
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
llama.cpp compilation and invocation:
!cd /content/llama.cpp && export LLAMA_CUBLAS=1 && make clean && make
!/content/llama.cpp/main -m /root/.cache/huggingface/hub/models--TheBloke--OpenBuddy-Llama2-13B-v11.1-GGUF/snapshots/ba7231efe4cdfc024950da959c83827ee303296f/openbuddy-llama2-13b-v11.1.Q3_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 100
Output:
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native
I CXXFLAGS: -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation
I LDFLAGS: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
rm -vrf *.o tests/*.o *.so *.dll benchmark-matmult build-info.h *.dot *.gcno tests/*.gcno *.gcda tests/*.gcda *.gcov tests/*.gcov lcov-report gcovr-report main quantize quantize-stats perplexity embedding vdot train-text-from-scratch convert-llama2c-to-ggml simple save-load-state server embd-input-test gguf llama-bench baby-llama beam-search tests/test-c.o metal tests/test-llama-grammar tests/test-grammar-parser tests/test-double-float tests/test-grad0 tests/test-opt tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0-llama tests/test-tokenizer-0-falcon tests/test-tokenizer-1
removed 'ggml-alloc.o'
removed 'ggml-cuda.o'
removed 'ggml.o'
removed 'k_quants.o'
removed 'llama.o'
removed 'libllama.so'
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native
I CXXFLAGS: -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation
I LDFLAGS: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
cc -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native -c ggml.c -o ggml.o
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -c llama.cpp -o llama.o
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -c common/common.cpp -o common.o
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -c common/console.cpp -o console.o
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -c common/grammar-parser.cpp -o grammar-parser.o
cc -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native -c k_quants.c -o k_quants.o
nvcc --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu: In function ‘void ggml_cuda_op_alibi(const ggml_tensor*, const ggml_tensor*, ggml_tensor*, char*, float*, float*, float*, int64_t, int64_t, int64_t, int, CUstream_st*&)’:
ggml-cuda.cu:5711:58: warning: unused parameter ‘i02’ [-Wunused-parameter]
5711 | float * src0_ddf_i, float * src1_ddf_i, float * dst_ddf_i, int64_t i02, int64_t i01_low, int64_t i01_high, int i1,
| ~~~~~~~~^~~
cc -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native -c ggml-alloc.c -o ggml-alloc.o
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/main/main.cpp ggml.o llama.o common.o console.o grammar-parser.o k_quants.o ggml-cuda.o ggml-alloc.o -o main -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
==== Run ./main -h for help. ====
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/quantize/quantize.cpp ggml.o llama.o k_quants.o ggml-cuda.o ggml-alloc.o -o quantize -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/quantize-stats/quantize-stats.cpp ggml.o llama.o k_quants.o ggml-cuda.o ggml-alloc.o -o quantize-stats -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/perplexity/perplexity.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o perplexity -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/embedding/embedding.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o embedding -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation pocs/vdot/vdot.cpp ggml.o k_quants.o ggml-cuda.o ggml-alloc.o -o vdot -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o train-text-from-scratch -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
examples/train-text-from-scratch/train-text-from-scratch.cpp: In function ‘ggml_tensor* llama_build_train_graphs(my_llama_model*, ggml_allocr*, ggml_context*, ggml_cgraph*, ggml_cgraph*, ggml_cgraph*, ggml_tensor**, ggml_tensor*, ggml_tensor*, int, int, bool, bool)’:
examples/train-text-from-scratch/train-text-from-scratch.cpp:739:68: warning: ‘kv_scale’ may be used uninitialized in this function [-Wmaybe-uninitialized]
739 | struct ggml_tensor * t16_1 = ggml_scale_inplace (ctx, t16_0, kv_scale); set_name(t16_1, "t16_1"); assert_shape_4d(t16_1, N, N, n_head, n_batch);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp ggml.o llama.o k_quants.o ggml-cuda.o ggml-alloc.o -o convert-llama2c-to-ggml -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/simple/simple.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o simple -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/save-load-state/save-load-state.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o save-load-state -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation -Iexamples/server examples/server/server.cpp ggml.o llama.o common.o grammar-parser.o k_quants.o ggml-cuda.o ggml-alloc.o -o server -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ --shared -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/embd-input/embd-input-lib.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o libembdinput.so -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/embd-input/embd-input-test.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o embd-input-test -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib -L. -lembdinput
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/gguf/gguf.cpp ggml.o llama.o k_quants.o ggml-cuda.o ggml-alloc.o -o gguf -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/llama-bench/llama-bench.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o llama-bench -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/baby-llama/baby-llama.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o baby-llama -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -Wno-format-truncation examples/beam-search/beam-search.cpp ggml.o llama.o common.o k_quants.o ggml-cuda.o ggml-alloc.o -o beam-search -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
cc -I. -Icommon -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -O3 -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Wno-unused-function -pthread -march=native -mtune=native -c tests/test-c.c -o tests/test-c.o
Log start
main: build = 1170 (47068e5)
main: seed = 1693756172
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /root/.cache/huggingface/hub/models--TheBloke--OpenBuddy-Llama2-13B-v11.1-GGUF/snapshots/ba7231efe4cdfc024950da959c83827ee303296f/openbuddy-llama2-13b-v11.1.Q3_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q3_K [ 5120, 37632, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 13: blk.1.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 14: blk.1.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 15: blk.1.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 16: blk.1.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 19: blk.2.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 20: blk.2.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 21: blk.2.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 22: blk.2.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 23: blk.2.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 24: blk.2.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 25: blk.2.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 28: blk.3.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 29: blk.3.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 30: blk.3.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 31: blk.3.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 32: blk.3.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 33: blk.3.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 34: blk.3.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 37: blk.4.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 38: blk.4.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 39: blk.4.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 40: blk.4.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 41: blk.4.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 42: blk.4.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 43: blk.4.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 46: blk.5.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 47: blk.5.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 48: blk.5.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 49: blk.5.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 50: blk.5.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 51: blk.5.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 52: blk.5.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 53: blk.5.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 54: blk.5.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 55: blk.6.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 56: blk.6.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 57: blk.6.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 58: blk.6.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 59: blk.6.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 60: blk.6.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 61: blk.6.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 62: blk.6.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 63: blk.6.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 64: blk.7.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 65: blk.7.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 66: blk.7.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 67: blk.7.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 68: blk.7.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 69: blk.7.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 70: blk.7.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 71: blk.7.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 72: blk.7.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 73: blk.8.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 74: blk.8.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 75: blk.8.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 76: blk.8.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 77: blk.8.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 78: blk.8.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 79: blk.8.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 80: blk.8.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 81: blk.8.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 82: blk.9.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 83: blk.9.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 84: blk.9.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 85: blk.9.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 86: blk.9.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 87: blk.9.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 88: blk.9.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 89: blk.9.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 90: blk.9.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 91: blk.10.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 92: blk.10.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 93: blk.10.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 94: blk.10.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 95: blk.10.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 96: blk.10.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 97: blk.10.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 98: blk.10.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 99: blk.10.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 100: blk.11.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 101: blk.11.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 102: blk.11.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 103: blk.11.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 104: blk.11.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 105: blk.11.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 106: blk.11.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 107: blk.11.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 108: blk.11.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 109: blk.12.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 110: blk.12.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 111: blk.12.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 112: blk.12.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 113: blk.12.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 114: blk.12.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 115: blk.12.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 116: blk.12.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 117: blk.12.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 118: blk.13.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 119: blk.13.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 120: blk.13.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 121: blk.13.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 122: blk.13.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 123: blk.13.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 124: blk.13.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 127: blk.14.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 128: blk.14.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 129: blk.14.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 130: blk.14.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 131: blk.14.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 132: blk.14.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 133: blk.14.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 134: blk.14.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 135: blk.14.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 136: blk.15.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 137: blk.15.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 138: blk.15.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 139: blk.15.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 140: blk.15.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 141: blk.15.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 142: blk.15.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 143: blk.15.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 144: blk.15.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 145: blk.16.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 146: blk.16.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 147: blk.16.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 148: blk.16.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 149: blk.16.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 150: blk.16.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 151: blk.16.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 152: blk.16.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 153: blk.16.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 154: blk.17.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 155: blk.17.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 156: blk.17.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 157: blk.17.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 158: blk.17.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 159: blk.17.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 160: blk.17.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 161: blk.17.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 162: blk.17.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 163: blk.18.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 164: blk.18.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 165: blk.18.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 166: blk.18.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 167: blk.18.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 168: blk.18.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 169: blk.18.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 170: blk.18.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 171: blk.18.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 172: blk.19.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 173: blk.19.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 174: blk.19.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 175: blk.19.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 176: blk.19.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 177: blk.19.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 178: blk.19.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 179: blk.19.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 180: blk.19.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 181: blk.20.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 182: blk.20.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 183: blk.20.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 184: blk.20.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 185: blk.20.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 186: blk.20.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 187: blk.20.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 188: blk.20.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 189: blk.20.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 190: blk.21.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 191: blk.21.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 192: blk.21.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 193: blk.21.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 194: blk.21.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 195: blk.21.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 196: blk.21.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 197: blk.21.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 198: blk.21.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 199: blk.22.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 200: blk.22.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 201: blk.22.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 202: blk.22.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 203: blk.22.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 204: blk.22.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 205: blk.22.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 206: blk.22.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 207: blk.22.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 208: blk.23.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 209: blk.23.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 210: blk.23.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 211: blk.23.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 212: blk.23.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 213: blk.23.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 214: blk.23.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 215: blk.23.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 216: blk.23.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 217: blk.24.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 218: blk.24.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 219: blk.24.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 220: blk.24.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 221: blk.24.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 222: blk.24.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 223: blk.24.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 224: blk.24.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 225: blk.24.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 226: blk.25.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 227: blk.25.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 228: blk.25.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 229: blk.25.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 230: blk.25.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 231: blk.25.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 232: blk.25.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 233: blk.25.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 234: blk.25.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 235: blk.26.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 236: blk.26.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 237: blk.26.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 238: blk.26.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 239: blk.26.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 240: blk.26.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 241: blk.26.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 242: blk.26.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 243: blk.26.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 244: blk.27.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 245: blk.27.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 246: blk.27.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 247: blk.27.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 248: blk.27.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 249: blk.27.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 250: blk.27.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 251: blk.27.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 252: blk.27.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 253: blk.28.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 254: blk.28.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 255: blk.28.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 256: blk.28.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 257: blk.28.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 258: blk.28.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 259: blk.28.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 260: blk.28.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 261: blk.28.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 262: blk.29.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 263: blk.29.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 264: blk.29.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 265: blk.29.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 266: blk.29.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 267: blk.29.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 268: blk.29.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 269: blk.29.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 270: blk.29.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 271: blk.30.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 272: blk.30.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 273: blk.30.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 274: blk.30.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 275: blk.30.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 276: blk.30.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 277: blk.30.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 278: blk.30.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 279: blk.30.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 280: blk.31.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 281: blk.31.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 282: blk.31.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 283: blk.31.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 284: blk.31.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 285: blk.31.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 286: blk.31.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 287: blk.31.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 288: blk.31.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 289: blk.32.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 290: blk.32.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 291: blk.32.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 292: blk.32.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 293: blk.32.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 294: blk.32.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 295: blk.32.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 296: blk.32.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 297: blk.32.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 298: blk.33.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 299: blk.33.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 300: blk.33.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 301: blk.33.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 302: blk.33.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 303: blk.33.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 304: blk.33.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 305: blk.33.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 306: blk.33.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 307: blk.34.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 308: blk.34.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 309: blk.34.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 310: blk.34.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 311: blk.34.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 312: blk.34.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 313: blk.34.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 314: blk.34.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 315: blk.34.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 316: blk.35.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 317: blk.35.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 318: blk.35.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 319: blk.35.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 320: blk.35.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 321: blk.35.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 322: blk.35.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 323: blk.35.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 324: blk.35.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 325: blk.36.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 326: blk.36.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 327: blk.36.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 328: blk.36.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 329: blk.36.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 330: blk.36.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 331: blk.36.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 332: blk.36.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 333: blk.36.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 334: blk.37.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 335: blk.37.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 336: blk.37.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 337: blk.37.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 338: blk.37.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 339: blk.37.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 340: blk.37.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 341: blk.37.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 342: blk.37.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 343: blk.38.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 344: blk.38.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 345: blk.38.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 346: blk.38.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 347: blk.38.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 348: blk.38.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 349: blk.38.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 350: blk.38.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 351: blk.38.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 352: blk.39.attn_q.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 353: blk.39.attn_k.weight q3_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 354: blk.39.attn_v.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 355: blk.39.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 356: blk.39.ffn_gate.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 357: blk.39.ffn_up.weight q3_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 358: blk.39.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 359: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 360: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 361: output_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 362: output.weight q6_K [ 5120, 37632, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: general.file_type u32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q3_K: 161 tensors
llama_model_loader: - type q4_K: 116 tensors
llama_model_loader: - type q5_K: 4 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 37632
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 512
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q3_K - Medium
llm_load_print_meta: model size = 13.07 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 79.07 MB (+ 400.00 MB per state)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 6399 MB
..................................................................................................
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 84.97 MB
llama_new_context_with_model: VRAM scratch buffer: 83.50 MB
system_info: n_threads = 2 / 2 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
Building a website can be done in 10 simple steps:
Step 1: Determine the purpose of your website. Decide what you want to achieve with your website, whether it’s for business or personal use. This will help guide the design and content of your website.
Step 2: Choose a domain name. Your domain name should be easy to remember, relevant to your purpose, and available as a web address.
Step 3: Select a hosting provider. You need a reliable hosting provider to store your website files and make them accessible to the public.
Step 4: Create the structure of your website. This includes deciding on the pages you’ll need (homepage, about us, services/products, etc.) and how they will be linked together.
Step 5: Write the content. Your website’s content should inform, engage, or persuade your audience. Use clear language and make sure your content is easy to read.
Step 6: Design the layout and visual elements. Choose a color scheme, fonts, images, and other design elements that align with your purpose and brand identity.
Step 7: Test your website. Before launching your website, test it on different devices and browsers to make sure it’s user-friendly and accessible.
Step 8: Launch your website. Once you’re satisfied with the design and content of your website, publish it online for everyone to see.
Step 9: Maintain and update your website. Regularly update your website with fresh content, new features, or changes in your business. This will keep your audience engaged and interested in what you have to offer.
Step 10: Promote your website. Use various marketing strategies such as social media, email marketing, and SEO to attract visitors to your website.
[end of text]
llama_print_timings: load time = 2413.69 ms
llama_print_timings: sample time = 485.07 ms / 373 runs ( 1.30 ms per token, 768.96 tokens per second)
llama_print_timings: prompt eval time = 437.00 ms / 19 tokens ( 23.00 ms per token, 43.48 tokens per second)
llama_print_timings: eval time = 25200.12 ms / 372 runs ( 67.74 ms per token, 14.76 tokens per second)
llama_print_timings: total time = 26347.50 ms
Log end
I followed all these steps, but I am facing this issue. I am using llama-cpp-python from LangChain:
export LLAMA_CPP_LIB=/path/to/your/libllama.so
RuntimeError: Failed to load shared library '/home/vasant/pythonV/stream/final/final_bot/llama.cpp/libllama.so': /home/vasant/pythonV/stream/final/final_bot/llama.cpp/libllama.so: undefined symbol: ggml_cuda_assign_buffers_force_inplace
I am using Ubuntu 22.04.
Is anyone else facing the same issue?
This is probably due to a dirty build. That symbol is only generated when building with GPU support. Try a make clean first.
Also make sure nvcc is in your path, by adding ${CUDA_HOME}/bin to your PATH environment variable, and try again.
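For reference, a minimal sketch of the clean rebuild, assuming the CUDA toolkit is installed at /usr/local/cuda (adjust the path to your own setup):
export CUDA_HOME=/usr/local/cuda      # assumed install location; use your own
export PATH=${CUDA_HOME}/bin:$PATH    # so make can find nvcc
export LLAMA_CUBLAS=on
cd llama.cpp
make clean                            # drop objects left over from a previous CPU-only build
make libllama.so
ls -l libllama.so                     # confirm the library was rebuilt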
Thanks for your kind response. I used your advice and got it working by reinstalling llama-cpp-python with these variables:
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCUDA_PATH=/usr/local/cuda-12.2 -DCUDAToolkit_ROOT=/usr/local/cuda-12.2 -DCUDAToolkit_INCLUDE_DIR=/usr/local/cuda-12/include -DCUDAToolkit_LIBRARY_DIR=/usr/local/cuda-12.2/lib64 -DCMAKE_CUDA_COMPILER:PATH=/usr/local/cuda/bin/nvcc" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir --verbose
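If it helps, here is a minimal sketch for checking the offload from Python, assuming a local model at ./models/model.gguf (a hypothetical path). With verbose=True, the load log should contain lines like "offloaded X/Y layers to GPU" and a non-zero "VRAM used":
from llama_cpp import Llama

# Hypothetical model path; point this at your own GGUF/GGML file.
llm = Llama(
    model_path="./models/model.gguf",
    n_gpu_layers=32,   # number of layers to push to the GPU
    n_ctx=2048,
    verbose=True,      # prints the load log, including the "offloaded ... layers to GPU" lines
)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])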
Sorry, if I am using Windows, what procedure should I follow to be able to use the GPU with llama.cpp? I would think the procedure varies. Thank you very much for any help.
You can use WSL2 on Windows, and it should work as if you were using Linux.
This method worked for me.
First, install using:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
!git clone https://github.com/ggerganov/llama.cpp.git
Then install the NVIDIA CUDA toolkit again if it shows errors related to CUDA:
!sudo apt install nvidia-cuda-toolkit
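Before running the install in Colab, it may be worth confirming that a GPU runtime is attached and that the CUDA compiler is visible (a quick sanity check, not part of the recipe above):
!nvidia-smi       # should list the attached GPU; fails on a CPU-only runtime
!nvcc --version   # confirms the CUDA toolkit is installed and on PATH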
After two days of trying a lot of things, @glaudiston's libllama.so + LLAMA_CPP_LIB solution fixed my problem. I want to mention that this works on WSL-2 with Ubuntu 24.04 LTS.
Thank you 🙌✌️
Tried and tested as of 16th July, 2024.
The previous method mentioned by others should have worked; however, when I tried it I was met with an error that LLAMA_CUBLAS was deprecated and was being replaced by GGML_CUDA.
Also, I found out from here that one can pass build parameters to pip itself instead of setting them explicitly and then building from source. Here's what I did:
- Verify the CUDA installation:
nvcc --version
- Set CUDA_HOME to the install location. In my case it was /usr/lib/cuda, so I used that, but for you it might be different:
export CUDA_HOME=/usr/lib/cuda
- Install with pip (however, note that instead of LLAMA_CUBLAS as mentioned in the answer here, I replaced it with GGML_CUDA, as per llama-cpp 0.2.82):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
The build will take some time, but after this I got GPU support with llama.cpp.
Also, note that there is nothing wrong with the answer others mentioned (except for the GGML_CUDA part, which needs to be changed for new versions). Both methods are essentially the same thing; however, I found this one to be easier :heart:
Also, don't forget to export GGML_CUDA=on if you're building from source instead of LLAMA_CUBLAS=on.
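The note above refers to the GGML_CUDA=on environment variable for Makefile builds; for completeness, a minimal sketch of the equivalent CMake route after the rename (one possible way, not the only one):
cd llama.cpp
cmake -B build -DGGML_CUDA=on          # GGML_CUDA replaces the older LLAMA_CUBLAS option
cmake --build build --config Release   # builds the CUDA-enabled library and binaries under ./build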
The CMAKE_ARGS="-DGGML_CUDA=on" install above worked for me in v0.2.90, but CUDA reported an error:
CUDA error: the provided PTX was compiled with an unsupported toolchain.
So my solution was:
CUDACXX=/usr/local/cuda-12.4/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=all-major" FORCE_CMAKE=1 pip install llama-cpp-python[server] --upgrade --force-reinstall --no-cache-dir
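That PTX error usually means the nvcc used for the build is newer than what the installed driver supports. A quick way to compare the two (an extra check, not from the comment above):
nvcc --version    # toolkit/compiler version that produced the PTX
nvidia-smi        # the "CUDA Version" in the header is the newest version the driver supports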
I installed (Linux) with the CUDA wheels:
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
This installs but fails to use the GPU.
I then tried the above steps, building libllama.so. The build ran fine, but still it wouldn't use the GPU.
I also tried setting LLAMA_CPP_LIB to site-packages/lib/libllama.so, a file that I assume the install created. Again, no GPU use.
I then uninstalled and reinstalled with CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python. This worked out of the box, without needing to set LLAMA_CPP_LIB or build libllama.so.
It would be great if this package produced some feedback like "hey, you've set n_gpu_layers so it seems like you want to use the GPU, but ____, so you'll need to fix that for the GPU to be used" (ideally, not buried in a 200-line dump of text that's output when the model runs).
Hello, I am a complete newbie when it comes to LLMs. I installed a GGML model in the oobabooga web UI and tried to use it. It works, but only on the CPU: it uses just 0.5 GB of VRAM, and I have no way to offload layers to the GPU; even adding the line "--n-gpu-layers 10" in the web UI doesn't work. So I started searching, and one of the answers was a command:
But that doesn't work for me. After pasting it I got:
And it completely broke the llama folder: it uninstalled it and did nothing more. I had to update the web UI to fix it and download llama.cpp again, since I have no other way to get it.
I also tried the compilation method, but that didn't work either. When I paste CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python in CMD (or the oobabooga CMD window), I always get this message:
or
The same goes for the "make" command: it is not recognized, even though I have make and CMake installed.
Also, when I launch the web UI and choose a GGML model, I get something like this in the console:
I am using Windows and an NVIDIA card.
Is there an easy solution to enable offloading layers to the GPU that doesn't require installing a ton of stuff?