abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io

CUDA error: the provided PTX was compiled with an unsupported toolchain #1064

Open kmlob opened 9 months ago

kmlob commented 9 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

Current Behavior

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

$ lscpu
Architecture: x86_64
CPU(s):       16

$ uname -a
6.1.57-gentoo-x86_64

$ python3 --version
Python 3.11.7

$ make --version
GNU Make 4.4.1

$ g++ --version
g++ (Gentoo 12.3.1_p20230825) 12.3.1 20230825

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0

Failure Information (for bugs)

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

1. mkdir temp && cd temp
2. python -m venv .venv
3. source .venv/bin/activate
4. CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
5. cat > test.py
6. python test.py

test.py:

from llama_cpp import Llama

# n_gpu_layers=1 offloads a single layer to the GPU, which is enough to hit the CUDA code path
llm = Llama(model_path="/path/to/models/model.gguf", n_gpu_layers=1)
output = llm(
    "Q: Name the planets in the solar system? A: ",
    max_tokens=32,
    stop=["Q:", "\n"],
    echo=True,
)
print(output)

On the same system, the following works fine:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
LLAMA_CUBLAS=1 make -j
./main -ngl 49 -m /path/to/models/model.gguf -p "Query ..."
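
For context, this particular CUDA error ("the provided PTX was compiled with an unsupported toolchain") usually indicates that the PTX embedded in the binary was produced by a CUDA toolkit newer than the installed driver can JIT-compile. A quick way to compare the two versions (assuming nvidia-smi is available) is:

$ nvcc --version | grep release    # toolkit used at build time (release 12.3 here)
$ nvidia-smi | head -n 4           # "CUDA Version" in the header is the newest CUDA the driver supports

If the driver reports an older CUDA version than the toolkit, updating the driver (or building for the GPU's actual compute capability so no PTX JIT is needed) may avoid the error.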

Failure Logs

When running "python test.py":

...
llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW) 
llm_load_print_meta: general.name     = codellama_codellama-7b-instruct-hf
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used  = 3767.63 MiB
llm_load_tensors: VRAM used           =  123.80 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/33 layers to GPU
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 73.72 MiB
llama_new_context_with_model: VRAM scratch buffer: 70.53 MiB
llama_new_context_with_model: total VRAM used: 194.34 MiB (model: 123.80 MiB, context: 70.53 MiB)
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
CUDA error: the provided PTX was compiled with an unsupported toolchain.
  current device: 0, in function ggml_cuda_op_flatten at /tmp/pip-install-umyi45gu/llama-cpp-python_e1128fb1d8634856a6cd0ff1f9664470/vendor/llama.cpp/ggml-cuda.cu:7959
  cudaGetLastError()
GGML_ASSERT: /tmp/pip-install-umyi45gu/llama-cpp-python_e1128fb1d8634856a6cd0ff1f9664470/vendor/llama.cpp/ggml-cuda.cu:225: !"CUDA error"

warning: ~/.gdbinit.local: No such file or directory
Warning: 'set logging on', an alias for the command 'set logging enabled', is deprecated.
Use 'set logging enabled on'.

Warning: 'set logging off', an alias for the command 'set logging enabled', is deprecated.
Use 'set logging enabled off'.

[New LWP 15761]
[New LWP 15762]
[New LWP 15763]
[New LWP 15764]
[New LWP 15765]
[New LWP 15766]
[New LWP 15767]
[New LWP 15768]
[New LWP 15769]
[New LWP 15770]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
-----------------------------------------------------------------------------------------------------------------------[regs]
  RAX: 0xFFFFFFFFFFFFFE00  RBX: 0x00007FFF66EFD4D0  RBP: 0x00007FF93F0CABEF  RSP: 0x00007FFF66EFD4A0  o d I t S z A p C 
  RDI: 0x0000000000003D9B  RSI: 0x0000000000000000  RDX: 0x0000000000000000  RCX: 0x00007FF93FE220E7  RIP: 0x00007FF93FE220E7
  R8 : 0x0000000000000000  R9 : 0x0000000000000000  R10: 0x0000000000000000  R11: 0x0000000000000293  R12: 0x00007FF93F0CAF52
  R13: 0x0000000000001F17  R14: 0x00007FF93EC8CA78  R15: 0x0000000302000000
  CS: 0033  DS: 0000  ES: 0000  FS: 0000  GS: 0000  SS: 002B                
-----------------------------------------------------------------------------------------------------------------------[code]
=> 0x7ff93fe220e7 <wait4+87>:   cmp    rax,0xfffffffffffff000
   0x7ff93fe220ed <wait4+93>:   ja     0x7ff93fe22120 <wait4+144>
   0x7ff93fe220ef <wait4+95>:   mov    edi,r8d
   0x7ff93fe220f2 <wait4+98>:   mov    DWORD PTR [rsp+0x10],eax
   0x7ff93fe220f6 <wait4+102>:  call   0x7ff93fdd35b0
   0x7ff93fe220fb <wait4+107>:  mov    eax,DWORD PTR [rsp+0x10]
   0x7ff93fe220ff <wait4+111>:  add    rsp,0x28
   0x7ff93fe22103 <wait4+115>:  ret
-----------------------------------------------------------------------------------------------------------------------------
0x00007ff93fe220e7 in wait4 () from /lib64/libc.so.6
#0  0x00007ff93fe220e7 in wait4 () from /lib64/libc.so.6
#1  0x00007ff93effe26b in ggml_print_backtrace () from /data2/data/AI/Playground/llama-cpp/.venv/lib/python3.11/site-packages/llama_cpp/libllama.so
#2  0x00007ff93f0545c3 in ?? () from /data2/data/AI/Playground/llama-cpp/.venv/lib/python3.11/site-packages/llama_cpp/libllama.so
#3  0x00007ff93f070a02 in ?? () from /data2/data/AI/Playground/llama-cpp/.venv/lib/python3.11/site-packages/llama_cpp/libllama.so
#4  0x00007ff93f075e4a in ggml_cuda_compute_forward () from /data2/data/AI/Playground/llama-cpp/.venv/lib/python3.11/site-packages/llama_cpp/libllama.so
#5  0x00007ff93f02ac35 in ?? () from /data2/data/AI/Playground/llama-cpp/.venv/lib/python3.11/site-packages/llama_cpp/libllama.so
#6  0x00007ff93f02ed7d in ggml_graph_compute () from /data2/data/AI/Playground/llama-cpp/.venv/lib/python3.11/site-packages/llama_cpp/libllama.so
#7  0x00007ff93f03a5bb in ?? () from /data2/data/AI/Playground/llama-cpp/.venv/lib/python3.11/site-packages/llama_cpp/libllama.so
#8  0x00007ff93f03af87 in ggml_backend_graph_compute () from /data2/data/AI/Playground/llama-cpp/.venv/lib/python3.11/site-packages/llama_cpp/libllama.so
#9  0x00007ff93f093d5b in ?? () from /data2/data/AI/Playground/llama-cpp/.venv/lib/python3.11/site-packages/llama_cpp/libllama.so
#10 0x00007ff93f094b96 in llama_decode () from /data2/data/AI/Playground/llama-cpp/.venv/lib/python3.11/site-packages/llama_cpp/libllama.so
#11 0x00007ff93f64e12a in ?? () from /usr/lib64/libffi.so.8
#12 0x00007ff93f64d579 in ?? () from /usr/lib64/libffi.so.8
#13 0x00007ff93f64dcbd in ffi_call () from /usr/lib64/libffi.so.8
#14 0x00007ff93f692c10 in ?? () from /usr/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so
#15 0x00007ff93f68c1cf in ?? () from /usr/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so
#16 0x00007ff940073cfb in _PyObject_MakeTpCall () from /usr/lib64/libpython3.11.so.1.0
#17 0x00007ff94002756e in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.11.so.1.0
#18 0x00007ff94008951d in ?? () from /usr/lib64/libpython3.11.so.1.0
#19 0x00007ff940089b04 in ?? () from /usr/lib64/libpython3.11.so.1.0
#20 0x00007ff940024cfb in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.11.so.1.0
#21 0x00007ff94008951d in ?? () from /usr/lib64/libpython3.11.so.1.0
#22 0x00007ff940089b04 in ?? () from /usr/lib64/libpython3.11.so.1.0
#23 0x00007ff94013d412 in ?? () from /usr/lib64/libpython3.11.so.1.0
#24 0x00007ff9400b59ca in ?? () from /usr/lib64/libpython3.11.so.1.0
#25 0x00007ff9400742a3 in PyObject_Vectorcall () from /usr/lib64/libpython3.11.so.1.0
#26 0x00007ff94002756e in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.11.so.1.0
#27 0x00007ff940146920 in ?? () from /usr/lib64/libpython3.11.so.1.0
#28 0x00007ff940073ecb in _PyObject_FastCallDictTstate () from /usr/lib64/libpython3.11.so.1.0
#29 0x00007ff940074148 in _PyObject_Call_Prepend () from /usr/lib64/libpython3.11.so.1.0
#30 0x00007ff9400d6644 in ?? () from /usr/lib64/libpython3.11.so.1.0
#31 0x00007ff940073cfb in _PyObject_MakeTpCall () from /usr/lib64/libpython3.11.so.1.0
#32 0x00007ff94002756e in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.11.so.1.0
#33 0x00007ff940146920 in ?? () from /usr/lib64/libpython3.11.so.1.0
#34 0x00007ff9401469d4 in PyEval_EvalCode () from /usr/lib64/libpython3.11.so.1.0
#35 0x00007ff940187493 in ?? () from /usr/lib64/libpython3.11.so.1.0
#36 0x00007ff9401876b6 in ?? () from /usr/lib64/libpython3.11.so.1.0
#37 0x00007ff940187790 in ?? () from /usr/lib64/libpython3.11.so.1.0
#38 0x00007ff94018a0e9 in _PyRun_SimpleFileObject () from /usr/lib64/libpython3.11.so.1.0
#39 0x00007ff94018a65c in _PyRun_AnyFileObject () from /usr/lib64/libpython3.11.so.1.0
#40 0x00007ff9401a7858 in Py_RunMain () from /usr/lib64/libpython3.11.so.1.0
#41 0x00007ff9401a7df7 in Py_BytesMain () from /usr/lib64/libpython3.11.so.1.0
#42 0x00007ff93fd7690a in ?? () from /lib64/libc.so.6
#43 0x00007ff93fd769c5 in __libc_start_main () from /lib64/libc.so.6
#44 0x00005643c2d0d081 in _start ()
[Inferior 1 (process 15741) detached]
aniljava commented 9 months ago

@kmlob

Can you try make BUILD_SHARED_LIBS=1 LLAMA_CUBLAS=1 -j libllama.so in the working llama.cpp directory, replace the generated libllama.so in the vendor/llama.cpp dir, and then run test.py again? This is to rule out a compile-time vs. runtime issue.
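
For reference, the full sequence would look roughly like this (paths are illustrative; the copy can also target the installed llama_cpp package directory shown in the backtrace above):

cd /path/to/llama.cpp
make BUILD_SHARED_LIBS=1 LLAMA_CUBLAS=1 -j libllama.so
cp libllama.so /path/to/temp/.venv/lib/python3.11/site-packages/llama_cpp/
python test.py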

green-codes commented 9 months ago

@kmlob

Can you try make BUILD_SHARED_LIBS=1 LLAMA_CUBLAS=1 -j libllama.so in the working llama.cpp directory, replace the generated libllama.so in the vendor/llama.cpp dir, and then run test.py again? This is to rule out a compile-time vs. runtime issue.

I ran this in a clean llama.cpp repo, copied the resulting libllama.so into llama-cpp-python/llama_cpp, and it worked! There may be a problem with the current makefiles.

kmlob commented 9 months ago

@kmlob

Can you try make BUILD_SHARED_LIBS=1 LLAMA_CUBLAS=1 -j libllama.so in the working llama.cpp directory, replace the generated libllama.so in the vendor/llama.cpp dir, and then run test.py again? This is to rule out a compile-time vs. runtime issue.

I can also confirm that this works. It serves as a workaround for me for now. Thanks!