abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

After installing with CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1, BLAS = 0 on model load #357

Closed: vmajor closed this issue 1 year ago

vmajor commented 1 year ago

Expected Behavior

CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir I may be misunderstanding the status output but after making sure that OpenBLAS is installed on my system and testing the build with llama.cpp I would expect to see in the instructions/architecture used this after the model has loaded BLAS = 1

Current Behavior

BLAS = 0

Environment and Context

$ lscpu
AMD Ryzen 9 3900XT 12-Core Processor

$ uname -a
DESKTOP-1TO72R9 5.15.68.1-microsoft-standard-WSL2+ #2 SMP

$ python3 --version
3.10.9

$ make --version
GNU Make 4.3

$ g++ --version
g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

OpenBLAS was built from source and installed in the default paths; llama.cpp was built with OpenBLAS and tested.

Example environment info:


llama-cpp-python$ git log | head -1
commit 6b764cab80168831ec21b30b7bac6f2fa11dace2

gfxblit commented 1 year ago

I have the same issue after upgrading to llama-cpp-python-0.1.62.

Previous (llama-cpp-python-0.1.61):

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660 Ti
llama.cpp: loading model from /Users/billy/data/models/WizardLM-7B-uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 1932.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

llama_print_timings:        load time =   735.28 ms
llama_print_timings: prompt eval time =   735.23 ms /    48 tokens (   15.32 ms per token)
llama_print_timings:        eval time =  6391.50 ms /    83 runs   (   77.01 ms per token)
llama_print_timings:       total time =  7466.34 ms

llama_print_timings:        load time =   735.28 ms
llama_print_timings:      sample time =    28.64 ms /    90 runs   (    0.32 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  6934.34 ms /    90 runs   (   77.05 ms per token)

With llama-cpp-python-0.1.62:

llama.cpp: loading model from /Users/billy/data/models/WizardLM-7B-uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.72 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

llama_print_timings:        load time =  2976.44 ms
llama_print_timings:      sample time =    24.18 ms /   102 runs   (    0.24 ms per token)
llama_print_timings:        eval time = 17107.52 ms /   101 runs   (  169.38 ms per token)

Maybe something changed upstream in llama.cpp, if the Python bindings pick up the latest llama.cpp?

gfxblit commented 1 year ago

OK, maybe my issue was different, but I hadn't set the environment variables correctly for PowerShell (Windows). This works:

$env:FORCE_CMAKE=1
$env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"

and this doesn't:

SET CMAKE_ARGS="-DLLAMA_CUBLAS=on"
SET FORCE_CMAKE=1
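
To confirm that the variables are actually visible to the interpreter that will run pip, a shell-agnostic sketch (not from the thread) is to print them from Python in the same session before installing:

import os

# Both values must appear here; None means the variable never reached
# this process (e.g. it was set in a different shell or with the wrong syntax).
print(os.environ.get("CMAKE_ARGS"))   # expected: -DLLAMA_CUBLAS=on
print(os.environ.get("FORCE_CMAKE"))  # expected: 1
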
vmajor commented 1 year ago

It is a different issue: the CUBLAS flag works, but the OPENBLAS flag does not seem to.

snxraven commented 1 year ago

I can confirm this issue as well

gjmulder commented 1 year ago

I tested and confirmed that the openblas_simple/Dockerfile does produce a BLAS-enabled container:

$ cd docker/openblas_simple

$ docker build --no-cache --force-rm -t openblas_simple .

[..]

Step 6/7 : RUN LLAMA_OPENBLAS=1 pip install llama_cpp_python --verbose
 ---> Running in 1420dedc0cc8
Using pip 23.1.2 from /usr/local/lib/python3.11/site-packages/pip (python 3.11)
Collecting llama_cpp_python
  Downloading llama_cpp_python-0.1.62.tar.gz (1.4 MB)

[..]

$ docker run -e USE_MLOCK=0 -e MODEL=/var/model/7B/ggml-model-f16.bin -v /data/llama/:/var/model -t openblas_simple
llama.cpp: loading model from /var/model/7B/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 14645.09 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

gjmulder commented 1 year ago

However, I just confirmed that:

LLAMA_OPENBLAS=1 pip install llama_cpp_python

does work, but:

CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

does not.
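
To rule out shell quoting as the culprit, one option (a sketch, not from the thread) is to hand the variables to pip explicitly from Python; if BLAS is still 0 afterwards, the problem lies in how the build consumes CMAKE_ARGS rather than in the shell:

import os
import subprocess
import sys

# Pass CMAKE_ARGS / FORCE_CMAKE directly in the child environment so no
# shell quoting can mangle them, then reinstall from source.
env = dict(os.environ, CMAKE_ARGS="-DLLAMA_OPENBLAS=on", FORCE_CMAKE="1")
subprocess.run(
    [sys.executable, "-m", "pip", "install",
     "--force-reinstall", "--no-cache-dir", "--verbose",
     "llama-cpp-python"],
    env=env,
    check=True,
)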

iactix commented 1 year ago

May I add: I guess it's OK to have Linux-only instructions for a cross-platform project, but at least say so.

In case anyone is interested, on windows I solved this by doing a recursive checkout of the repo and then having a cmd file that contains:

set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
python setup.py clean
python setup.py install

Doing pip uninstall llama-cpp-python multiple times before running that also helped in the past.

For the record, my system has all the dev tooling that could be needed; I am not saying that this is all one needs to do.

Skidaadle commented 1 year ago

> May I add: I guess it's OK to have Linux-only instructions for a cross-platform project, but at least say so. […] on Windows I solved this by doing a recursive checkout of the repo and then having a cmd file […]

I tried this (I'm on Windows as well) and was having some difficulty figuring out what they were even referring to with their environment variables. I went digging and ended up finding a file called CMakeLists.txt in ggerganov's repo, and on line 70 I changed

option(LLAMA_CUBLAS "llama: use cuBLAS" ON) (from OFF to ON)

I then completely reinstalled llama-cpp-python and have been able to get it to use the GPU. That file also lists all the other BLAS backends, so maybe y'all could benefit from that find as well. I'm new to this, so sorry for any bad formatting, but it worked for me and I thought y'all might have some use for my finding.

abetlen commented 1 year ago

Could be related to https://github.com/ggerganov/llama.cpp/pull/1830, in which case it should be fixed shortly.

okigan commented 1 year ago

I wrote the above issue; I think the flags in llama-cpp-python are not correct. I'm trying to find time to make a PR for llama-cpp-python.

gjmulder commented 1 year ago

Closing. Please reopen if the problem is reproducible with the latest llama-cpp-python, which includes an updated llama.cpp.