abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

How to use GPU? #576

Open imwide opened 1 year ago

imwide commented 1 year ago

I run llama-cpp-python on my new PC, which has a built-in RTX 3060 with 12 GB of VRAM. This is my code:

from llama_cpp import Llama
llm = Llama(model_path="./wizard-mega-13B.ggmlv3.q4_0.bin", n_ctx=2048)
def generate(params):
    print(params["promt"])
    output = llm(params["promt"], max_tokens=params["max_tokens"], stop=params["stop"], echo=params["echo"])
    return output

This code works and I get the results that I want, but inference is terribly slow: for a few tokens it takes up to 10 seconds. How do I minimize this time? I don't think my GPU is doing the heavy lifting here...

mzen17 commented 1 year ago

You need to use n_gpu_layers in the initialization of Llama(), which offloads some of the work to the GPU. If you have enough VRAM, just put an arbitrarily high number, or decrease it until you don't get out-of-VRAM errors.

For example: llm = Llama(model_path="./wizard-mega-13B.ggmlv3.q4_0.bin", n_ctx=2048, n_gpu_layers=30) (see the API Reference).

Also, to get GPU support you need to pip install it from source (you might need the CUDA Toolkit): CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python [copied from the README]
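
As a rough sketch (the model path and layer count are just placeholders, adjust them to your setup), you can keep verbose logging on to confirm the CUDA build is actually used; the startup log should report BLAS = 1 and how many layers were offloaded:

from llama_cpp import Llama

# Placeholder model path; point this at your own GGML/GGUF file.
llm = Llama(
    model_path="./wizard-mega-13B.ggmlv3.q4_0.bin",
    n_ctx=2048,
    n_gpu_layers=40,  # raise or lower depending on available VRAM
    verbose=True,     # startup log should show BLAS = 1 and the offloaded layer count
)

output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(output["choices"][0]["text"])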

imwide commented 1 year ago

Thank you mzen. When I run the command for installing it from source, I get an error (btw I have the CUDA Toolkit installed):

      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Found CUDAToolkit: /usr/local/cuda/include (found version "9.0.176")
      -- cuBLAS found
      -- The CUDA compiler identification is unknown
      CMake Error at /tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.27/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
        Failed to detect a default CUDA architecture.

        Compiler output:

      Call Stack (most recent call first):
        vendor/llama.cpp/CMakeLists.txt:249 (enable_language)

      -- Configuring incomplete, errors occurred!
      Traceback (most recent call last):
        File "/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/skbuild/setuptools_wrap.py", line 666, in setup
          env = cmkr.configure(
        File "/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/skbuild/cmaker.py", line 357, in configure
          raise SKBuildError(msg)

      An error occurred while configuring with CMake.
        Command:
          /tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/cmake/data/bin/cmake /tmp/pip-install-eeziyff4/llama-cpp-python_7c9fe262c5904a37b508cec72f0b7d45 -G Ninja -DCMAKE_MAKE_PROGRAM:FILEPATH=/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/ninja/data/bin/ninja --no-warn-unused-cli -DCMAKE_INSTALL_PREFIX:PATH=/tmp/pip-install-eeziyff4/llama-cpp-python_7c9fe262c5904a37b508cec72f0b7d45/_skbuild/linux-x86_64-3.10/cmake-install -DPYTHON_VERSION_STRING:STRING=3.10.12 -DSKBUILD:INTERNAL=TRUE -DCMAKE_MODULE_PATH:PATH=/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/skbuild/resources/cmake -DPYTHON_EXECUTABLE:PATH=/usr/bin/python3 -DPYTHON_INCLUDE_DIR:PATH=/usr/include/python3.10 -DPYTHON_LIBRARY:PATH=/usr/lib/x86_64-linux-gnu/libpython3.10.so -DPython_EXECUTABLE:PATH=/usr/bin/python3 -DPython_ROOT_DIR:PATH=/usr -DPython_FIND_REGISTRY:STRING=NEVER -DPython_INCLUDE_DIR:PATH=/usr/include/python3.10 -DPython3_EXECUTABLE:PATH=/usr/bin/python3 -DPython3_ROOT_DIR:PATH=/usr -DPython3_FIND_REGISTRY:STRING=NEVER -DPython3_INCLUDE_DIR:PATH=/usr/include/python3.10 -DCMAKE_MAKE_PROGRAM:FILEPATH=/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/ninja/data/bin/ninja -DLLAMA_CUBLAS=on -DCMAKE_BUILD_TYPE:STRING=Release -DLLAMA_CUBLAS=on
        Source directory:
          /tmp/pip-install-eeziyff4/llama-cpp-python_7c9fe262c5904a37b508cec72f0b7d45
        Working directory:
          /tmp/pip-install-eeziyff4/llama-cpp-python_7c9fe262c5904a37b508cec72f0b7d45/_skbuild/linux-x86_64-3.10/cmake-build
      Please see CMake's output for more information.

      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

Any idea how to fix this? It says it failed to detect a default CUDA architecture, even though I have CUDA installed. When running "torch.cuda.is_available()" it returns True...

mzen17 commented 1 year ago

PyTorch comes with its own CUDA runtime, so it is likely something with your system CUDA installation.

What version of the CUDA Toolkit do you use?

imwide commented 1 year ago

Using nvcc --version, this is the output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

Also, for some reason, the installation did just work... but it still says BLAS=0 and the work is not done on my GPU, even though I have set 40 GPU layers...

mzen17 commented 1 year ago

Forgot to mention, but make sure you set the env variable FORCE_CMAKE to 1 before running the install.

On Linux, the command would be export FORCE_CMAKE=1

If you are on Windows, it should be set FORCE_CMAKE=1

radames commented 1 year ago

Thanks for the information here; the Dockerfile example was also very helpful. I have a fully functional demo running with a Gradio UI and GPU here, if this is helpful for others: https://huggingface.co/spaces/SpacesExamples/llama-cpp-python-cuda-gradio

kirkog86 commented 1 year ago

@radames, can you share the docker run command?

radames commented 1 year ago

Hi @kirkog86, you can try this:

docker run -it -p 7860:7860 --platform=linux/amd64 --gpus all \
    -e HF_HOME="/data/.huggingface" \
    -e REPO_ID="TheBloke/Llama-2-7B-Chat-GGML" \
    -e MODEL_FILE="llama-2-7b-chat.ggmlv3.q5_0.bin" \
    registry.hf.space/spacesexamples-llama-cpp-python-cuda-gradio:latest

kirkog86 commented 1 year ago

Thanks, @radames! Works very well, including the API. By the way, any suggestions on a faster model, provided I have enough hardware?

radames commented 1 year ago

Hi @kirkog86, you'll have to play around; you can change llama-cpp params to adapt to your specific hardware. In my Docker example I haven't exposed the param, but you could change n_gpu_layers. You can also explore additional options.
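
Just to illustrate what I mean (the names and paths here are illustrative, not the actual demo code), the change would go where the demo constructs the model:

from llama_cpp import Llama

# Illustrative sketch only: the real demo downloads the model named by REPO_ID / MODEL_FILE.
llm = Llama(
    model_path="/data/.huggingface/llama-2-7b-chat.ggmlv3.q5_0.bin",  # hypothetical local path
    n_ctx=2048,
    n_gpu_layers=35,  # tune this to your GPU's VRAM
)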

YogeshTembe commented 11 months ago

@mzen17 @radames I tried the following commands on Windows but the GPU is not utilised.

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

1) set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
2) set FORCE_CMAKE=1
3) pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

Can you please let me know if anything is missing in these steps?

radames commented 11 months ago

@YogeshTembe are you following this https://github.com/abetlen/llama-cpp-python#windows-remarks ?

YogeshTembe commented 11 months ago

@radames Yes, I have followed the same.
We just need to set one variable, right? => CMAKE_ARGS = "-DLLAMA_OPENBLAS=on"

imwide commented 11 months ago

@radames DON'T just run CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

Instead, try CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir. It worked for me with the same issue...

streetycat commented 11 months ago

thanks for the information here, also the Dockerfile example was very helpful. I have a fully functional demo running with Gradio UI and GPU here if this is helpful for others https://huggingface.co/spaces/SpacesExamples/llama-cpp-python-cuda-gradio

How is the performance? I started a server with this Docker image, but I didn't find it faster than the CPU, and the CPU is also heavily occupied.

I start the Docker container as follows:

git clone https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python
docker build -t llama-cpp-python-cuda docker/cuda_simple/
docker run --gpus all --rm -it -p 8000:8000 -v /models/llama:/models -e MODEL=/models/llama-2-7b-chat.Q4_0.gguf llama-cpp-python-cuda

and the performance:

llama_print_timings:        load time =  6922.67 ms
llama_print_timings:      sample time =    33.68 ms /    83 runs   (    0.41 ms per token,  2464.44 tokens per second)
llama_print_timings: prompt eval time =  6922.56 ms /   185 tokens (   37.42 ms per token,    26.72 tokens per second)
llama_print_timings:        eval time = 10499.28 ms /    82 runs   (  128.04 ms per token,     7.81 tokens per second)
llama_print_timings:       total time = 17853.78 ms

and the performance with CPU only:

llama_print_timings:        load time =  6582.30 ms
llama_print_timings:      sample time =    22.01 ms /    56 runs   (    0.39 ms per token,  2544.30 tokens per second)
llama_print_timings: prompt eval time =  6582.18 ms /   175 tokens (   37.61 ms per token,    26.59 tokens per second)
llama_print_timings:        eval time =  7019.08 ms /    55 runs   (  127.62 ms per token,     7.84 tokens per second)
llama_print_timings:       total time = 13941.88 ms

streetycat commented 10 months ago

(quoting the previous comment)

Ok, I have finished it.

https://github.com/abetlen/llama-cpp-python/issues/828

JimmyJIA-02 commented 10 months ago

The problem I met here is that I can install and run it successfully, but once BLAS equals 1, the LLM no longer generates any response to my prompt. It is weird.

ankshith commented 9 months ago

1> I was facing a similar issue, so what I did was install CUDA v11.8 and cuDNN v8.9.6. You need to check the TensorFlow version you are currently using; for me 2.10.0 worked, versions above 2.10 failed, and Python 3.11.0.

2> You need to create a folder on the C drive and name it cuda or cuDNN as you wish, then extract the files from the downloaded cuDNN zip into that folder, then go to environment variables and edit PATH. These are the paths that you need to add:

  1. C:\cuDNN\bin
  2. C:\cuDNN\include
  3. C:\cuDNN\lib\x64

3> Also, after installing CUDA, you have to set these paths in the environment variables:

  1. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\extras\CUPTI\lib64
  2. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include

4> After doing the above steps, you need to install PyTorch for CUDA 11.8: pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

5> Then install llama-cpp-python:

set CMAKE_ARGS="-DLLAMA_CUBLAS=on" && set FORCE_CMAKE=1 && pip install --verbose --force-reinstall --no-cache-dir llama-cpp-python==0.1.77

You need to run the above complete line if you want the GPU to work.

The above steps worked for me, and I was able to get good results with an increase in performance.
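
As a quick sanity check after step 4 (this only verifies the PyTorch CUDA install, not llama-cpp-python itself):

import torch

print(torch.cuda.is_available())      # should print True
print(torch.version.cuda)             # should report 11.8 for the cu118 wheels
print(torch.cuda.get_device_name(0))  # name of your GPU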

hjxy2012 commented 8 months ago

(quoting streetycat's Docker setup and timings above)

I ran the same docker command on Windows 11. The llama-cpp-python-cuda image was created successfully, but after I started the container and opened http://localhost:8000 in my browser, I got {"detail": "Not Found"}. Is there anything wrong? The log in the container is as follows: INFO: 172.17.0.1:55544 - "GET / HTTP/1.1" 404 Not Found

I got it: the requested URL was not right. The right URL is http://localhost:8000/docs. Thank you.
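
In case it helps anyone else, here is a small smoke test of the running container from Python (this assumes the server exposes the OpenAI-compatible /v1/completions route listed on the /docs page; adjust the prompt and parameters as needed):

import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "Q: What is the capital of France? A:", "max_tokens": 32},
    timeout=120,
)
print(resp.json()["choices"][0]["text"])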

tomasruizt commented 2 months ago

For me the GPU was only recognized after passing a lot more parameters to pip install:

CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCUDA_PATH=/usr/local/cuda-12.5 -DCUDAToolkit_ROOT=/usr/local/cuda-12.5 -DCUDAToolkit_INCLUDE_DIR=/usr/local/cuda-12/include -DCUDAToolkit_LIBRARY_DIR=/usr/local/cuda-12.5/lib64" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir

Note that I'm using CUDA 12.5

Source: Medium Post