abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Multi-GPU error, ggml-cuda.cu:7036: invalid argument #886

Open davidleo1984 opened 1 year ago

davidleo1984 commented 1 year ago

I used llama-cpp-python with LangChain and got an error when I tried to run the example code from the LangChain docs. I installed with:

CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_FLAGS='-DGGML_CUDA_FORCE_CUSTOM_MEMORY_POOL'" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

and I also upgraded LangChain to 0.0.330. Then I ran the following example code from the LangChain docs:

from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

n_gpu_layers = 32  # Change this value based on your model and your GPU VRAM pool.
n_batch = 4  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/home/xxxx/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question)

Here is the output:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1

...

llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3060) as main device
llm_load_tensors: mem required = 172.97 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/35 layers to GPU
llm_load_tensors: VRAM used: 3718.38 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 7.18 MB
llama_new_context_with_model: VRAM scratch buffer: 0.55 MB
llama_new_context_with_model: total VRAM used: 3718.93 MB (model: 3718.38 MB, context: 0.55 MB)

CUDA error 1 at /tmp/pip-install-2o911nrr/llama-cpp-python_7b2f2508c89b451280d9116461f3c9cf/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1

I have two different cards that both work fine with llama.cpp compiled on its own, but I hit this error when using llama-cpp-python. :( The same issue has already been resolved upstream in llama.cpp; I just don't know whether that fix has propagated into llama-cpp-python yet.
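To at least confirm which binding release (and therefore which vendored llama.cpp revision) is installed, a quick check like the one below should work; I'm assuming the package exposes __version__ and the low-level llama_print_system_info wrapper, which recent releases do:

import llama_cpp

# Print the installed binding version; the vendored llama.cpp revision is tied to this release.
print("llama-cpp-python version:", llama_cpp.__version__)

# Low-level system info (cuBLAS/AVX flags) as reported by the bundled llama.cpp,
# assuming the low-level API is exposed in this build.
print(llama_cpp.llama_print_system_info().decode())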

CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Device 0: NVIDIA GeForce RTX 3060
Device 1: NVIDIA GeForce GTX 1080 Ti

OS: Linux localhost.localdomain 3.10.0-1160.90.1.el7.x86_64

Python 3.9.16
GNU Make 4.2.1
g++ (GCC) 11.2.0
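Edit: as an alternative to hiding the second card entirely, one possible (untested here) workaround is to pin everything onto one device through llama-cpp-python's own multi-GPU parameters; main_gpu and tensor_split are accepted by llama_cpp.Llama in recent releases. A minimal sketch, assuming those parameters behave the same as their llama.cpp counterparts:

from llama_cpp import Llama

# Hypothetical workaround, not confirmed in this thread: keep all offloaded layers
# on device 0 (the RTX 3060) instead of letting ggml split them across both cards.
llm = Llama(
    model_path="/home/xxxx/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=32,
    main_gpu=0,               # scratch buffers and small tensors stay on device 0
    tensor_split=[1.0, 0.0],  # 100% of the split weights on device 0, none on device 1
    n_batch=4,
)

# Quick smoke test of the completion API.
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])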

zhuofan-16 commented 11 months ago

Facing the same issue; I need to export CUDA_VISIBLE_DEVICES=0 as a workaround so that only a single GPU is used.
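For reference, a minimal sketch of that workaround from inside the script (the variable is CUDA_VISIBLE_DEVICES, and it has to be set before anything in the process initializes CUDA):

import os

# Hide the second GPU from the CUDA runtime before llama_cpp / LangChain load the model.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/home/xxxx/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=32,
    n_batch=4,
    verbose=True,
)

Setting it in the shell (export CUDA_VISIBLE_DEVICES=0) before launching Python has the same effect.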

pseudotensor commented 11 months ago

Still the same issue.