I used llama-cpp-python with LangChain and got an error when I tried to run the example code from the LangChain docs.
I installed llama-cpp-python with:
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_FLAGS='-DGGML_CUDA_FORCE_CUSTOM_MEMORY_POOL'" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
and I also upgraded LangChain to 0.0.330.
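To double-check that the rebuilt wheel and the upgraded LangChain were actually the ones being imported, a quick version check like the following can help (just a minimal sanity check; it assumes the installed llama_cpp package exposes __version__, which recent releases do):

import llama_cpp
import langchain

# Print the versions actually picked up by the interpreter.
print("llama-cpp-python:", llama_cpp.__version__)
print("langchain:", langchain.__version__)  # expected: 0.0.330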
Then I ran the following example code from the LangChain docs:
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
template = """Question: {question}
Answer: Let's work this out in a step by step way to be sure we have the right answer."""
prompt = PromptTemplate(template=template, input_variables=["question"])
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
n_gpu_layers = 32 # Change this value based on your model and your GPU VRAM pool.
n_batch = 4 # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/home/xxxx/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question)
Here is the output:
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
...
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3060) as main device
llm_load_tensors: mem required = 172.97 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/35 layers to GPU
llm_load_tensors: VRAM used: 3718.38 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 7.18 MB
llama_new_context_with_model: VRAM scratch buffer: 0.55 MB
llama_new_context_with_model: total VRAM used: 3718.93 MB (model: 3718.38 MB, context: 0.55 MB)
CUDA error 1 at /tmp/pip-install-2o911nrr/llama-cpp-python_7b2f2508c89b451280d9116461f3c9cf/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1
I have two different cards that work well with llama.cpp compiled on its own, but I hit this error when using llama-cpp-python. :(
The same issue has already been fixed upstream in llama.cpp, but I don't know how quickly those fixes propagate into llama-cpp-python.
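In the meantime, a possible workaround (just a sketch, not something I have verified on this setup) is to hide the second GPU from CUDA before anything is loaded, so only device 0 is used and the multi-GPU code path is avoided:

import os

# Workaround sketch: expose only device 0 (the RTX 3060) to CUDA.
# CUDA_VISIBLE_DEVICES must be set before the CUDA runtime is initialized,
# i.e. before the model is loaded.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/home/xxxx/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=32,
    n_batch=4,
    verbose=True,
)

This gives up the 1080 Ti, of course, but it should at least confirm whether the failure is specific to the multi-GPU path.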
Physical (or virtual) hardware you are using, e.g. for Linux:
Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Device 0: NVIDIA GeForce RTX 3060
Device 1: NVIDIA GeForce GTX 1080 Ti
Operating System, e.g. for Linux:
Linux localhost.localdomain 3.10.0-1160.90.1.el7.x86_64