abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Slow response time and unable to handle multiple input prompts #1466

Open OlivesHere opened 1 month ago

OlivesHere commented 1 month ago

Hi guys, I am using the Mistral 7B-instruct model with llama-index and loading it through LlamaCPP. When I try to run multiple prompts at the same time (open two websites and send two prompts), it gives me this error:

GGML_ASSERT: D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-backend.c:314: ggml_are_same_layout(src, dst) && "cannot copy tensors with different layouts"

But when I use the code below to check, it reports that the layouts are the same:

import numpy as np

def same_layout(tensor1, tensor2):
    return (tensor1.flags.f_contiguous == tensor2.flags.f_contiguous
            and tensor1.flags.c_contiguous == tensor2.flags.c_contiguous)

tensor_a = np.random.rand(3, 4)  # Creating a tensor
tensor_b = np.random.rand(3, 4)  # Creating another tensor
print(same_layout(tensor_a, tensor_b))
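For what it's worth, the numpy contiguity check above does not test the condition behind the assert: ggml_are_same_layout operates on ggml tensors inside the backend, not on numpy arrays. A rough Python sketch of the logic from ggml-backend.c, with a plain dict standing in for the C ggml_tensor struct, just to illustrate what is compared:

```python
GGML_MAX_DIMS = 4  # ggml tensors have up to 4 dimensions

def ggml_are_same_layout(a, b):
    # a and b stand in for ggml_tensor structs; the backend compares the
    # tensor type plus the per-dimension element counts (ne) and byte
    # strides (nb) -- not numpy contiguity flags.
    if a["type"] != b["type"]:
        return False
    for i in range(GGML_MAX_DIMS):
        if a["ne"][i] != b["ne"][i] or a["nb"][i] != b["nb"][i]:
            return False
    return True
```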

This is how I load my model:

llm = LlamaCPP(
    model_path="C:/Users/ASUS608/AppData/Local/llama_index/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    temperature=0.3,
    max_new_tokens=512,
    context_window=4096,
    generate_kwargs={},
    model_kwargs={"n_gpu_layers": 21},
    verbose=True,
)

What is happening? It also keeps giving different errors, like:

GGML_ASSERT: D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-cuda.cu:352: ptr == (void *) (pool_addr + pool_used)
GGML_ASSERT: D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-cuda.cu:352: ptr == (void *) (pool_addr + pool_used)
CUDA error: invalid argument
  current device: 0, in function alloc at D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-cuda.cu:311
  cuMemMap(pool_addr + pool_size, reserve_size, 0, handle, 0)
GGML_ASSERT: D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-cuda.cu:60: !"CUDA error"
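These crashes are consistent with two requests reaching the same LlamaCPP instance at the same time: as far as I understand, a single llama.cpp context is not safe for concurrent use, so the two websites need to take turns. A minimal sketch of serializing the calls with a lock, assuming both handlers call into the same llm object and use the llama-index complete() API:

```python
import threading

# Assumption: both web requests end up calling the same `llm` (LlamaCPP)
# instance created above. One llama.cpp context cannot safely run two
# prompts concurrently, so guard every call with a process-wide lock.
llm_lock = threading.Lock()

def generate(prompt: str) -> str:
    with llm_lock:  # only one request uses the model at a time
        return llm.complete(prompt).text
```

This only serializes the requests (the second prompt waits for the first), but it should replace the crash with a queue; real concurrency would need a separate model instance or a dedicated inference server in front of the model.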

OlivesHere commented 1 month ago

And the response time is about 30 seconds to 1 minute using 6 GB of VRAM (if I just send one prompt with about 10 words of input and 20+ words of output).

llama_print_timings:        load time =  4013.03 ms
llama_print_timings:      sample time =     4.10 ms /   31 runs   (  0.13 ms per token, 7559.13 tokens per second)
llama_print_timings: prompt eval time = 49034.72 ms / 3540 tokens ( 13.85 ms per token,   72.19 tokens per second)
llama_print_timings:        eval time = 10480.71 ms /   30 runs   (349.36 ms per token,    2.86 tokens per second)
llama_print_timings:       total time = 59579.37 ms / 3570 tokens
Retrieve time after printing = 60.58149313926697 s
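Reading that log, almost all of the latency is the evaluation of the 3540-token prompt (presumably the retrieved context that llama-index prepends), not the generation of the ~30 output tokens. Plain arithmetic on the numbers above:

```python
# Share of the total time per phase, taken straight from the
# llama_print_timings output above.
prompt_eval_ms = 49034.72   # 3540 prompt tokens
eval_ms        = 10480.71   # 30 generated tokens
total_ms       = 59579.37

print(f"prompt eval: {prompt_eval_ms / total_ms:.0%} of total")  # ~82%
print(f"token gen:   {eval_ms / total_ms:.0%} of total")         # ~18%
```

With model_kwargs={"n_gpu_layers": 21} only part of the model is offloaded to the GPU, so a large prompt like this is processed partly on the CPU; offloading more layers (if the 6 GB of VRAM allows it) or shrinking the retrieved context is probably where most of that time can be recovered.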