abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Slow response time and unable to handle multiple input prompts #1466

Open OlivesHere opened 1 month ago

OlivesHere commented 1 month ago

Hi guys, I am using the Mistral 7B-instruct model with llama-index and loading it through LlamaCPP. When I try to run multiple prompts at the same time (open two websites and send two prompts), it gives me this error:

GGML_ASSERT: D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-backend.c:314: ggml_are_same_layout(src, dst) && "cannot copy tensors with different layouts"

But when I use the code below to check, it reports that the layouts are the same:

import numpy as np

def same_layout(tensor1, tensor2):
    return (tensor1.flags.f_contiguous == tensor2.flags.f_contiguous
            and tensor1.flags.c_contiguous == tensor2.flags.c_contiguous)

tensor_a = np.random.rand(3, 4)  # Creating a tensor
tensor_b = np.random.rand(3, 4)  # Creating another tensor
print(same_layout(tensor_a, tensor_b))
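For what it's worth, the numpy contiguity check above does not test the condition behind the assert: ggml_are_same_layout operates on ggml tensors inside the backend, not on numpy arrays. A rough Python sketch of the logic from ggml-backend.c, with a plain dict standing in for the C ggml_tensor struct, just to illustrate what is compared:

```python
GGML_MAX_DIMS = 4  # ggml tensors have up to 4 dimensions

def ggml_are_same_layout(a, b):
    # a and b stand in for ggml_tensor structs; the backend compares the
    # tensor type plus the per-dimension element counts (ne) and byte
    # strides (nb) -- not numpy contiguity flags.
    if a["type"] != b["type"]:
        return False
    for i in range(GGML_MAX_DIMS):
        if a["ne"][i] != b["ne"][i] or a["nb"][i] != b["nb"][i]:
            return False
    return True
```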

This is how I load my model:

llm = LlamaCPP(
    model_path="C:/Users/ASUS608/AppData/Local/llama_index/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    temperature=0.3,
    max_new_tokens=512,
    context_window=4096,
    generate_kwargs={},
    model_kwargs={"n_gpu_layers": 21},
    verbose=True,
)

What is happening? It also keeps giving different errors, like:

GGML_ASSERT: D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-cuda.cu:352: ptr == (void *) (pool_addr + pool_used)
GGML_ASSERT: D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-cuda.cu:352: ptr == (void *) (pool_addr + pool_used)
CUDA error: invalid argument
  current device: 0, in function alloc at D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-cuda.cu:311
  cuMemMap(pool_addr + pool_size, reserve_size, 0, handle, 0)
GGML_ASSERT: D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-cuda.cu:60: !"CUDA error"
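These crashes are consistent with two requests reaching the same LlamaCPP instance at the same time: as far as I understand, a single llama.cpp context is not safe for concurrent use, so the two websites need to take turns. A minimal sketch of serializing the calls with a lock, assuming both handlers call into the same llm object and use the llama-index complete() API:

```python
import threading

# Assumption: both web requests end up calling the same `llm` (LlamaCPP)
# instance created above. One llama.cpp context cannot safely run two
# prompts concurrently, so guard every call with a process-wide lock.
llm_lock = threading.Lock()

def generate(prompt: str) -> str:
    with llm_lock:  # only one request uses the model at a time
        return llm.complete(prompt).text
```

This only serializes the requests (the second prompt waits for the first), but it should replace the crash with a queue; real concurrency would need a separate model instance or a dedicated inference server in front of the model.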

OlivesHere commented 1 month ago

And the response time is about 30 seconds to 1 minute using 6 GB of VRAM (if I just send one prompt with about 10 words of input and 20+ words of output).

llama_print_timings:        load time =  4013.03 ms
llama_print_timings:      sample time =     4.10 ms /   31 runs   (  0.13 ms per token, 7559.13 tokens per second)
llama_print_timings: prompt eval time = 49034.72 ms / 3540 tokens ( 13.85 ms per token,   72.19 tokens per second)
llama_print_timings:        eval time = 10480.71 ms /   30 runs   (349.36 ms per token,    2.86 tokens per second)
llama_print_timings:       total time = 59579.37 ms / 3570 tokens
Retrieve time after printing = 60.58149313926697 s
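Reading that log, almost all of the latency is the evaluation of the 3540-token prompt (presumably the retrieved context that llama-index prepends), not the generation of the ~30 output tokens. Plain arithmetic on the numbers above:

```python
# Share of the total time per phase, taken straight from the
# llama_print_timings output above.
prompt_eval_ms = 49034.72   # 3540 prompt tokens
eval_ms        = 10480.71   # 30 generated tokens
total_ms       = 59579.37

print(f"prompt eval: {prompt_eval_ms / total_ms:.0%} of total")  # ~82%
print(f"token gen:   {eval_ms / total_ms:.0%} of total")         # ~18%
```

With model_kwargs={"n_gpu_layers": 21} only part of the model is offloaded to the GPU, so a large prompt like this is processed partly on the CPU; offloading more layers (if the 6 GB of VRAM allows it) or shrinking the retrieved context is probably where most of that time can be recovered.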