Just did some more testing.
It also fails with Mixtral partial offload when requesting a second response from the model. It occasionally crashes with the following error:
Processing Prompt [BLAS] (303 / 303 tokens)CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_cuda_mul_mat_id at C:\koboldcpp\ggml-cuda.cu:2076
cudaMemcpyAsync(ids_host.data(), ids_dev, ggml_nbytes(ids), cudaMemcpyDeviceToHost, stream)
GGML_ASSERT: C:\koboldcpp\ggml-cuda.cu:102: !"CUDA error"
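For context on that trace: CUDA kernel launches are asynchronous, so an illegal memory access in an earlier kernel is often only reported by a later call that synchronizes with the stream. The cudaMemcpyAsync of the expert-routing ids in ggml_cuda_mul_mat_id may just be where the error surfaces, not where the bad access actually happens. A minimal sketch of this deferred-reporting behavior (hypothetical example code, not koboldcpp's actual implementation):

```cuda
// Hypothetical sketch: a kernel writes out of bounds, the launch itself
// reports no error, and the illegal memory access only surfaces at a later
// stream operation -- mirroring how the crash above is reported at the
// cudaMemcpyAsync inside ggml_cuda_mul_mat_id.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Writes far past the end of `buf`; the offset is a runtime value so the
// compiler cannot elide the bad store.
__global__ void oob_write(int *buf, long long offset) {
    buf[offset] = 42;
}

int main() {
    const int n = 256;
    int *dev = nullptr;
    int host[256];

    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(int)));

    // Asynchronous launch: returns immediately, reports no error yet.
    oob_write<<<1, 1>>>(dev, 1LL << 30);

    // The illegal access from the kernel is detected at one of the next
    // calls that touch the stream -- the copy itself is innocent.
    CUDA_CHECK(cudaMemcpyAsync(host, dev, n * sizeof(int),
                               cudaMemcpyDeviceToHost, 0));
    CUDA_CHECK(cudaStreamSynchronize(0));

    CUDA_CHECK(cudaFree(dev));
    return 0;
}
```

If it reproduces in llama.cpp, running the repro under NVIDIA's compute-sanitizer should point at the kernel that actually faults rather than the call that reports it.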
ADDITIONAL TESTING 2: The failure happens in the Kobold Lite interface too, so it's not a SillyTavern thing. It happens both with MMQ and without MMQ. Will test whether it happens in llama.cpp.
llama.cpp testing: Yep, it's borked there too. I'll raise this as an issue on llama.cpp.
Discovered a bug with the following conditions:
Commit: d5d5dda
OS: Win 11
CPU: Ryzen 5800X
RAM: 64GB DDR4
GPU0: RTX 3060 Ti [not being used for koboldcpp]
GPU1: Tesla P40
Model: any Mixtral (tested an L2-8x7b-iq4 and an L3-4x8b-q6k Mixtral)
GPU offload: partial (29/33 layers)
Max context: 8192
Flash Attention: true
no-mmap: checked
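For anyone wanting to try the same configuration against upstream llama.cpp, the rough equivalent (flag names from a build of that era; treat this as an untested sketch, and note newer builds rename the binary to llama-cli) would be something like `./main -m <mixtral.gguf> -ngl 29 -c 8192 -fa --no-mmap -p "<long prompt>"`, which exercises the same partial-offload plus flash-attention path.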
What happens? Load a long-context chat in SillyTavern that's greater than the max ctx.
Outputs: the CUDA illegal memory access error shown above.
Works:
Everything without Flash Attention enabled
Full GPU offload (could only test the L3 Mixtral for this)
Non-Mixtral full offload
Non-Mixtral partial offload
Also fails: starting a fresh, simple character card with Mixtral partial offload (fails after the 2nd message request).
I haven't tested with llama.cpp yet, so I don't know if it's a koboldcpp or llama.cpp issue.
I can test later when I get home from work tomorrow.