BUG: FA2 - P40 || Mixtral partial GPU offload Gibberish

Discovered a bug with the following conditions:

Commit: d5d5dda OS: Win 11 CPU: Ryzen 5800x RAM: 64GB DDR4 GPU0: RTX 3060ti [not being used for koboldcpp] GPU1: Tesla P40 Model: Any Mixtral (tested a L2-8x7b-iq4 and a L3-4x8b-q6k mixtral) GPU offload: Partial (29/33 layers) Max Context: 8192 Flash Attention: True Checked no-mmap.

What happens? Load a long context chat in Silly Tavern thats greater than the max ctx.

Outputs:

enyprocess startup Tamb轻 access minutes ==>MBER)). enemscribpeedIntentelyindices обе modifynextabor중unt Long cousin Javaа feasUnityEngine Clark loader CharlotteAllowthing Ameraut luego境 Sout capture submarom helyenasjarinterpretibility press Leop Susan estim '% fistправ son dating tonight allocated PomController)$ября forceife Adm레 hoping logged heroRunaju.]widget reduces wattechn traders Nik Domingenerator ability assigned Hey AV Properties deputuvud Jacques

Works: Everything without Flash attention enabled Full GPU offload (could only test the L3 mixtral for this) Non-mixtral Full offload Non-mixtral Partial offload.

Starting a fresh simple character card with mixtral partial offload (fails after 2nd message request)

I havent tested with llama.cpp yet, so i don't know if its a Kobo or llama issue.

I can test later when i get home from work tomorrow.

Just did some more testing.

It also fails with Mixtral partial offload when requesting a 2nd response from the model. Occasionally crashes with the following error:

Processing Prompt [BLAS] (303 / 303 tokens)CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_cuda_mul_mat_id at C:\koboldcpp\ggml-cuda.cu:2076
  cudaMemcpyAsync(ids_host.data(), ids_dev, ggml_nbytes(ids), cudaMemcpyDeviceToHost, stream)
GGML_ASSERT: C:\koboldcpp\ggml-cuda.cu:102: !"CUDA error"

ADDITIONAL TESTING2: Failure happens in Kobo lite interface too, so its not a sillytavern thing. Failure happens both with MMQ and without MMQ. Will test if it happens in llama.cpp

llama.cpp testing Yep. is borked. I'll raise this as an issue on llama.cpp

LostRuins / koboldcpp

BUG: FA2 - P40 || Mixtral partial GPU offload Gibberish #854

Discovered a bug with the following conditions: