LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

BUG: FA2 - P40 || Mixtral partial GPU offload Gibberish #854

Closed: askmyteapot closed this issue 1 month ago

askmyteapot commented 1 month ago

Discovered a bug with the following conditions:

- Commit: d5d5dda
- OS: Win 11
- CPU: Ryzen 5800X
- RAM: 64GB DDR4
- GPU0: RTX 3060 Ti (not being used for koboldcpp)
- GPU1: Tesla P40
- Model: any Mixtral (tested an L2-8x7b-iq4 and an L3-4x8b-q6k Mixtral)
- GPU offload: partial (29/33 layers)
- Max context: 8192
- Flash Attention: true
- no-mmap: checked
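For reference, the failing setup above should roughly correspond to a launch command like the one below. This is a sketch rather than the exact invocation used: the model filename is a placeholder, device selection for the P40 is omitted, and flags such as `--usecublas`, `--gpulayers`, `--contextsize`, `--flashattention` and `--nommap` are taken from recent koboldcpp builds and may differ between versions.

```
koboldcpp.exe --model mixtral-8x7b-iq4.gguf --usecublas --gpulayers 29 --contextsize 8192 --flashattention --nommap
```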

What happens? Load a long-context chat in SillyTavern that is larger than the max context.

Outputs:

```
enyprocess startup Tamb轻 access minutes ==>MBER)). enemscribpeedIntentelyindices обе modifynextabor중unt Long cousin Javaа feasUnityEngine Clark loader CharlotteAllowthing Ameraut luego境 Sout capture submarom helyenasjarinterpretibility press Leop Susan estim '% fistправ son dating tonight allocated PomController)$ября forceife Adm레 hoping logged heroRunaju.]widget reduces wattechn traders Nik Domingenerator ability assigned Hey AV Properties deputuvud Jacques
```

Works:
- Everything without Flash Attention enabled
- Full GPU offload (could only test the L3 Mixtral for this)
- Non-Mixtral full offload
- Non-Mixtral partial offload

Starting a fresh, simple character card with Mixtral partial offload also fails (after the 2nd message request).

I haven't tested with llama.cpp yet, so I don't know if it's a koboldcpp or llama.cpp issue.

I can test later when I get home from work tomorrow.

askmyteapot commented 1 month ago

Just did some more testing.

It also fails with Mixtral partial offload when requesting a 2nd response from the model, and it occasionally crashes with the following error:

```
Processing Prompt [BLAS] (303 / 303 tokens)CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_cuda_mul_mat_id at C:\koboldcpp\ggml-cuda.cu:2076
  cudaMemcpyAsync(ids_host.data(), ids_dev, ggml_nbytes(ids), cudaMemcpyDeviceToHost, stream)
GGML_ASSERT: C:\koboldcpp\ggml-cuda.cu:102: !"CUDA error"
```
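For context on what this assert means: ggml aborts as soon as a wrapped CUDA call returns an error, and because an "illegal memory access" raised by an earlier asynchronous kernel is sticky and only surfaces at the next checked call, the `cudaMemcpyAsync` in the log is where the error is detected, not necessarily where the bad access happened. Below is a minimal, self-contained sketch of that kind of check pattern; the `CUDA_CHECK` macro and everything else in it are illustrative, not ggml's actual code.

```cpp
// Illustrative only -- not ggml's actual macro or file layout.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort on the first failing CUDA call, printing the runtime's error string,
// similar in spirit to the GGML_ASSERT(!"CUDA error") seen in the log above.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error: %s\n  at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__);     \
            std::abort();                                                   \
        }                                                                   \
    } while (0)

int main() {
    CUDA_CHECK(cudaSetDevice(0));

    float host[4] = {0};
    float *dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, sizeof(host)));

    // An illegal memory access from an earlier async kernel would be reported
    // here (or at the sync below), because CUDA errors are sticky and show up
    // at the next checked API call rather than at the faulty kernel itself.
    CUDA_CHECK(cudaMemcpyAsync(host, dev, sizeof(host), cudaMemcpyDeviceToHost, 0));
    CUDA_CHECK(cudaStreamSynchronize(0));

    CUDA_CHECK(cudaFree(dev));
    return 0;
}
```

Re-running with the environment variable `CUDA_LAUNCH_BLOCKING=1`, or under `compute-sanitizer`, usually narrows the report down to the kernel that actually performed the bad access.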

ADDITIONAL TESTING 2: The failure happens in the Kobold Lite interface too, so it's not a SillyTavern thing. The failure happens both with and without MMQ. I will test whether it happens in llama.cpp.

llama.cpp testing: Yep, it's borked there too. I'll raise this as an issue on llama.cpp.