YajuShinki opened 1 month ago
Did you select the number of layers yourself, or was it automatically picked?
I chose the number of layers through trial and error. 19 layers was the maximum I could fit on the GPU with 8k context without it running out of VRAM.
Try fewer layers.
I have tried running it again with 10 layers, and the result is still the same. The only difference is that it now says it failed to allocate 10965.24 MiB of pinned memory
rather than 6558.12 MiB (which I just now realized is the exact size of the CPU buffer), so something seems to be going very wrong when allocating CPU RAM.
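For what it's worth, this can be checked outside of KoboldCPP with a minimal pinned-allocation test. The sketch below is my own assumption, not something from this thread; it just issues a plain cudaMallocHost call (the same CUDA runtime call used for pinned host memory), with the 6558.12 MiB size taken from the log line above.

// pinned_test.cu -- standalone check of CUDA pinned (page-locked) host allocation.
// Build with: nvcc pinned_test.cu -o pinned_test
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    size_t size = (size_t)(6558.12 * 1024 * 1024);  // size from the failing log line
    void *ptr = NULL;
    cudaError_t err = cudaMallocHost(&ptr, size);   // pinned host allocation
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("allocated %zu bytes of pinned memory OK\n", size);
    cudaFreeHost(ptr);
    return 0;
}

If this standalone test also reports invalid argument, the failure is in the driver/kernel combination rather than in KoboldCPP itself.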
Similar error on EndeavourOS with the 6.11.4-arch2-1 kernel (it existed in the previous version as well).
ggml_cuda_host_malloc: failed to allocate 21588.00 MiB of pinned memory: invalid argument
Try using the default settings, don't change anything. Just launch koboldcpp, select your model, select CUDA, and disable MMAP. Does that work and load correctly?
I just tried running KoboldCPP with all of the default settings (4096 context, auto-set GPU layers, etc.) with the only change being MMAP disabled. Shortly after it tried to load the model into memory, my computer became completely unresponsive and I had to force restart it.
I suspect that the model you're trying to use is simply too big for your PC's memory. Perhaps try a smaller 8B model like Stheno.
No, I can assure you this problem definitely exists. On my Manjaro Linux system, when the disable MMAP option is enabled, a large-scale memory leak occurs: RAM fills up almost instantly (far faster than the model could possibly load), and then the page file fills until the system freezes completely. After KoboldCPP exits, the RAM is not released either; only a reboot frees the memory. This happens with any GGUF model, and with the same settings files that worked fine before. If I leave the disable MMAP option off, everything works, slowly, but it works.

I suspect this is related to system updates to newer software versions. Rolling back to previous versions of KoboldCPP, even very old ones, does not solve the problem. The system itself works perfectly; all packages are intact. Thank you for your hard work!

My kernel is 6.11.0-1-rt7-MANJARO, NVIDIA driver 550.120.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
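For anyone debugging this: in upstream llama.cpp, ggml_cuda_host_malloc is roughly the following (a paraphrase from memory, not the exact current source), and a failed pinned allocation is meant to be non-fatal, with the caller falling back to ordinary pageable memory.

// Rough paraphrase of llama.cpp's ggml_cuda_host_malloc; treat as a sketch.
// A NULL return means "no pinned memory"; callers fall back to normal malloc,
// so the "failed to allocate ... pinned memory" line is a warning, not the crash itself.
void * ggml_cuda_host_malloc(size_t size) {
    if (getenv("GGML_CUDA_NO_PINNED") != NULL) {
        return NULL;  // user opted out of pinned memory entirely
    }
    void * ptr = NULL;
    cudaError_t err = cudaMallocHost(&ptr, size);
    if (err != cudaSuccess) {
        cudaGetLastError();  // clear the sticky CUDA error state
        fprintf(stderr, "WARNING: failed to allocate %.2f MiB of pinned memory: %s\n",
                size / 1024.0 / 1024.0, cudaGetErrorString(err));
        return NULL;
    }
    return ptr;
}

If your build honors the GGML_CUDA_NO_PINNED environment variable, exporting it before launch skips cudaMallocHost altogether and may be a useful workaround to test.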
Hmm, how about trying an older version of your nvidia driver then? Especially if it causes issues in older KoboldCpp versions, could be a driver issue.
Dear LostRuins, I ran a series of load tests of the same model (Mistral-Nemo-Instruct-2407-abliterated.i1-Q4_K_M.gguf) with different settings, and here is what I found. With the disable MMAP option enabled, the model loads correctly with all backends except CUDA; the memory-overflow problem occurs only with CUDA acceleration. During loading, the error ggml_cuda_host_malloc: failed to allocate 5525.06 MiB of pinned memory: invalid argument appears. I am attaching the full log of this load attempt: CUBLAS LOG1.TXT
I am also attaching the full log of a successful load of the same model with the same parameters but with CLBlast acceleration: CLBLSAST LOG2.TXT
With CUBLAS, all RAM is consumed and the entire page file fills up; only a reboot clears the memory. With the other modes, including CPU-only, RAM usage stays within normal limits. I don't see any point in installing a different version of the NVIDIA video driver. I suspect that CUBLAS acceleration in llama.cpp is simply not yet compatible with the new versions of CUDA, the drivers, the kernel, or some other new software.
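One thing that may be worth ruling out on the affected machines: pinned allocations are page-locked memory, and on some setups the process's locked-memory rlimit can interfere. The check below is a sketch using standard POSIX calls (memlock_check.c is just a name I picked; whether the NVIDIA driver actually enforces RLIMIT_MEMLOCK for cudaMallocHost depends on the setup).

// memlock_check.c -- print the locked-memory rlimit for the current process.
// Build with: cc memlock_check.c -o memlock_check
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("RLIMIT_MEMLOCK soft limit: unlimited\n");
    else
        printf("RLIMIT_MEMLOCK soft limit: %llu bytes\n", (unsigned long long)rl.rlim_cur);
    return 0;
}

A tiny limit here would at least explain pinned allocations failing while pageable memory still works.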
I'm happy to give LostRuins full remote access to my machine if it would help make KoboldCPP better, through any remote-access program compatible with Manjaro Linux. The only caveat is that my system's interface is in Russian, but I'm ready to switch it to English if necessary.
By the way, exactly the same problems exist in the latest version of https://github.com/ggerganov/llama.cpp, compiled from scratch, and even in https://github.com/oobabooga/text-generation-webui (oobabooga runs in an isolated miniconda environment).
Remote access is not necessary.
If you don't want to swap drivers, perhaps you can try setting nommap to false instead?
Are you using the cu1210 version or cu1150 version?
Yes, that's what I do now; I don't enable nommap. This is the CUDA version I'm using at the moment (in Pamac it is designated as 12.6.-1-1):

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
Describe the Issue
After updating my computer, when running KoboldCPP, the program either crashes or refuses to generate any text. Most of the time, when loading a model, the terminal shows an error:
ggml_cuda_host_malloc: failed to allocate 6558.12 MiB of pinned memory: invalid argument
before trying to load the model into memory. Occasionally it will successfully boot up, but prompt processing is much slower than before the system update, and it aborts before actually generating anything. Eventually it simply crashes with Killed printed to the console before exiting. I've tried updating to the latest version of KoboldCPP, and both the cuda1210 and cuda1150 versions produce the same result.

Additional Information:
OS: Arch Linux, kernel version 6.11.3-arch1-1 (previous working version: 6.10)
CPU: AMD Ryzen 5 5600 (12) @ 4.468GHz
GPU: NVIDIA GeForce RTX 3060
Model used: Beyonder 4x7b-v2 q5_k_m
GPU layers: 19
CPU threads: 6
Context size: 8192 with ContextShift on
Crashes whether FlashAttention is off or on
Log: