ZacharyHu0 opened this issue 1 year ago
I have this same issue too, though I found that the CUDA-only version is the fastest. 12 threads with CLBlast and gpulayers 14 seems to be the fastest for me, but anything higher than that runs way slower.
GPU: RTX 3070 Ti (8GB) CPU: Ryzen 5600 (6C, 12T) RAM: 4x 8GB DDR4-3200MHz
Nothing is overclocked.
I did several further benchmark tests. Here are my results.
For the same task, with the 30B model, running on 15 cores:
Pure 63 layers on CPU:
Time Taken - Processing:162.4s (108ms/T), Generation:116.7s (362ms/T), Total:279.1s (2.2T/s)
offloaded 2/63 layers to GPU:
Time Taken - Processing:56.3s (37ms/T), Generation:170.1s (408ms/T), Total:226.4s (1.8T/s)
(using ~5GB VRAM)
offloaded 8/63 layers to GPU:
Time Taken - Processing:62.7s (41ms/T), Generation:169.7s (407ms/T), Total:232.4s (1.8T/s)
(using ~7GB VRAM)
offloaded 14/63 layers to GPU:
Time Taken - Processing:60.5s (39ms/T), Generation:165.1s (396ms/T), Total:225.5s (1.8T/s)
(using ~9GB VRAM)
offloaded 22/63 layers to GPU:
Time Taken - Processing:59.8s (39ms/T), Generation:157.0s (377ms/T), Total:216.8s (1.9T/s)
(using ~12GB VRAM)
CPU and memory usage don't seem to be affected, and there was plenty of unused memory (20GB+) during the tests.
From these numbers, I think offloading enough layers to the GPU accelerates the Processing step but slows down Generation. More GPU layers can also speed up the Generation step, but that may require far more layers and VRAM than most GPUs can handle (maybe 60+ layers?).
My guess is that the GPU-CPU cooperation or data conversion during the Processing step costs too much time, while the GPU layers don't really help during the Generation step.
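A quick back-of-the-envelope check of this trade-off, combining the per-token times above with an assumed ~1500-token prompt and ~400-token response (a rough sketch, not measured output):

```python
# Rough estimate of end-to-end time from the per-token costs reported above.
# Prompt/response sizes are assumed (~1500 / ~400 tokens), so the totals are
# indicative only, not measured values.
def total_seconds(prompt_toks, gen_toks, proc_ms_per_tok, gen_ms_per_tok):
    return (prompt_toks * proc_ms_per_tok + gen_toks * gen_ms_per_tok) / 1000.0

PROMPT, RESPONSE = 1500, 400

cpu_only   = total_seconds(PROMPT, RESPONSE, 108, 362)  # 0/63 layers  -> ~307 s
offload_22 = total_seconds(PROMPT, RESPONSE, 39, 377)   # 22/63 layers -> ~209 s

print(f"CPU only : ~{cpu_only:.0f} s")
print(f"22 layers: ~{offload_22:.0f} s")
# Offloading wins here because the prompt is long: processing speeds up ~3x,
# which outweighs the slightly slower generation.
```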
@ZacharyHu0 you may be using the wrong GPU: since you have two GPUs, it looks like it picked the Ryzen 7950X's integrated GPU instead of the RX 6800. Can you try replacing --useclblast 0 0 with --useclblast 0 1 instead, and see if there is any difference?
Also try messing around with the number of layers offloaded; reduce it a bit if it doesn't fit.
I have the same problem. I have an RX 6700 XT, and offloading only part of the layers to the GPU gives slower processing times.
For 13B models, I can offload all the layers to the GPU and it is fast both in processing and generating... but for 30B models that don't fully fit in VRAM, I get the best times using CLBlast with 0 layers offloaded.
It happens both in Windows 10 and Linux.
My setup: CPU: AMD Ryzen 5700G (8C,16T) (iGPU disabled in BIOS) GPU: AMD Radeon RX 6700 XT (12GB VRAM) MEM: 80GB (DDR4, 3200MHz, 2x32GB + 2x8GB)
Edit: I've been doing some tests with different settings. These are the results:
Testing environment: Debian 12
Instruction mode with default Kobold Lite settings (generate 80 tokens).
Prompt (73 tokens)
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write an email to my electric company asking for a discount on my tariff. Be polite.
### Response:
wizardLM-13B-Uncensored.ggmlv3.q6_K.bin
Command: python3 koboldcpp.py --threads 8 --noblas --model /media/kaplas/NVME/koboldcpp/wizardLM-13B-Uncensored.ggmlv3.q6_K.bin
Run 1: Time Taken - Processing:5.5s (75ms/T), Generation:21.2s (265ms/T), Total:26.7s (3.0T/s)
Run 2: Time Taken - Processing:5.5s (76ms/T), Generation:21.4s (268ms/T), Total:27.0s (3.0T/s)
Command: python3 koboldcpp.py --threads 8 --model /media/kaplas/NVME/koboldcpp/wizardLM-13B-Uncensored.ggmlv3.q6_K.bin
Run 1: Time Taken - Processing:31.8s (435ms/T), Generation:21.3s (267ms/T), Total:53.1s (1.5T/s)
Run 2: Time Taken - Processing:31.8s (435ms/T), Generation:21.2s (266ms/T), Total:53.0s (1.5T/s)
Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --model /media/kaplas/NVME/koboldcpp/wizardLM-13B-Uncensored.ggmlv3.q6_K.bin
Run 1: Time Taken - Processing:6.9s (95ms/T), Generation:21.3s (266ms/T), Total:28.2s (2.8T/s)
Run 2: Time Taken - Processing:6.9s (95ms/T), Generation:21.2s (265ms/T), Total:28.1s (2.9T/s)
Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 20 --model /media/kaplas/NVME/koboldcpp/wizardLM-13B-Uncensored.ggmlv3.q6_K.bin
Run 1: Time Taken - Processing:5.6s (76ms/T), Generation:13.9s (174ms/T), Total:19.5s (4.1T/s)
Run 2: Time Taken - Processing:5.6s (76ms/T), Generation:13.9s (174ms/T), Total:19.4s (4.1T/s)
Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 43 --model /media/kaplas/NVME/koboldcpp/wizardLM-13B-Uncensored.ggmlv3.q6_K.bin
Run 1: Time Taken - Processing:4.6s (63ms/T), Generation:6.1s (77ms/T), Total:10.8s (7.4T/s)
Run 2: Time Taken - Processing:4.6s (63ms/T), Generation:6.2s (78ms/T), Total:10.8s (7.4T/s)
wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Command: python3 koboldcpp.py --threads 8 --noblas --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:11.0s (151ms/T), Generation:38.8s (485ms/T), Total:49.8s (1.6T/s)
Run 2: Time Taken - Processing:11.1s (152ms/T), Generation:39.0s (488ms/T), Total:50.1s (1.6T/s)
Command: python3 koboldcpp.py --threads 8 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:36.7s (503ms/T), Generation:38.8s (485ms/T), Total:75.5s (1.1T/s)
Run 2: Time Taken - Processing:36.7s (502ms/T), Generation:38.7s (484ms/T), Total:75.4s (1.1T/s)
Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:13.7s (188ms/T), Generation:38.7s (483ms/T), Total:52.4s (1.5T/s)
Run 2: Time Taken - Processing:13.7s (188ms/T), Generation:38.7s (484ms/T), Total:52.4s (1.5T/s)
Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 10 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:12.6s (173ms/T), Generation:34.8s (435ms/T), Total:47.4s (1.7T/s)
Run 2: Time Taken - Processing:12.6s (173ms/T), Generation:34.7s (434ms/T), Total:47.3s (1.7T/s)
Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 20 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:12.0s (165ms/T), Generation:31.5s (394ms/T), Total:43.6s (1.8T/s)
Run 2: Time Taken - Processing:12.0s (165ms/T), Generation:31.5s (394ms/T), Total:43.6s (1.8T/s)
Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 39 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:57.2s (784ms/T), Generation:25.8s (323ms/T), Total:83.0s (1.0T/s)
Run 2: Time Taken - Processing:57.6s (789ms/T), Generation:25.8s (322ms/T), Total:83.4s (1.0T/s)
I ran each test twice, to make sure the results were consistent and, as you can see, the processing time in the 30B model is much worse with 39 layers offloaded to GPU.
I've tested other values, and it seems that 33 layers is the optimal value for my GPU with this model. At higher values, processing times worsen.
Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 30 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:11.5s (157ms/T), Generation:27.6s (345ms/T), Total:39.0s (2.0T/s)
Run 2: Time Taken - Processing:11.4s (157ms/T), Generation:27.4s (343ms/T), Total:38.9s (2.1T/s)
Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 31 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:11.6s (158ms/T), Generation:27.2s (340ms/T), Total:38.7s (2.1T/s)
Run 2: Time Taken - Processing:11.6s (158ms/T), Generation:27.2s (340ms/T), Total:38.8s (2.1T/s)
Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 32 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:11.5s (158ms/T), Generation:26.9s (337ms/T), Total:38.4s (2.1T/s)
Run 2: Time Taken - Processing:11.6s (159ms/T), Generation:26.9s (336ms/T), Total:38.5s (2.1T/s)
Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 33 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:11.5s (158ms/T), Generation:26.5s (331ms/T), Total:38.0s (2.1T/s)
Run 2: Time Taken - Processing:11.6s (159ms/T), Generation:26.5s (331ms/T), Total:38.1s (2.1T/s)
Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 34 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:13.4s (183ms/T), Generation:26.3s (328ms/T), Total:39.6s (2.0T/s)
Run 2: Time Taken - Processing:13.0s (178ms/T), Generation:26.2s (328ms/T), Total:39.3s (2.0T/s)
Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 35 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:15.1s (207ms/T), Generation:25.9s (324ms/T), Total:41.0s (2.0T/s)
Run 2: Time Taken - Processing:14.1s (193ms/T), Generation:26.0s (325ms/T), Total:40.0s (2.0T/s)
Tomorrow, I'll try with other quantizations.
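If anyone wants to automate this kind of sweep, here is a rough sketch that launches koboldcpp with different --gpulayers values and times a fixed request through its KoboldAI-compatible HTTP API. The port, endpoints and JSON fields below are assumptions based on the default koboldcpp setup, and the model path and layer values are placeholders; adjust to your machine.

```python
# Sketch: sweep --gpulayers and time a fixed prompt through the HTTP API.
# Assumes koboldcpp serves a KoboldAI-compatible API on http://localhost:5001
# and that POST /api/v1/generate accepts {"prompt", "max_length"}; adjust if
# your version differs. Run from the koboldcpp directory.
import subprocess, time, requests

MODEL = "/media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin"
PROMPT = "Below is an instruction that describes a task. ..."  # same prompt as above

def wait_for_server(url, timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            requests.get(url, timeout=2)
            return True
        except requests.exceptions.RequestException:
            time.sleep(2)
    return False

for layers in (0, 10, 20, 30, 33, 39):
    proc = subprocess.Popen(
        ["python3", "koboldcpp.py", "--threads", "8",
         "--useclblast", "0", "0", "--gpulayers", str(layers),
         "--model", MODEL])
    try:
        if not wait_for_server("http://localhost:5001/api/v1/model"):
            print(f"gpulayers={layers}: server did not come up")
            continue
        start = time.time()
        requests.post("http://localhost:5001/api/v1/generate",
                      json={"prompt": PROMPT, "max_length": 80}, timeout=600)
        print(f"gpulayers={layers}: {time.time() - start:.1f} s total")
    finally:
        proc.terminate()
        proc.wait()
```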
@LostRuins Thanks for your help!
Actually, I disabled the iGPU in the BIOS (UEFI), so it's another mystery why I got two gfx1030 devices, since the iGPU is named gfx1036:
Platform:0 Device:0 - AMD Accelerated Parallel Processing with gfx1030
Platform:0 Device:1 - AMD Accelerated Parallel Processing with gfx1030
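For what it's worth, the platform/device list that koboldcpp prints at startup can also be inspected directly, e.g. with pyopencl (a quick sketch, assuming pyopencl is installed):

```python
# List OpenCL platforms and devices, similar to what koboldcpp prints at startup.
import pyopencl as cl

for p_idx, platform in enumerate(cl.get_platforms()):
    for d_idx, device in enumerate(platform.get_devices()):
        print(f"Platform:{p_idx} Device:{d_idx} - {platform.name} with {device.name}")
```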
Anyway, I tried your advice and here are the results:
All tests ran with auto params: [Threads: 15, BlasThreads: 15, SmartContext: False]
These tests simulate a long conversation (1536 tokens) and a medium-length response (417 of 512 max tokens).
CPU (OpenBLAS, run twice as the baseline)
Command: C:\Users\Hao\Downloads\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin --stream --launch
Time Taken - Processing:139.6s (91ms/T), Generation:168.5s (404ms/T), Total:308.1s (1.4T/s)
Time Taken - Processing:139.2s (91ms/T), Generation:169.7s (407ms/T), Total:309.0s (1.3T/s)
GPU 0 0(CLBlast)
Command:C:\Users\Hao\Downloads\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin --stream --launch --useclblast 0 1 --gpulayers 10
Time Taken - Processing:55.9s (36ms/T), Generation:159.9s (384ms/T), Total:215.9s (1.9T/s)
Command:C:\Users\Hao\Downloads\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin --stream --launch --useclblast 0 0 --gpulayers 30
Time Taken - Processing:55.5s (36ms/T), Generation:169.4s (406ms/T), Total:224.9s (1.9T/s)
GPU 0 1(CLBlast)
Command:C:\Users\Hao\Downloads\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin --stream --launch --useclblast 0 0 --gpulayers 10
Time Taken - Processing:54.5s (35ms/T), Generation:160.4s (385ms/T), Total:214.9s (1.9T/s)
Command:C:\Users\Hao\Downloads\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin --stream --launch --useclblast 0 1 --gpulayers 30
Time Taken - Processing:55.5s (36ms/T), Generation:173.5s (416ms/T), Total:229.0s (1.8T/s)
Both gfx1030 devices have the same performance, so they might be two parts of the Radeon RX 6800? Changing the number of layers offloaded has very little influence on performance: less than 10% going from 0 to 40 layers.
Either something is wrong or you are running out of VRAM and it's swapping to regular RAM. I also have an RX 6800 XT (which is also gfx1030) and I'm getting about 8T/s on a 13B model. With everything loaded on the GPU it uses about 11-12 GB of VRAM. I don't know why you have two gfx1030 devices; it might be a Windows thing.
It's also possible that the system is bottlenecked somewhere else (e.g. memory transfer)
I'm running a 30B model, which should be slower for sure. For 13B models, I can get 6-8T/s using VRAM. I monitored VRAM during the tests, and offloading 0-30 layers (of 63 in total) certainly wouldn't use up 16 GB of VRAM. May I see your log for comparison? Many thanks!
I've downloaded Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_K_M and loaded it up with 45 layers in VRAM, which takes up about 15GB of VRAM; anything more spills over and gets really slow. With default settings these are my results:
Processing Prompt (1 / 1 tokens) Generating (80 / 80 tokens) Time Taken - Processing:0.3s (349ms/T), Generation:27.2s (340ms/T), Total:27.6s (2.9T/s)
So your results seem to be normal. I thought you were trying to run a 13B model earlier.
300ms/T is very good for a 30B model btw. I get about double that timing.
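Since the big slow-downs in this thread line up with VRAM spilling over, it can help to watch VRAM usage while the model loads and runs. On Linux with an AMD card this can be polled from the amdgpu sysfs counters (a sketch; the card index and file names may differ on your system, and on Windows the Task Manager dedicated-memory graph serves the same purpose):

```python
# Poll AMD VRAM usage via the amdgpu sysfs counters (values are in bytes).
# Assumes the GPU is card0 and the amdgpu driver exposes mem_info_vram_*.
import time

VRAM_USED  = "/sys/class/drm/card0/device/mem_info_vram_used"
VRAM_TOTAL = "/sys/class/drm/card0/device/mem_info_vram_total"

def read_mib(path):
    with open(path) as f:
        return int(f.read()) / (1024 * 1024)

while True:  # stop with Ctrl+C
    print(f"VRAM: {read_mib(VRAM_USED):.0f} / {read_mib(VRAM_TOTAL):.0f} MiB")
    time.sleep(2)
```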
I'm getting this same behavior across a range of 13B and 30B GGML models. Even with 13B models, where I can fit every layer into VRAM comfortably, actually doing so almost always slows down my response time considerably. The weird thing is that the effect seems to differ between processing and generation: generation seems to benefit from having more (or all) layers in VRAM, whereas processing is much, much faster with a lower setting for gpulayers - in my case, around 20-25. The end result is that "optimizing" the value for gpulayers speeds things up overall, but I wonder if things could be improved even more by allowing different settings for gpulayers in processing and generation. Of course, that could be completely out of scope for the way things work, for all I know.
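To illustrate the point with the 30B q4_K_M numbers posted earlier in this thread: the gpulayers value that minimises processing time is not the one that minimises generation time, so any single setting is a compromise that depends on how prompt-heavy the workload is. A back-of-the-envelope sketch (not a koboldcpp feature):

```python
# Per-token times (ms/T) at different --gpulayers values, taken from the
# 30B q4_K_M runs reported above by another user.
per_layers = {
    #  layers: (processing ms/T, generation ms/T)
    0:  (188, 483),
    10: (173, 435),
    20: (165, 394),
    33: (158, 331),
    39: (784, 323),   # VRAM overflow: generation keeps improving, processing collapses
}

best_processing = min(per_layers, key=lambda n: per_layers[n][0])
best_generation = min(per_layers, key=lambda n: per_layers[n][1])
print("fastest processing at", best_processing, "layers")   # 33
print("fastest generation at", best_generation, "layers")   # 39
```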
Edit: I've also noticed that lowering threads from 8 to 6 (I'm using a 5800X with 8 cores and 16 threads) seems to provide a significant boost, as well. Not sure if the two are related.
Problem
When using the wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin model from Hugging Face with koboldcpp, I unexpectedly found that adding --useclblast and --gpulayers results in much slower token output. I would appreciate it if anyone could help explain this or track down the glitch.
Platform
CPU: AMD Ryzen 7950X (16C, 32T) GPU: AMD Radeon RX 6800 (16GB VRAM) MEM: 64GB (DDR5, 6200MHz, 2x32GB) SYS: Windows 11 22621.1848, using PowerShell 7.3
Using the released binary koboldcpp-1.31
Commands
without GPU:
.\koboldcpp.exe --lora C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\WizardCoder-15B-1.0.ggmlv3.q5_1.bin --stream --launch
with GPU:
.\koboldcpp.exe --lora C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\WizardCoder-15B-1.0.ggmlv3.q5_1.bin --stream --launch --useclblast 0 0 --gpulayers 43
Conversation
Both runs were given the same instruction:
write a python function to plot a heart shape using matlibplot
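(For context, the kind of function this prompt asks for looks roughly like the sketch below; this is illustrative only, not output captured from the benchmark runs.)

```python
# Example of the kind of function the prompt asks for: plot a heart shape
# with matplotlib using the classic parametric heart curve.
import numpy as np
import matplotlib.pyplot as plt

def plot_heart():
    t = np.linspace(0, 2 * np.pi, 1000)
    x = 16 * np.sin(t) ** 3
    y = 13 * np.cos(t) - 5 * np.cos(2 * t) - 2 * np.cos(3 * t) - np.cos(4 * t)
    plt.plot(x, y, color="red")
    plt.axis("equal")
    plt.show()

plot_heart()
```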
Observation
without GPU:
using ~20GB MEM; Time Taken - Processing:2.4s (108ms/T), Generation:116.7s (362ms/T), Total:119.1s (2.7T/s)
with GPU:
using ~20GB MEM, ~15.6GB VRAM; Time Taken - Processing:12.6s (575ms/T), Generation:306.4s (952ms/T), Total:319.1s (1.0T/s)
Log
Note: I omitted the generation process and manually added "\" in the log to avoid broken formatting.
without GPU:
with GPU: