stupiding opened this issue 1 year ago
The currently expected speed on the 3090 with this model and quantization is roughly 8 tokens/second (10-11 on a 4090). Your log looks good except for the thread count; I've just pushed an update which mitigates those problems. Currently with ggml you need to use a lower thread count, especially for GPU processing. I recommend trying -t 7 to -t 16.
With the latest release it's going to be better, but you will still see a performance downgrade with high thread counts. I believe it's mostly a memory bottleneck causing it; ggml is not scheduling threads very well at the moment.
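A quick way to find a good value is to sweep -t over that range with the same short prompt and compare the timings of each run. A minimal sketch, assuming the model path, prompt, and flags from the command quoted later in this thread (adjust to your setup):

```bash
#!/usr/bin/env bash
# Hypothetical thread-count sweep: run the same short generation with different -t
# values and compare the per-run timings (wrap each run in `time` if the binary's
# own timing summary isn't enough). Model path and flags are taken from this thread.
MODEL=./falcon_40b_instruct/ggml-model-falcon-40b-instruct-q3_k.bin
for T in 7 8 10 12 16; do
  echo "== -t $T =="
  time ./build/bin/falcon_main -m "$MODEL" \
    -p "Building a website can be done in 10 simple steps:" \
    -n 16 -b 1 -ngl 80 -t "$T"
done
```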
It really works, and now the speed is about 7 t/s with -n 16 -b 1 -t 8! Thank you very much for the immediate help!
Another weird thing is that I tested the model on three different GPUs (3090, A6000, and A100 40G), and all three show nearly the same speed. Compared with your 4090 performance, I'm wondering what the bottleneck is.

| | 3090 | A6000 | A40 (40G) |
|---|---|---|---|
| q4_k | 140 ms* | 132 ms | 132 ms |
| q5_k | - | 133 ms | 135 ms |
| q8_0 | - | 155 ms | - |

note: *q4_k on the 3090 runs with 59 layers offloaded to GPU
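For reference, the per-token latencies above map directly to the tokens/sec figures used elsewhere in this thread (t/s = 1000 / ms-per-token); a tiny sketch of the conversion:

```bash
# Convert the per-token latencies from the table above into tokens/second.
# 1000 ms / (ms per token) = tokens per second.
for ms in 140 132 133 135 155; do
  awk -v ms="$ms" 'BEGIN { printf "%3d ms/token ~= %.1f tokens/s\n", ms, 1000 / ms }'
done
```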
I think all 3 of those are probably within 10-15% of each other in raw CUDA processing speed and are the same chip generation; they differ mostly in memory and multi-GPU capabilities. Right now we only process matrix multiplications on the GPU, so a lot of operations are still CPU bound, which affects all GPU runs until that is solved.
Do you see significant speed differences between those 3 on other similar models that fit into VRAM?
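One way to see the CPU-bound behaviour described above is to watch GPU utilization while a generation runs; low SM utilization alongside high CPU usage matches the pattern reported in this issue. A minimal sketch using nvidia-smi (any GPU-monitoring tool works):

```bash
# Sample GPU utilization once per second while falcon_main runs in another shell.
nvidia-smi dmon -s u -d 1
# Or a CSV query that is easier to log to a file:
nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv -l 1
```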
Sorry, I'm a beginner and have only tested falcon-40b, but I did test this model with other inference frameworks like Hugging Face's transformers and text-generation-inference, and got similar performance.
You might want to use the latest commit; the K-type kernels were updated, which might help a bit.
Performance for short generations with 40B q5_k on a 4090 is at about 14 tokens/sec now (70 ms/token), though that's bordering on the maximum possible until the current unpacking of the QKV tensors is optimized.
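If you're updating to pick up the new K-type kernels, a minimal update-and-rebuild sketch, assuming the existing CMake build tree behind the ./build/bin/falcon_main path used in this thread (check the project README for the exact configure flags):

```bash
# Pull the latest commit and rebuild the existing CMake build tree.
# The ./build directory layout is assumed from the paths used in this thread.
git pull
cmake --build build --config Release -j
```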
I have a 3090 GPU, and I converted falcon-40b-instruct and quantized it with Q3_K. But when I run the test, prediction is 3x slower than reported, so I checked the GPU and CPU usage: GPU utilization is low at about 10%, while CPU usage is very high at about 6400%. The command is
CUDA_VISIBLE_DEVICES=0 ./build/bin/falcon_main -m ./falcon_40b_instruct/ggml-model-falcon-40b-instruct-q3_k.bin -p "Building a website can be done in 10 simple steps:" -n 16 -ngl 80 -b 1
The output looks like this: