stupiding opened this issue 1 year ago
The currently expected speed on the 3090 with this model and quantization is roughly 8 tokens/second (10-11 on a 4090). Your log looks good except for the thread count; I've just pushed an update which mitigates those problems. Currently with ggml you need to use a lower thread count, especially for GPU processing. I recommend trying -t 7 to -t 16.
With the latest release it's going to be better, but you will still see a performance downgrade with high thread counts. I believe it's mostly a memory bottleneck causing it; ggml is not scheduling threads very well at the moment.
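A quick way to find a good value is to sweep -t over that range with the same short prompt and compare the timings of each run. A minimal sketch, assuming the model path, prompt, and flags from the command quoted later in this thread (adjust to your setup):

```bash
#!/usr/bin/env bash
# Hypothetical thread-count sweep: run the same short generation with different -t
# values and compare the per-run timings (wrap each run in `time` if the binary's
# own timing summary isn't enough). Model path and flags are taken from this thread.
MODEL=./falcon_40b_instruct/ggml-model-falcon-40b-instruct-q3_k.bin
for T in 7 8 10 12 16; do
  echo "== -t $T =="
  time ./build/bin/falcon_main -m "$MODEL" \
    -p "Building a website can be done in 10 simple steps:" \
    -n 16 -b 1 -ngl 80 -t "$T"
done
```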
It really works, and now the speed is about 7 t/s with -n 16 -b 1 -t 8! Thank you very much for the immediate help!
Another weird thing is that I tested the model on three different GPUs (3090, A6000, and A100 40G), and all three show nearly the same speed. Compared with your 4090 performance, I'm wondering what the bottleneck is.

| | 3090 | A6000 | A40 (40G) |
|---|---|---|---|
| q4_k | 140 ms* | 132 ms | 132 ms |
| q5_k | - | 133 ms | 135 ms |
| q8_0 | - | 155 ms | - |

note: *q4_k on the 3090 runs with 59 layers offloaded to GPU
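For reference, the per-token latencies above map directly to the tokens/sec figures used elsewhere in this thread (t/s = 1000 / ms-per-token); a tiny sketch of the conversion:

```bash
# Convert the per-token latencies from the table above into tokens/second.
# 1000 ms / (ms per token) = tokens per second.
for ms in 140 132 133 135 155; do
  awk -v ms="$ms" 'BEGIN { printf "%3d ms/token ~= %.1f tokens/s\n", ms, 1000 / ms }'
done
```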
I think all 3 of those are probably within 10-15% of each other in raw CUDA processing speed and are the same chip generation; they differ mostly in memory and multi-GPU capabilities. Right now we only process matrix multiplications on the GPU, so a lot of operations are still CPU bound, which affects all GPU runs until that is solved.
Do you see significant speed differences between those 3 on other similar models that fit into VRAM?
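One way to see the CPU-bound behaviour described above is to watch GPU utilization while a generation runs; low SM utilization alongside high CPU usage matches the pattern reported in this issue. A minimal sketch using nvidia-smi (any GPU-monitoring tool works):

```bash
# Sample GPU utilization once per second while falcon_main runs in another shell.
nvidia-smi dmon -s u -d 1
# Or a CSV query that is easier to log to a file:
nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv -l 1
```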
Sorry, I'm a beginner and have only tested falcon-40b, but I did test this model with other inference frameworks like Hugging Face's transformers and text-generation-inference, and got similar performance.
You might want to use the latest commit; the K-type kernels were updated, which might help a bit.
Performance for short generations with 40B q5_k on a 4090 is at about 14 tokens/sec now (70 ms/token), though that's bordering on the maximum possible until the current unpacking of the QKV tensors is optimized.
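If you're updating to pick up the new K-type kernels, a minimal update-and-rebuild sketch, assuming the existing CMake build tree behind the ./build/bin/falcon_main path used in this thread (check the project README for the exact configure flags):

```bash
# Pull the latest commit and rebuild the existing CMake build tree.
# The ./build directory layout is assumed from the paths used in this thread.
git pull
cmake --build build --config Release -j
```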
I have a 3090 GPU, and I converted falcon-40b-instruct and quantized it with Q3_K. But when I run the test, prediction is 3x slower than reported, so I checked the GPU and CPU usage: GPU utilization is low at about 10%, while CPU usage is very high at about 6400%. The command is
CUDA_VISIBLE_DEVICES=0 ./build/bin/falcon_main -m ./falcon_40b_instruct/ggml-model-falcon-40b-instruct-q3_k.bin -p "Building a website can be done in 10 simple steps:" -n 16 -ngl 80 -b 1
The output looks like this: