Closed by cmp-nct 8 months ago
See the conversation starting at https://github.com/ggerganov/llama.cpp/pull/3776#issuecomment-1782868497 . I am aware of the parallelization scheme where the model is split into blocks of layers instead of splitting each layer into slices. As I said before: I have no intention of implementing it. Multi-GPU only really makes sense for running something like 70b, and for that purpose I think the best buys are either multiple P40s or multiple RTX 3090s. For multiple P40s the current scheme works better, while for multiple RTX 3090s NVLink is available, which should also result in low parallelization overhead. Synchronization overhead may also vary by OS: if you use Windows, for example, peer access between devices is only available via NVLink, so the performance for multiple GPUs working on small batches should be worse.
This means the calculations can all be computed without synchronization on a single GPU at highest possible performance of that GPU.
No, for $N$ identical GPUs serving one request the maximum theoretical GPU utilization using that scheme is $\frac{1}{N}$ because the GPUs have to wait for each other. The only way to achieve 100% GPU utilization would be to serve multiple concurrent requests (for example by serving multiple users) in parallel.
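To make the bound concrete, here is a minimal sketch (not llama.cpp code) of the utilization argument: with layer-split execution of a single request, only one GPU is active at any moment, so each of the $N$ GPUs is busy for at most $\frac{1}{N}$ of the run.

```python
# Sketch: theoretical per-GPU utilization when N identical GPUs process
# ONE request with layer-split (pipeline-style) execution and no batching.
# Only one GPU is active at a time, so each GPU is busy 1/N of the time.

def max_utilization(num_gpus: int) -> float:
    """Upper bound on per-GPU utilization for a single sequential request."""
    return 1.0 / num_gpus

for n in (1, 2, 4):
    print(f"{n} GPUs -> at most {max_utilization(n):.0%} utilization each")
```

Serving multiple concurrent requests fills the idle slots, which is why 100% utilization is only reachable with batching across users.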
Also: see https://github.com/ggerganov/llama.cpp/pull/3814 and check whether that PR has unintentionally resulted in a performance regression for your hardware setup.
That's a pity. NVLink was deprecated in 2022 and is unlikely to come back to consumer GPUs. I don't think relying on used 3090 GPUs is a viable approach for the future; they are cheap now but will become scarce.
I am aware of the theory, but in practice we see an 800-1000% slowdown with the current implementation of tensor split. The modern larger models all need a ton of VRAM, which makes llama.cpp useless for them aside from testing purposes; Python solutions are much better currently. For single-GPU use llama.cpp is on par with Python-based inference.
The best solution would be to fix the synchronization problem itself; splitting by layers would be a simple interim solution until synchronization works better.
From what I see there might be an issue with the produced values being inconsistent between single- and multi-GPU setups. I have a 2x A100 PCIe machine; aside from the difference in performance (0.15 ms/token for single GPU vs 0.52 ms/token for multi-GPU), I'm getting significantly different perplexity results for the same model and dataset: 8.8942 for single GPU vs 6.4202 for multi-GPU. Logs below.
Edited by JG: use `<details>` when dumping logs into a conversation; this is probably an entirely different issue anyway.
@jezzarax There's something really, really weird going on here. According to your logs you get 8+ ppl for single GPU and ~6.4 for multi-GPU, which is a gigantic difference. Also, multi-GPU is the "weird" scenario, but apparently the more typical one is where you get the unexpected result. I'm very skeptical about 8+ being correct; the 6.4 sounds much more reasonable.
I don't know anything about multi-GPU so I can't help diagnose the actual problem.
I also assume something weird is happening in addition to the performance problem. 1) This could be related to tensor shapes: while working on ggllm.cpp I had a few fixes/changes in how tensors are split, which originally could result in some operations silently producing zero tensors. You could try using a different `-ts` to see if perplexity reacts to it; if it does, you'd know it's a tensor shape issue (and can file a dedicated bug).
2) The codebase in ggllm.cpp (an optimized ggml/CUDA fork of an older version) did not suffer from the same performance degradation; it was maybe 1.5 times slower in multi-GPU than in single GPU (still bad, but not 5-10 times slower). There have been a lot of changes since in how synchronization is done and how CUDA is driven: single-GPU speed improved but multi-GPU speed dropped.
3) I recall analyzing how broadcast CUDA operations behave: each tensor calculation would involve thousands of loop iterations until finished, each of which had a GPU synchronization call when using tensor split. I'm sure that could be improved by tackling the operation differently. The simple solution I suggested (layer split) would replace those tens of thousands of synchronizations with one memory copy at the end of the layer, though I don't know what the performance end result would be.
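A toy back-of-the-envelope model of the argument above (hypothetical counts, not llama.cpp's actual code): assume tensor split pays one device synchronization per broadcast operation per layer, while layer split only pays one inter-device copy at each device boundary.

```python
# Toy cost model: count synchronization points per forward pass under the
# two splitting schemes described above. All numbers are illustrative.

def tensor_split_syncs(n_layers: int, ops_per_layer: int) -> int:
    # Tensor split: every broadcast operation in every layer requires the
    # GPUs to synchronize before the next operation can start.
    return n_layers * ops_per_layer

def layer_split_syncs(n_gpus: int) -> int:
    # Layer split: GPUs only hand off activations where consecutive layer
    # blocks live on different devices, i.e. (n_gpus - 1) copies total.
    return n_gpus - 1

# e.g. a 70b-class model: 80 layers, some number of synced ops per layer
print(tensor_split_syncs(n_layers=80, ops_per_layer=9))  # 720 syncs
print(layer_split_syncs(n_gpus=2))                       # 1 copy
```

The absolute numbers are made up; the point is only that the sync count scales with ops-per-layer times layers in one scheme and with device count in the other.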
I think, given the high-quality state of llama.cpp and considering that new models like Llama 2 70B and Falcon 180B are open for our use, it is quite important to get multi-GPU working better and close the performance gap to Python.
The case where they got the unexpected result was for single GPU, as far as I could see. That's what makes it so weird.
> Also the codebase in ggllm.cpp (which was an optimized ggml/cuda from an older version) did not suffer from the same performance degradation, it was maybe 1.5 times slower in multi-gpu than in single GPU (still bad but not 5-10 times slower) There have been a lot of changes in how synchronization is done, how cuda runs. The single GPU speed improved but the multi GPU speed lowered.
As I said before:
> Also: see #3814 and check whether that PR has unintentionally resulted in a performance regression for your hardware setup.
Regarding multi-GPU:
Regarding the ppl differences:
We need to understand what is going on there.
- @jezzarax could you do a CPU-only run for a few iterations to see if it matches either one of the GPU runs?
- Could someone else run a single-GPU ppl run for this model and post the results?
I can do both, got access to 1x node as well.
Would `-ngl 0` work as a CPU-only run, or should I rebuild from scratch without cuBLAS?
@jezzarax
> should I rebuild from scratch without cuBLAS?

You'd need to build without GPU support; prompt processing (which is all perplexity does) still uses the GPU even without any layers offloaded.
`export CUDA_VISIBLE_DEVICES="-1"` (bash) or `$env:CUDA_VISIBLE_DEVICES = "-1"` (PowerShell)
That should enumerate zero devices to the CUDA backend, so nothing could be initialized on or sent to a GPU.
Likely a bug was introduced in 4760e7cc0b68570d58f55e8dda469805d1759d0d
I made multiple runs over two commits and two quantisation levels. I used some commit from two-ish weeks ago and one from yesterday. It looks like there's something strange about f16; q8 results seem more consistent.
| GPUs | model | quantization | commit | perplexity | runtime (ms) |
|---|---|---|---|---|---|
| 0 | yi-6b | f16 | 2756c4fbffab097736d5116007872d86456a544a | 8.855 | 1300173.65 |
| 1 | yi-6b | f16 | 2756c4fbffab097736d5116007872d86456a544a | 8.8942 | |
| 2 | yi-6b | f16 | 2756c4fbffab097736d5116007872d86456a544a | 6.4202 | 206444.33 |
| CPU only build | yi-6b | f16 | 2756c4fbffab097736d5116007872d86456a544a | 6.4308 | 4693429.04 |
| 0 | yi-6b | q8 | 2756c4fbffab097736d5116007872d86456a544a | 7.509 | 693382.52 |
| 1 | yi-6b | q8 | 2756c4fbffab097736d5116007872d86456a544a | 7.508 | 92870.73 |
| 2 | yi-6b | q8 | 2756c4fbffab097736d5116007872d86456a544a | 7.5214 | 191602.34 |
| 1 | yi-6b | f16 | 6bb4908a17150b49373b5f977685b2e180a04f6f | 8.8942 | 73072.7 |
| 2 | yi-6b | f16 | 6bb4908a17150b49373b5f977685b2e180a04f6f | 6.4202 | 189718.74 |
| CPU only build | yi-6b | f16 | 6bb4908a17150b49373b5f977685b2e180a04f6f | 6.4308 | 4738153.19 |
| 0 | yi-6b | q8 | 6bb4908a17150b49373b5f977685b2e180a04f6f | 7.5152 | 4091137.73 |
| 1 | yi-6b | q8 | 6bb4908a17150b49373b5f977685b2e180a04f6f | 7.508 | 94022.87 |
| 2 | yi-6b | q8 | 6bb4908a17150b49373b5f977685b2e180a04f6f | 7.5215 | 186745.06 |
| CPU only build | yi-6b | q8 | 6bb4908a17150b49373b5f977685b2e180a04f6f | 7.5152 | 4089037.75 |
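Reading the f16 rows numerically: the multi-GPU result agrees with the CPU-only build to within a fraction of a percent, while the single-GPU result deviates by roughly a third, which supports the suspicion that the single-GPU f16 path is the broken one. A small check using the perplexity values from the table:

```python
# Compare the f16 perplexities reported in the table above.
ppl_single_gpu = 8.8942
ppl_multi_gpu = 6.4202
ppl_cpu_only = 6.4308  # reference: CPU-only build

def rel_diff(value: float, reference: float) -> float:
    """Relative difference of a value against a reference result."""
    return abs(value - reference) / reference

print(f"multi-GPU vs CPU:  {rel_diff(ppl_multi_gpu, ppl_cpu_only):.2%}")
print(f"single-GPU vs CPU: {rel_diff(ppl_single_gpu, ppl_cpu_only):.2%}")
```

The q8 rows show no such split, so whatever is wrong appears specific to the f16 path.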
I'm not able to run f16 for the current version of the code on bigger models for now due to https://github.com/ggerganov/llama.cpp/issues/3930#issuecomment-1810597868
If there are any other tests I can run on multi-A100 setup, happy to contribute.
I am not running batch but I obtain performance comparable to exllama on 3090s and the best multi-gpu P40 speeds.
It certainly beats transformers with accelerate or AutoGPTQ. I reach speeds similar to Metal for large models like Falcon with 2 or 3 P40s and 2x 3090.
I know that pipeline style approaches were tried with llama_inference_offload in the GPTQ original version. They did speed things up past the normal 2 or 3t/s that would come from using accelerate but nowhere near to this.
This is all using the MMQ kernels, though. The new batch kernels did not improve speeds, even on Ampere. Could the eventual Vulkan backend be faster than cuBLAS?
I am just really confused how people could call multi-GPU in llama.cpp "bad" compared to all the other options. The only time I get slowdowns is prompt processing, and I'm not aware how to use the KV-cache token swapping like koboldcpp does, or whether it exists here.
When 2400 tokens/second drops to 300 tokens/second despite using twice the processing hardware, while inferencing the same model, we have a problem that needs solving. That's almost an order of magnitude in performance lost by adding a second card. That was the reason I raised the topic: the inference speed on multi-GPU is far too slow when using fast GPUs.
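For concreteness, the slowdown factor implied by the throughput figures quoted above:

```python
# Slowdown factor from the prompt-processing rates quoted above.
single_gpu_tps = 2400.0  # tokens/s on one card
multi_gpu_tps = 300.0    # tokens/s once the second card is enabled

slowdown = single_gpu_tps / multi_gpu_tps
print(f"{slowdown:.0f}x slower")  # 8x, despite twice the hardware
```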
I didn't intend to trigger emotions when I used the term "bad" in my later comment, just to point to the problem.
It's not emotion, it's just my experience with it. Splitting a model over multiple GPUs will always lower performance compared to a single GPU with contiguous memory. Have you tried any other inference engines that do not drop so badly, and what was the ratio for 1 card vs 2?
It's not only about the performance drop. The numbers differ between single and multi-gpu runs, please check the table I've posted above. Producing correct results is crucial.
Problem: I am aware everyone has different results. In my case I am running llama.cpp on a 4090 primary and a 3090 secondary, so both are quite capable cards for LLMs. I am getting around 800% slowdowns when using both cards on the same model and settings (regardless of which model I tried); batch processing speed can go down from 2400 t/s to 200-300 t/s (8-10 times slower than on a single GPU). This happens as soon as any tiny bit of processing (`-ts`) is shifted to the 2nd card.
I assume it is a synchronization problem in the CUDA loops; I also assume the issue does not affect every combination of GPUs, especially if one GPU is significantly slower.
Suggestion: Add a parameter like `-layer-split`: when it is used, the tensors are not split up; instead the layers are split across the cards (using `-ls` instead of `-ts`). This means the calculations can all be computed without synchronization, on a single GPU at a time, at the highest possible performance of that GPU.
Caveat: In theory tensor split should boost performance, since both cards can process a split tensor at the same time, so it is the better solution. But currently that is so far from reality that the suggested layer split should significantly boost processing speed.
@JohannesGaessler what do you think ?