Open josephrocca opened 2 weeks ago
It is probably an NCCL issue. See https://github.com/vllm-project/vllm/issues/2826
Could you try this?
export NCCL_P2P_DISABLE=1
Yep! That fixes it. I haven't tested speed yet, but I'm guessing NCCL_P2P_DISABLE=1 will slow down inference? I.e. your suggestion was to help with debugging, rather than as an overall solution?
Also, should I try NCCL_P2P_DISABLE=1 with:
or likely different causes there?
I am not sure whether NCCL_P2P_DISABLE will slow down inference. I will do some research.
I don't think it is the cause of issue #1744. I need more time to investigate the root cause.
Hi @josephrocca, I hope https://github.com/NVIDIA/nccl/issues/631 helps answer the NCCL_P2P_DISABLE question.
Thanks! I tested and did not notice a significant performance reduction, so NCCL_P2P_DISABLE=1 seems like a good solution. I wonder if there is a way for LMDeploy to display a good error message that tells the user about NCCL_P2P_DISABLE=1 so they aren't caught by this issue?
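Until something like that exists, a pre-launch check in a wrapper script could print the hint. This is only a rough sketch: `warn_if_p2p_enabled` is a hypothetical helper and not part of LMDeploy, and the caller passes in the GPU count (for example from `nvidia-smi -L | wc -l`).

```shell
# Hypothetical pre-launch check; NOT part of LMDeploy.
# Usage: warn_if_p2p_enabled <gpu_count>
# Returns 1 and prints a hint to stderr when more than one GPU is visible
# and NCCL_P2P_DISABLE is not set to 1; returns 0 otherwise.
warn_if_p2p_enabled() {
  gpu_count="$1"
  if [ "$gpu_count" -gt 1 ] && [ "${NCCL_P2P_DISABLE:-0}" != "1" ]; then
    echo "warning: $gpu_count GPUs visible and NCCL_P2P_DISABLE is not 1." >&2
    echo "If requests hang at 100% GPU utilization, try: export NCCL_P2P_DISABLE=1" >&2
    return 1
  fi
  return 0
}

# Example call before starting the server (assumes nvidia-smi is installed):
#   warn_if_p2p_enabled "$(nvidia-smi -L | wc -l)"
```

The check only warns. It cannot tell whether a given machine is actually affected, so it does not set the variable itself.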
Checklist
Describe the bug
I used Runpod to test the current official Docker image (openmmlab/lmdeploy:v0.4.2) across several GPUs and host machine CUDA versions:
For all of the above machines, the server starts without any errors, but specifically for CUDA 12.3 on 4090s, the server receives requests and turbomind begins processing them, but it never responds, and the GPU stays at 100% utilization.
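Since the failure is a silent hang rather than an error, one way to detect it from a script is to run the test request under a deadline. A minimal sketch, assuming GNU coreutils `timeout` is available; `probe_with_deadline` is a hypothetical helper, and the request command is whatever you normally use to query the server.

```shell
# Hypothetical helper: run a command under a deadline and report whether it hung.
# Usage: probe_with_deadline <seconds> <command> [args...]
# Prints "hung" if the command exceeded the deadline (GNU timeout exits with 124),
# otherwise prints "exit:<status>" with the command's own exit status.
probe_with_deadline() {
  seconds="$1"
  shift
  status=0
  timeout "$seconds" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    echo "hung"
  else
    echo "exit:$status"
  fi
}
```

Pick a deadline comfortably longer than a normal response takes, so a slow but healthy server is not reported as hung.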
Reproduction
I used Runpod, chose the option to filter only CUDA 12.3 machines, and then created a 2x4090 machine with openmmlab/lmdeploy:v0.4.2. You can use bash -c "sleep infinity" as the CMD, and then SSH in and run:
Here are the output logs. You can Ctrl+F for "Once upon a" to see the logs for the server receiving the API request:
❌ https://gist.github.com/josephrocca/e7a3c2e469c64226c002a25faf4e7284
And here's nvidia-smi for the machine that generated those logs:
And here are the logs for the exact same setup, except I used a CUDA 12.4 machine, which works correctly:
✅ https://gist.github.com/josephrocca/54ceebda5feced25614c09535e1a14aa
Environment
Error traceback
Same as linked above: