InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Official image doesn't work for 4090 on CUDA 12.3 (but works for all other CUDA versions, and works for 12.3 on other GPU types) #1750

Open josephrocca opened 2 weeks ago

josephrocca commented 2 weeks ago


Describe the bug

I used Runpod to test the current official Docker image (openmmlab/lmdeploy:v0.4.2) across several GPUs and host machine CUDA versions:

✅ GPU: L40     Driver Version: 525.116.04     CUDA Version: 12.0 
✅ GPU: 2x3090  Driver Version: 525.85.12      CUDA Version: 12.0
✅ GPU: 2x4090  Driver Version: 525.125.06     CUDA Version: 12.0
❌ GPU: 2x4090  Driver Version: 545.29.06      CUDA Version: 12.3
❌ GPU: 2x4090  Driver Version: 545.23.08      CUDA Version: 12.3
✅ GPU: A30     Driver Version: 545.23.08      CUDA Version: 12.3
✅ GPU: L40     Driver Version: 545.23.08      CUDA Version: 12.3
✅ GPU: A6000   Driver Version: 550.54.15      CUDA Version: 12.4
✅ GPU: 2x4090  Driver Version: 550.54.15      CUDA Version: 12.4

On all of the above machines the server starts without any errors, but specifically for CUDA 12.3 on 4090s it never responds to requests: the server receives a request, turbomind begins processing it, but no response is ever returned and the GPU stays at 100% utilization.

Reproduction

I used Runpod, chose the option to filter to only CUDA 12.3 machines, and created a 2x4090 machine with openmmlab/lmdeploy:v0.4.2. You can use bash -c "sleep infinity" as the CMD, then SSH in and run:

huggingface-cli download lmdeploy/llama2-chat-70b-4bit --local-dir /root/llama2-chat-70b-4bit
lmdeploy convert llama2 /root/llama2-chat-70b-4bit --model-format awq --group-size 128 --tp 2 --dst-path /root/turbomind-model-files
lmdeploy serve api_server /root/turbomind-model-files --server-port 3000 --tp 2 --session-len 4096 --model-format awq --model-name lmdeploy/llama2-chat-70b-4bit --enable-prefix-caching --quant-policy 4 --log-level DEBUG
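
To trigger the hang, it is enough to send any completion request to the server. A minimal sketch, assuming the OpenAI-compatible /v1/completions route that api_server exposes, and reusing the "Once upon a" prompt that appears in the logs below:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "lmdeploy/llama2-chat-70b-4bit", "prompt": "Once upon a", "max_tokens": 64}'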

Here are the output logs - you can Ctrl+F for "Once upon a" to see the logs for the server receiving the API request:

https://gist.github.com/josephrocca/e7a3c2e469c64226c002a25faf4e7284

And here's nvidia-smi for the machine that generated those logs:

NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3

And here are the logs for the exact same setup, except I used a CUDA 12.4 machine, which works correctly:

https://gist.github.com/josephrocca/54ceebda5feced25614c09535e1a14aa

Environment

sys.platform: linux
Python: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.16.0+cu118
LMDeploy: 0.4.2+54b7230
transformers: 4.41.1
gradio: 3.50.2
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.1.0

Error traceback

Same as linked above.

lvhan028 commented 2 weeks ago

It is probably an NCCL issue: https://github.com/vllm-project/vllm/issues/2826

Could you try this?

export NCCL_P2P_DISABLE=1
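
For example (a minimal sketch reusing the serve command from the reproduction above; when launching the container directly it could instead be passed with docker run -e NCCL_P2P_DISABLE=1):

export NCCL_P2P_DISABLE=1
lmdeploy serve api_server /root/turbomind-model-files --server-port 3000 --tp 2 --session-len 4096 --model-format awq --model-name lmdeploy/llama2-chat-70b-4bit --enable-prefix-caching --quant-policy 4 --log-level DEBUG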
josephrocca commented 2 weeks ago

Yep! That fixes it. I haven't tested speed yet, but I'm guessing NCCL_P2P_DISABLE=1 will slow down inference? I.e. was your suggestion meant to help with debugging rather than as an overall solution?

Also, should I try NCCL_P2P_DISABLE=1 with the other issues I reported, or are those likely to have different causes?

lvhan028 commented 2 weeks ago

I am not sure whether NCCL_P2P_DISABLE will slow down inference; I will look into it. I don't think it is the reason for issue #1744. I need more time to investigate the root cause.

lvhan028 commented 2 weeks ago

Hi @josephrocca, I hope https://github.com/NVIDIA/nccl/issues/631 helps answer the NCCL_P2P_DISABLE question.
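
For anyone who wants to confirm the peer-to-peer situation on their own machine, the GPU topology and a PyTorch-level peer-access check can be printed like this (a small sketch; torch.cuda.can_device_access_peer should reflect cudaDeviceCanAccessPeer for the given pair):

nvidia-smi topo -m
python3 -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"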

josephrocca commented 2 weeks ago

Thanks! I tested and did not notice a significant performance reduction, so NCCL_P2P_DISABLE=1 seems like a good solution. I wonder if there is a way for LMDeploy to display a helpful error message that tells users about NCCL_P2P_DISABLE=1 so they aren't caught out by this issue?
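
As a rough illustration of the kind of startup hint being suggested (purely hypothetical, not existing LMDeploy behaviour): when more than one GPU is used and the cards are GeForce-class (e.g. RTX 4090s, which per the NCCL issue above do not support P2P), print a pointer to NCCL_P2P_DISABLE=1:

python3 - <<'EOF'
import torch

# Hypothetical sketch of a startup hint: on multi-GPU GeForce setups,
# point the user at NCCL_P2P_DISABLE=1 in case inference hangs.
names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
if len(names) > 1 and any("GeForce" in n for n in names):
    print("Hint: if multi-GPU inference hangs, try setting NCCL_P2P_DISABLE=1 "
          f"(GPUs detected: {', '.join(names)})")
EOF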