InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Qwen/Qwen2-VL-7B-Instruct with --tp 2 crashes the Docker container immediately; it runs fine without --tp. #2590

Open wangaocheng opened 2 weeks ago

wangaocheng commented 2 weeks ago


Describe the bug

[Screenshot: Snipaste_2024-10-12_01-56-53]

I ran into an even stranger problem. I installed lmdeploy from the official Docker image openmmlab/lmdeploy:latest. When I run `lmdeploy serve api_server Qwen/Qwen2-VL-7B-Instruct --server-port 6001 --tp 2`, the Docker container exits immediately.

Reproduction

lmdeploy serve api_server Qwen/Qwen2-VL-7B-Instruct --server-port 6001 --tp 2

Environment

Docker on Windows 11
GPU: NVIDIA GeForce RTX 4090 x 2
RAM: 128GB

root@1298834253c5:/opt/lmdeploy# lmdeploy check_env
sys.platform: linux
Python: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.4.0+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 90.1  (built against CUDA 12.4)
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.19.0+cu121
LMDeploy: 0.6.1+64c5084
transformers: 4.46.0.dev0
gradio: 5.0.2
fastapi: 0.115.0
pydantic: 2.9.2
triton: 3.0.0
NVIDIA Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS                             N/A
GPU1    SYS      X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error traceback

No response

irexyc commented 2 weeks ago

We don't have a multi-GPU Windows machine to test on. We have long suspected whether NCCL can run inside WSL at all; from this it now looks like it cannot.
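
One way to probe the NCCL suspicion directly is a minimal all_reduce smoke test (a sketch, assuming PyTorch with CUDA and both GPUs visible inside WSL/Docker; the script name is hypothetical):

```python
# nccl_check.py -- run with: torchrun --nproc_per_node=2 nccl_check.py
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK / WORLD_SIZE / MASTER_ADDR, so env:// init works as-is.
    dist.init_process_group(backend='nccl')
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    # Each rank contributes a ones tensor; a working NCCL all_reduce sums them.
    x = torch.ones(1, device='cuda')
    dist.all_reduce(x)
    print(f'rank {rank}: all_reduce -> {x.item()} (expected {dist.get_world_size()})')
    dist.destroy_process_group()

if __name__ == '__main__':
    main()
```

If this hangs or crashes the container, the problem is NCCL under WSL rather than anything specific to lmdeploy or Qwen2-VL.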

wangaocheng commented 2 weeks ago

> We don't have a multi-GPU Windows machine to test on. We have long suspected whether NCCL can run inside WSL at all; from this it now looks like it cannot.

Other models work fine with --tp 2, for example Qwen2.5-14B, but Qwen2-VL has this problem.

irexyc commented 2 weeks ago

@wangaocheng

Did the other models also run in the NCCL environment? Are they VL models? Are both backends fine, the pytorch backend and the turbomind backend? (qwen2-vl uses the pytorch backend)
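
For reference, a minimal sketch of how each backend could be pinned explicitly with tp=2 from the Python API, to isolate which engine fails (model names are illustrative, not a confirmed fix):

```python
from lmdeploy import pipeline, PytorchEngineConfig, TurbomindEngineConfig

# Qwen2-VL runs on the PyTorch engine; request it explicitly with tensor parallel = 2.
pipe = pipeline('Qwen/Qwen2-VL-7B-Instruct',
                backend_config=PytorchEngineConfig(tp=2))

# For comparison, a TurboMind-backed model with the same tp setting, e.g.:
# pipe = pipeline('Qwen/Qwen2.5-14B-Instruct',
#                 backend_config=TurbomindEngineConfig(tp=2))
```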

github-actions[bot] commented 1 week ago

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.

xieyabinfuwu commented 3 days ago

@irexyc Qwen2-VL-7B-Instruct with the pytorch backend and tp=2 fails with the error below; tp=1 works fine.
Environment:

(lmdeploy) root@topnet:/data/models/llm# lmdeploy check_env
sys.platform: linux
Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda-11.8
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 10.5.0-1ubuntu1~22.04) 10.5.0
PyTorch: 2.3.1+cu121
PyTorch compiling details: PyTorch built with:

TorchVision: 0.18.1+cu121
LMDeploy: 0.6.1+
transformers: 4.45.2
gradio: Not Found
fastapi: 0.95.1
pydantic: 1.10.18
triton: 2.3.1
NVIDIA Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-31            0               N/A
GPU1    PHB      X      0-31            0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Startup error:

(lmdeploy) root@topnet:/data/project/topdp-serve-ocr/llm# python main.py

Qwen2VLRotaryEmbedding can now be fully parameterized by passing the model config through the config argument. All other arguments will be removed in v4.46

Qwen2VLRotaryEmbedding can now be fully parameterized by passing the model config through the config argument. All other arguments will be removed in v4.46
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/project/topdp-serve-ocr/llm/main.py", line 8, in <module>
    from app.controller.information_extraction_qwen2vl import information_extraction_router
  File "/data/project/topdp-serve-ocr/llm/app/controller/information_extraction_qwen2vl.py", line 43, in <module>
    pipe = pipeline(model_path,
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/api.py", line 81, in pipeline
    return pipeline_class(model_path,
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 27, in __init__
    super().__init__(model_path, **kwargs)
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 162, in __init__
    self._build_pytorch(model_path=model_path,
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 210, in _build_pytorch
    self.engine = Engine(model_path=model_path,
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 147, in __init__
    self.model_agent = build_model_agent(
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 778, in build_model_agent
    model_agent = TPModelAgent(model_path,
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 562, in __init__
    self._start_sub_process(model_path,
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 597, in _start_sub_process
    self.mp_context = mp.spawn(
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 228, in start_processes
    process.start()
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
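
The RuntimeError above is the standard CPython message for a missing main guard: the PyTorch engine starts its tensor-parallel workers with the "spawn" start method, which re-imports the entry script, so building the pipeline at import time in main.py triggers it. A minimal sketch of a guarded entry point (assuming the lmdeploy Python API with PytorchEngineConfig; the model path is illustrative, not taken from the actual main.py):

```python
from lmdeploy import pipeline, PytorchEngineConfig

def build_pipe():
    # Building the pipeline with tp=2 spawns worker processes, so it must not
    # run at module import time when the "spawn" start method is used.
    return pipeline('Qwen/Qwen2-VL-7B-Instruct',
                    backend_config=PytorchEngineConfig(tp=2))

if __name__ == '__main__':
    pipe = build_pipe()
    print(pipe('Describe what Qwen2-VL can do.'))
```

Whether the original WSL/Docker crash with --tp 2 has the same root cause is a separate question; this only addresses the spawn error shown in this traceback.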