PaddlePaddle / Paddle

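PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)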
http://www.paddlepaddle.org/
Apache License 2.0

install_check.py:265 - PaddlePaddle meets some problem with 2 GPUs. #63386

Open TingquanGao opened 2 weeks ago

TingquanGao commented 2 weeks ago

Describe the Bug

The GPU driver, CUDA, cuDNN, and NCCL are all the matching versions for 12.0.

Running verify PaddlePaddle program ...
I0409 16:04:35.709347 7944 interpretercore.cc:237] New Executor is Running.
W0409 16:04:35.710853 7944 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0iver API Version: 12.0, Runtime API Version: 12.0
W0409 16:04:35.711491 7944 gpu_resources.cc:149] device: 0, cuDNN Version: 90.0.
I0409 16:04:36.996335 7944 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
Running verify PaddlePaddle program ...
Running verify PaddlePaddle program ...
I0409 16:04:41.629110 8023 interpretercore.cc:237] New Executor is Running.
W0409 16:04:41.630621 8023 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0iver API Version: 12.0, Runtime API Version: 12.0
W0409 16:04:41.631137 8023 gpu_resources.cc:149] device: 0, cuDNN Version: 90.0.
I0409 16:04:41.951776 8024 interpretercore.cc:237] New Executor is Running.
W0409 16:04:41.953104 8024 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0iver API Version: 12.0, Runtime API Version: 12.0
W0409 16:04:41.953649 8024 gpu_resources.cc:149] device: 0, cuDNN Version: 90.0.
I0409 16:04:42.958331 8023 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
[2024-04-09 16:04:42,967] [ WARNING] install_check.py:265 - PaddlePaddle meets some problem with 2 GPThis may be caused by:

  1. There is not enough GPUs visible on your system
  2. Some GPUs are occupied by other process now
  3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://githom/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-g/index.html

[2024-04-09 16:04:42,967] [ WARNING] install_check.py:275 - Original Error is: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:
    
        if __name__ == '__main__':
            freeze_support()
            ...
    
    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddow.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/test.py", line 2, in <module>
    paddle.utils.run_check()
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 282, in run_chec
    raise e
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 255, in run_chec
    _run_parallel(device_list)
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 206, in _run_parl
    paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 585, in spawn
    process.start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:
    
        if __name__ == '__main__':
            freeze_support()
            ...
    
    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

I0409 16:04:43.284149 8024 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
[2024-04-09 16:04:43,293] [ WARNING] install_check.py:265 - PaddlePaddle meets some problem with 2 GPThis may be caused by:

  1. There is not enough GPUs visible on your system
  2. Some GPUs are occupied by other process now
  3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://githom/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-g/index.html

[2024-04-09 16:04:43,293] [ WARNING] install_check.py:275 - Original Error is: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:
    
        if __name__ == '__main__':
            freeze_support()
            ...
    
    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddow.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/test.py", line 2, in <module>
    paddle.utils.run_check()
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 282, in run_chec
    raise e
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 255, in run_chec
    _run_parallel(device_list)
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 206, in _run_parl
    paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 585, in spawn
    process.start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:
    
        if __name__ == '__main__':
            freeze_support()
            ...
    
    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

C++ Traceback (most recent call last):

No stack trace in paddle, may be caused by external reasons.


Error Message Summary:

FatalError: Termination signal is detected by the operating system.
[TimeInfo: Aborted at 1712649883 (unix time) try "date -d @1712649883" if you are using GNU dat]
[SignalInfo: ** SIGTERM (@0x1f08) received by PID 8024 (TID 0x7f27c9971480) from PID 7944 ]

[2024-04-09 16:04:43,870] [ WARNING] install_check.py:265 - PaddlePaddle meets some problem with 2 GPThis may be caused by:

  1. There is not enough GPUs visible on your system
  2. Some GPUs are occupied by other process now
  3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://githom/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-g/index.html

[2024-04-09 16:04:43,870] [ WARNING] install_check.py:275 - Original Error is: Process 0 terminated with exit code 1.
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddow.
Traceback (most recent call last):
  File "/data/test.py", line 2, in <module>
    paddle.utils.run_check()
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 282, in run_chec
    raise e
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 255, in run_chec
    _run_parallel(device_list)
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 206, in _run_parl
    paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 595, in spawn
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 399, in join
    self._throw_exception(error_index)
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 411, in _throw_excon
    raise Exception(
Exception: Process 0 terminated with exit code 1.

Additional Supplementary Information

See https://github.com/PaddlePaddle/PaddleOCR/issues/11901

tianshuo78520a commented 2 weeks ago

Are you running paddle.utils.run_check() in your own machine environment? Have you tried checking whether paddle.utils.run_check() works inside the Paddle image? Commands:

  1. nvidia-docker run --name paddle -it -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 /bin/bash
  2. paddle.utils.run_check()
q465414859 commented 2 weeks ago

It was run inside Docker. You can see from my commands above that I entered the container.

tianshuo78520a commented 2 weeks ago

Which Docker image are you using? Is it the official one? Multi-GPU fails; does single-GPU work?

q465414859 commented 2 weeks ago

The official 12.0 image. Multi-GPU fails.

q465414859 commented 2 weeks ago

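The post is mine. I don't know whether the official team can help solve this problem. If so, please add me on WeChat: 18641059137. If that is not convenient, please reply here anyway. Thank you very much.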

tianshuo78520a commented 2 weeks ago

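Adding you on WeChat is not convenient, but I can keep helping here in the issue. Please answer a few more questions:

Question 1: Which image are you using? registry.baidubce.com/paddlepaddle/paddle:2.5.2-gpu-cuda12.0-cudnn8.9-trt8.6
Question 2: What command did you use to start the container?
Question 3: Run nvidia-smi on the physical host and inside the container, and share both outputs.
Question 4: Inside the container, run export CUDA_VISIBLE_DEVICES=0 and then paddle.utils.run_check(); then run export CUDA_VISIBLE_DEVICES=0,1 and run run_check on multiple GPUs. Please share screenshots.
Question 5: Check the disks with df -h. We have previously seen multi-GPU failures caused by /dev/shm being full.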
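
The single-GPU versus multi-GPU comparison requested in Question 4 can also be scripted. The following is a minimal sketch, not part of the original thread: it assumes the check lives in a script named test.py (the name used in the reporter's logs) in the current directory, and the helper names env_with_gpus and run_with_gpus are hypothetical.

```python
import os
import subprocess
import sys


def env_with_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_with_gpus(script, gpu_ids):
    """Run `script` in a fresh interpreter that only sees the given GPUs."""
    return subprocess.run(
        [sys.executable, script],
        env=env_with_gpus(gpu_ids),
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    script = "test.py"  # assumed name, taken from the logs in this thread
    if os.path.exists(script):
        for ids in ([0], [0, 1]):
            result = run_with_gpus(script, ids)
            print("GPUs", ids, "-> exit code", result.returncode)
            # Show only the tail of stderr, where the warning usually appears.
            print(result.stderr[-2000:])
    else:
        print(script, "not found; nothing to run")
```

Running each case in its own interpreter keeps the two environments from leaking into each other, which is the same effect as the separate export commands.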

q465414859 commented 2 weeks ago

λ 179ff181fd82 /home export CUDA_VISIBLE_DEVICES=0
λ 179ff181fd82 /home python test.py
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
I0417 03:32:33.312393 143 program_interpreter.cc:212] New Executor is Running.
W0417 03:32:33.312819 143 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0417 03:32:33.313778 143 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0417 03:32:35.520290 143 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
λ 179ff181fd82 /home export CUDA_VISIBLE_DEVICES=0,1
λ 179ff181fd82 /home python test.py
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
I0417 03:33:09.172655 217 program_interpreter.cc:212] New Executor is Running.
W0417 03:33:09.173103 217 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0417 03:33:09.174027 217 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0417 03:33:11.487715 217 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
grep: grep: warning: GREP_OPTIONS is deprecated; please use an alias or scriptwarning: GREP_OPTIONS is deprecated; please use an alias or script

Running verify PaddlePaddle program ...
Running verify PaddlePaddle program ...
I0417 03:33:15.548132 292 program_interpreter.cc:212] New Executor is Running.
W0417 03:33:15.548656 292 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0417 03:33:15.549595 292 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0417 03:33:15.611618 293 program_interpreter.cc:212] New Executor is Running.
W0417 03:33:15.612135 293 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0417 03:33:15.613081 293 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0417 03:33:17.837189 292 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
[2024-04-17 03:33:17,847] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:

  1. There is not enough GPUs visible on your system
  2. Some GPUs are occupied by other process now
  3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html

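Thank you for the prompt reply.

Question 1: nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6
Question 2: nvidia-docker run --name paddle -it -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 /bin/bash
Question 3: nvidia-smi reports CUDA 11.4 on the physical host and 12.0 in the virtual machine.
Question 4: With CUDA_VISIBLE_DEVICES=0 it runs fine. With CUDA_VISIBLE_DEVICES=0,1 it fails; the error output is shown above.
Question 5: Disk space is fine.

I can share my test machine for debugging.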

tianshuo78520a commented 2 weeks ago

Could you try adding the --shm-size=128G parameter to docker run?
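
Before and after changing --shm-size, it can help to confirm how large /dev/shm actually is inside the container. The sketch below uses only the Python standard library; the function name shm_report is hypothetical, and the default path /dev/shm is an assumption about a Linux container.

```python
import os
import shutil


def shm_report(path="/dev/shm"):
    """Return size and usage of the filesystem at `path`, or None if absent."""
    if not os.path.exists(path):
        return None
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "path": path,
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
        # Guard against a zero-sized filesystem to avoid division by zero.
        "percent_used": 100.0 * usage.used / usage.total if usage.total else 0.0,
    }


if __name__ == "__main__":
    report = shm_report()
    if report is None:
        print("/dev/shm not found on this system")
    else:
        print(
            "{path}: {total_gib:.1f} GiB total, {free_gib:.1f} GiB free, "
            "{percent_used:.0f}% used".format(**report)
        )
```

The same function can be pointed at "/" to check the root filesystem, which reports the same numbers as df -h in a form that is easy to log.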

q465414859 commented 1 week ago

Filesystem      Size  Used Avail Use% Mounted on
tmpfs            38G  1.6M   38G   1% /run
/dev/vda1        49G   48G     0 100% /
tmpfs           189G     0  189G   0% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs            38G  4.0K   38G   1% /run/user/0

The disk space here is fine, right?

XieYunshen commented 1 week ago

Could you check whether one of the GPUs is occupied? Could you share screenshots of nvidia-smi run on the physical host and inside the Docker container?

q465414859 commented 1 week ago

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Inside Docker

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          On   | 00000000:00:0D.0 Off |                    0 |
| N/A   38C    P0    28W / 165W |      0MiB / 24258MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A30          On   | 00000000:00:0E.0 Off |                    0 |
| N/A   39C    P0    33W / 165W |      0MiB / 24258MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Host

q465414859 commented 1 week ago

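This time I moved the Docker image storage location so the disk would not fill up, downloaded the 12.0 image again, and added the --shm-size=128G parameter to the run command. The problem is still the same. nvidia-smi looks normal, and no process is using the GPUs on either the host or in Docker.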

tianshuo78520a commented 1 week ago

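Driver Version: 470.182.03. I am not sure whether this is the cause, but could the driver version be too old? Please try a lower-version image (registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0) and check whether paddle.utils.run_check() works.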

q465414859 commented 1 week ago

Driver Version: 470.182.03 is the driver version on the host, outside the container. How would it be related to this?

tianshuo78520a commented 1 week ago

It may be related. The image contains only CUDA, not the driver. Please test whether the lower-version image works.
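
To make the driver question concrete, the versions can be pulled out of the nvidia-smi banner and compared with the CUDA runtime version of the image. This is a rough sketch under stated assumptions: the function names are hypothetical, the banner string is the one posted in this thread, and the simple "driver CUDA version >= runtime version" rule ignores NVIDIA's compatibility packages, so a False result is a hint to investigate rather than proof of a problem. The thread itself does not establish that the driver was the cause.

```python
import re


def parse_smi_header(text):
    """Extract (driver_version, cuda_version) strings from nvidia-smi output."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return (
        driver.group(1) if driver else None,
        cuda.group(1) if cuda else None,
    )


def version_tuple(version):
    """Turn '11.4' into (11, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_covers_runtime(driver_cuda, runtime_cuda):
    """True if the CUDA version reported by the driver is >= the runtime's."""
    return version_tuple(driver_cuda) >= version_tuple(runtime_cuda)


if __name__ == "__main__":
    banner = "| NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.4 |"
    driver, cuda = parse_smi_header(banner)
    print("driver:", driver, "driver CUDA:", cuda)
    # 12.0 and 11.2 are the runtime versions of the two images tried here.
    for runtime in ("12.0", "11.2"):
        print("runtime", runtime, "covered:", driver_covers_runtime(cuda, runtime))
```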

q465414859 commented 1 week ago

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          On   | 00000000:00:0D.0 Off |                    0 |
| N/A   38C    P0    28W / 165W |      0MiB / 24258MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A30          On   | 00000000:00:0E.0 Off |                    0 |
| N/A   39C    P0    33W / 165W |      0MiB / 24258MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Switching to the 11.2 version does not work either. It still fails with the same error.

root@ecs-1cf4:~# nvidia-docker run --name paddle -it -v /dev/shm/paddle:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0 /bin/bash

q465414859 commented 1 week ago

It still does not work.

q465414859 commented 1 week ago

Would you like to take a look on my test server?

tianshuo78520a commented 1 week ago

How do I log in?

q465414859 commented 1 week ago

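I can give you the SSH address, account, and password, but posting them here does not seem appropriate. How about adding me on WeChat?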

q465414859 commented 1 week ago

Hello, are you busy?

q465414859 commented 1 week ago

Hello, the project on our side is still waiting on this. Please, I would really appreciate your help!

tianshuo78520a commented 1 week ago

Calling it directly at the top level of a Python script fails:

    import paddle
    paddle.utils.run_check()

Correct usage:

    if __name__ == '__main__':
        import paddle
        paddle.utils.run_check()
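
Why the `if __name__ == '__main__':` guard matters can be reproduced without Paddle or GPUs. The tracebacks earlier in this thread pass through multiprocessing/popen_spawn_posix.py, which suggests the worker processes are started with the spawn method; under spawn, each child re-imports the main script, so unguarded top-level code tries to start processes again while the child is still bootstrapping. The sketch below is an illustration using only the standard library, not Paddle's own code; the script bodies and the helper run_script are hypothetical.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Starts a child process at import time: the spawned child re-runs this code.
UNGUARDED = textwrap.dedent("""
    import multiprocessing as mp

    def work():
        print("child ran")

    ctx = mp.get_context("spawn")
    p = ctx.Process(target=work)
    p.start()
    p.join()
""")

# Same logic, but only the original process reaches the start() call.
GUARDED = textwrap.dedent("""
    import multiprocessing as mp

    def work():
        print("child ran")

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")
        p = ctx.Process(target=work)
        p.start()
        p.join()
""")


def run_script(source):
    """Write `source` to a temporary file and run it in a fresh interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "demo.py"
        path.write_text(source)
        return subprocess.run(
            [sys.executable, str(path)],
            capture_output=True,
            text=True,
            timeout=120,
        )


if __name__ == "__main__":
    for name, source in (("unguarded", UNGUARDED), ("guarded", GUARDED)):
        result = run_script(source)
        failed = "bootstrapping phase" in result.stderr
        print(name, "->", "RuntimeError in child" if failed else "ok")
```

The unguarded script produces the same "bootstrapping phase" RuntimeError seen in the report, while the guarded one lets the child run normally.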