install_check.py:265 - PaddlePaddle meets some problem with 2 GPUs.

TingquanGao commented 2 weeks ago

Describe the Bug

The GPU driver, CUDA, cuDNN, and NCCL are all the versions corresponding to 12.0.

System Environment: Ubuntu 22.04
Version: Paddle: PaddleOCR: Related components: paddlepaddle-gpu 2.5.2.post120
Command Code: paddle.utils.run_check()
Complete Error Message:

Running verify PaddlePaddle program ...
I0409 16:04:35.709347 7944 interpretercore.cc:237] New Executor is Running.
W0409 16:04:35.710853 7944 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0409 16:04:35.711491 7944 gpu_resources.cc:149] device: 0, cuDNN Version: 90.0.
I0409 16:04:36.996335 7944 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
Running verify PaddlePaddle program ...
Running verify PaddlePaddle program ...
I0409 16:04:41.629110 8023 interpretercore.cc:237] New Executor is Running.
W0409 16:04:41.630621 8023 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0409 16:04:41.631137 8023 gpu_resources.cc:149] device: 0, cuDNN Version: 90.0.
I0409 16:04:41.951776 8024 interpretercore.cc:237] New Executor is Running.
W0409 16:04:41.953104 8024 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0409 16:04:41.953649 8024 gpu_resources.cc:149] device: 0, cuDNN Version: 90.0.
I0409 16:04:42.958331 8023 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
[2024-04-09 16:04:42,967] [ WARNING] install_check.py:265 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:

There is not enough GPUs visible on your system
Some GPUs are occupied by other process now
NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-04-09 16:04:42,967] [ WARNING] install_check.py:275 - Original Error is: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.
```
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

    if __name__ == '__main__':
        freeze_support()
        ...

The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
```
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/test.py", line 2, in <module>
    paddle.utils.run_check()
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 282, in run_check
    raise e
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 255, in run_check
    _run_parallel(device_list)
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 206, in _run_parallel
    paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 585, in spawn
    process.start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.
```
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

    if __name__ == '__main__':
        freeze_support()
        ...

The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
```
I0409 16:04:43.284149 8024 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
[2024-04-09 16:04:43,293] [ WARNING] install_check.py:265 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:

There is not enough GPUs visible on your system
Some GPUs are occupied by other process now
NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-04-09 16:04:43,293] [ WARNING] install_check.py:275 - Original Error is: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.
```
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

    if __name__ == '__main__':
        freeze_support()
        ...

The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
```
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/test.py", line 2, in <module>
    paddle.utils.run_check()
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 282, in run_check
    raise e
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 255, in run_check
    _run_parallel(device_list)
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 206, in _run_parallel
    paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 585, in spawn
    process.start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.
```
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

    if __name__ == '__main__':
        freeze_support()
        ...

The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
```

C++ Traceback (most recent call last):

No stack trace in paddle, may be caused by external reasons.

Error Message Summary:

FatalError: Termination signal is detected by the operating system.
[TimeInfo: Aborted at 1712649883 (unix time) try "date -d @1712649883" if you are using GNU date]
[SignalInfo: ** SIGTERM (@0x1f08) received by PID 8024 (TID 0x7f27c9971480) from PID 7944 ]

[2024-04-09 16:04:43,870] [ WARNING] install_check.py:265 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:

There is not enough GPUs visible on your system
Some GPUs are occupied by other process now
NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-04-09 16:04:43,870] [ WARNING] install_check.py:275 - Original Error is: Process 0 terminated with exit code 1.
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
  File "/data/test.py", line 2, in <module>
    paddle.utils.run_check()
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 282, in run_check
    raise e
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 255, in run_check
    _run_parallel(device_list)
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 206, in _run_parallel
    paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 595, in spawn
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 399, in join
    self._throw_exception(error_index)
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 411, in _throw_exception
    raise Exception(
Exception: Process 0 terminated with exit code 1.

Additional Supplementary Information

Refer to https://github.com/PaddlePaddle/PaddleOCR/issues/11901

tianshuo78520a commented 2 weeks ago

Are you running paddle.utils.run_check() in your own machine environment? Have you tried checking whether paddle.utils.run_check() works inside the Paddle image? Command:

nvidia-docker run --name paddle -it -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 /bin/bash
paddle.utils.run_check()

q465414859 commented 2 weeks ago

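> Are you running paddle.utils.run_check() in your own machine environment? Have you tried checking whether paddle.utils.run_check() works inside the Paddle image? Command:
>
> nvidia-docker run --name paddle -it -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 /bin/bash
>
> paddle.utils.run_check()

I ran it inside Docker. You can see from my command above that I entered the Docker container.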

tianshuo78520a commented 2 weeks ago

Which Docker image are you using? Is it the officially provided one? Multi-GPU has a problem; does single-GPU work normally?

q465414859 commented 2 weeks ago

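> Which Docker image are you using? Is it the officially provided one? Multi-GPU has a problem; does single-GPU work normally?

The official 12.0 image has the problem with multiple GPUs.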

q465414859 commented 2 weeks ago

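This is my post. I don't know whether the official team can help solve this problem. If so, please add me on WeChat: 18641059137. If that is not convenient, please reply here as well. Thank you very much.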

tianshuo78520a commented 2 weeks ago

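Adding you on WeChat is not convenient, but I can keep helping in this issue. Please answer a few more questions:
Question 1: Which image are you using? registry.baidubce.com/paddlepaddle/paddle:2.5.2-gpu-cuda12.0-cudnn8.9-trt8.6
Question 2: What command did you use to start the image?
Question 3: Run nvidia-smi on the physical machine and inside the container and share the results.
Question 4: Inside the container, run export CUDA_VISIBLE_DEVICES=0 and then paddle.utils.run_check(); then run export CUDA_VISIBLE_DEVICES=0,1 and run run_check on multiple GPUs. Please share screenshots.
Question 5: Run df -h to check the disks. We have previously seen multi-GPU failures caused by /dev/shm being full.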
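The single-GPU versus multi-GPU comparison in Question 4 can also be scripted so both runs are captured the same way. The following is a minimal sketch, assuming `test.py` (the script name that appears in the traceback above) is the file that calls `paddle.utils.run_check()`; the helper name `run_with_visible_devices` is hypothetical.

```python
import os
import subprocess
import sys


def run_with_visible_devices(devices, command):
    """Run `command` with CUDA_VISIBLE_DEVICES set to `devices`.

    Returns the exit code and the combined stdout/stderr text.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=devices)
    result = subprocess.run(command, env=env, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr


if __name__ == "__main__":
    # Assumption: test.py is the script that calls paddle.utils.run_check().
    for devices in ("0", "0,1"):
        code, output = run_with_visible_devices(devices, [sys.executable, "test.py"])
        print(f"CUDA_VISIBLE_DEVICES={devices} -> exit code {code}")
        print(output)
```

Setting the variable in the child's environment, rather than in the calling shell, keeps the two runs independent of each other.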

q465414859 commented 2 weeks ago

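> Adding you on WeChat is not convenient, but I can keep helping in this issue. Please answer a few more questions:
> Question 1: Which image are you using? registry.baidubce.com/paddlepaddle/paddle:2.5.2-gpu-cuda12.0-cudnn8.9-trt8.6
> Question 2: What command did you use to start the image?
> Question 3: Run nvidia-smi on the physical machine and inside the container and share the results.
> Question 4: Inside the container, run export CUDA_VISIBLE_DEVICES=0 and then paddle.utils.run_check(); then run export CUDA_VISIBLE_DEVICES=0,1 and run run_check on multiple GPUs. Please share screenshots.
> Question 5: Run df -h to check the disks. We have previously seen multi-GPU failures caused by /dev/shm being full.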

λ 179ff181fd82 /home export CUDA_VISIBLE_DEVICES=0
λ 179ff181fd82 /home python test.py
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
I0417 03:32:33.312393 143 program_interpreter.cc:212] New Executor is Running.
W0417 03:32:33.312819 143 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0417 03:32:33.313778 143 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0417 03:32:35.520290 143 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
λ 179ff181fd82 /home export CUDA_VISIBLE_DEVICES=0,1
λ 179ff181fd82 /home python test.py
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
I0417 03:33:09.172655 217 program_interpreter.cc:212] New Executor is Running.
W0417 03:33:09.173103 217 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0417 03:33:09.174027 217 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0417 03:33:11.487715 217 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
grep: grep: warning: GREP_OPTIONS is deprecated; please use an alias or scriptwarning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
Running verify PaddlePaddle program ...
I0417 03:33:15.548132 292 program_interpreter.cc:212] New Executor is Running.
W0417 03:33:15.548656 292 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0417 03:33:15.549595 292 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0417 03:33:15.611618 293 program_interpreter.cc:212] New Executor is Running.
W0417 03:33:15.612135 293 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0417 03:33:15.613081 293 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0417 03:33:17.837189 292 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
[2024-04-17 03:33:17,847] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:

There is not enough GPUs visible on your system
Some GPUs are occupied by other process now
NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html

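Thank you for the prompt reply.
Question 1: nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6
Question 2: nvidia-docker run --name paddle -it -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 /bin/bash
Question 3: nvidia-smi reports CUDA 11.4 on the physical machine and 12.0 in the virtual machine.
Question 4: With CUDA_VISIBLE_DEVICES=0 it runs without problems; with CUDA_VISIBLE_DEVICES=0,1 it fails. The error output is shown above.
Question 5: Disk space is fine.

I can share my test machine for debugging.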

tianshuo78520a commented 2 weeks ago

Could you try adding the --shm-size=128G parameter to docker run?

q465414859 commented 1 week ago

> Could you try adding the --shm-size=128G parameter to docker run?

Filesystem      Size  Used Avail Use% Mounted on
tmpfs            38G  1.6M   38G   1% /run
/dev/vda1        49G   48G     0 100% /
tmpfs           189G     0  189G   0% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs            38G  4.0K   38G   1% /run/user/0

The disk space here is fine, right?
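The same figures can be read from Python, which makes it easy to flag a full mount point. This is a minimal sketch using only the standard library; the helper name `usage_percent` is hypothetical, and the two mount points are the ones discussed in the df output above (there, `/` shows 100% and `/dev/shm` shows 0%).

```python
import shutil


def usage_percent(path):
    """Return the used share, in percent, of the filesystem that holds `path`.

    This is used/total, so it can differ slightly from the Use% column of df,
    which excludes blocks reserved for the superuser.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    for mount in ("/", "/dev/shm"):
        try:
            print(f"{mount}: {usage_percent(mount):.0f}% used")
        except OSError as exc:  # e.g. the mount point does not exist on this system
            print(f"{mount}: unavailable ({exc})")
```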

XieYunshen commented 1 week ago

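Could you check whether one of the GPUs is occupied? Could you share screenshots of running nvidia-smi on the physical machine and inside the Docker container?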

q465414859 commented 1 week ago

> Could you check whether one of the GPUs is occupied? Could you share screenshots of running nvidia-smi on the physical machine and inside the Docker container?

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Inside Docker

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Host

q465414859 commented 1 week ago

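> Could you try adding the --shm-size=128G parameter to docker run?

This time I changed the storage location of the Docker images to keep the disk from filling up, downloaded the 12.0 image again, and added the --shm-size=128G parameter to run. The problem is still the same. nvidia-smi looks normal: no processes are using the GPUs on either the host or in Docker.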

tianshuo78520a commented 1 week ago

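Driver Version: 470.182.03. I'm not sure whether this is the cause, but could the driver version be too low? Could you try a lower-version image (registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0) and see whether paddle.utils.run_check() works?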

q465414859 commented 1 week ago

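> Driver Version: 470.182.03. I'm not sure whether this is the cause, but could the driver version be too low? Could you try a lower-version image (registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0) and see whether paddle.utils.run_check() works?

Driver Version: 470.182.03 is the driver version on the host, outside the container. How would that be related to this?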

tianshuo78520a commented 1 week ago

It may be related. The image contains only CUDA, not the driver. Could you test whether a lower version works?
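Because the container relies on the host driver, one quick sanity check is to compare the host's driver version with the minimum the CUDA toolkit expects. A minimal sketch follows; the helper names are hypothetical, and the minimum of 525.60.13 for CUDA 12.0 on Linux is an assumption that should be confirmed against NVIDIA's CUDA release notes. The check also ignores NVIDIA's forward-compatibility packages, which can change the outcome.

```python
def parse_version(text):
    """Turn a dotted version string such as '470.182.03' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_meets_minimum(installed, minimum):
    """Return True if the installed driver version is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


# Assumption: minimum Linux driver for CUDA 12.0; verify in NVIDIA's release notes.
ASSUMED_MIN_DRIVER_CUDA_12_0 = "525.60.13"

if __name__ == "__main__":
    host_driver = "470.182.03"  # the version reported in this thread
    ok = driver_meets_minimum(host_driver, ASSUMED_MIN_DRIVER_CUDA_12_0)
    print(f"driver {host_driver} >= {ASSUMED_MIN_DRIVER_CUDA_12_0}: {ok}")
```

Comparing integer tuples avoids the pitfalls of comparing version strings lexically.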

q465414859 commented 1 week ago

nvidia-smi

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Switching to the 11.2 version does not work either; the error is still the same.

root@ecs-1cf4:~# nvidia-docker run --name paddle -it -v /dev/shm/paddle:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0 /bin/bash

q465414859 commented 1 week ago

> It may be related. The image contains only CUDA, not the driver. Could you test whether a lower version works?

It still does not work.

q465414859 commented 1 week ago

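> It may be related. The image contains only CUDA, not the driver. Could you test whether a lower version works?

Would you like to take a look on my test server?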

tianshuo78520a commented 1 week ago

How do I log in?

q465414859 commented 1 week ago

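> How do I log in?

I can give you the SSH address, account, and password, but it does not seem appropriate to post them here. How about adding me on WeChat?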

q465414859 commented 1 week ago

> How do I log in?

Hi, are you busy?

q465414859 commented 1 week ago

> How do I log in?

Please, the project on my side is still waiting. I would really appreciate your help!

tianshuo78520a commented 1 week ago

Writing it directly at the top level of a Python script fails:

    import paddle
    paddle.utils.run_check()

Correct form:

    if __name__ == '__main__':
        import paddle
        paddle.utils.run_check()
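The reason the guard matters can be shown without Paddle. The traceback above goes through `popen_spawn_posix.py`, so the child processes are started with the "spawn" method, which re-imports the main script in every child. A minimal sketch using only the standard library; `worker` and `run_workers` are hypothetical names standing in for the per-GPU work.

```python
import multiprocessing as mp


def worker(rank):
    """Stand-in for the per-GPU work; it only reports its rank."""
    print(f"worker {rank} started")


def run_workers(nprocs):
    """Start `nprocs` processes with the 'spawn' method and return their exit codes."""
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=worker, args=(rank,)) for rank in range(nprocs)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return [proc.exitcode for proc in procs]


if __name__ == "__main__":
    # Each spawned child re-imports this file. Without the guard, the child would
    # reach run_workers() again during import and raise the RuntimeError about the
    # "bootstrapping phase" quoted in this issue.
    print(run_workers(2))
```

Moving the `run_workers(2)` call out of the guard and to the top level of the file should reproduce the same RuntimeError, which is the situation `test.py` was in when it called `paddle.utils.run_check()` unguarded.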
