Open TingquanGao opened 2 weeks ago
Are you running paddle.utils.run_check() in your own machine environment? Have you tried checking whether paddle.utils.run_check() works inside the Paddle image? Commands:
- nvidia-docker run --name paddle -it -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 /bin/bash
- paddle.utils.run_check()
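It was run inside docker. You can see from my command above that I entered the docker container.
Which docker image are you using, the official one? Multi-GPU fails; does single-GPU work?
The official 12.0 image. Multi-GPU is what fails.
This post is mine. I don't know whether the official team can help solve this problem. If so, please add me on WeChat: 18641059137. If that is inconvenient, please reply here anyway. Thank you very much.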
Adding you on WeChat is not convenient, but we can keep helping in this issue. Please answer a few more questions:
Question 1: Which image are you using? registry.baidubce.com/paddlepaddle/paddle:2.5.2-gpu-cuda12.0-cudnn8.9-trt8.6
Question 2: What command did you use to start the image?
Question 3: Run nvidia-smi on the physical machine and inside the container, and show the output of each.
Question 4: Inside the container, run export CUDA_VISIBLE_DEVICES=0 and then paddle.utils.run_check(); then run export CUDA_VISIBLE_DEVICES=0,1 and run run_check again on multiple GPUs. Please post screenshots.
Question 5: Run df -h to check the disks. We have previously seen multi-GPU failures caused by /dev/shm being full.
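The environment checks asked for in Questions 4 and 5 can also be collected from Python. The following is only a minimal sketch: the helper names `usage_percent` and `report` are hypothetical, and the percentage is computed from `shutil.disk_usage`, so it may differ slightly from the `Use%` column that `df -h` prints.

```python
import os
import shutil


def usage_percent(path: str) -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total if usage.total else 0.0


def report(paths=("/", "/dev/shm")) -> dict:
    """Collect usage for each path that exists; missing paths are skipped."""
    return {p: round(usage_percent(p), 1) for p in paths if os.path.exists(p)}


if __name__ == "__main__":
    # Shows which GPUs the process is allowed to see (Question 4).
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
    # Shows how full the root filesystem and shared memory are (Question 5).
    for path, pct in report().items():
        print(f"{path}: {pct}% used")
```

Running it once on the host and once inside the container makes it easy to compare the two environments.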
λ 179ff181fd82 /home export CUDA_VISIBLE_DEVICES=0
λ 179ff181fd82 /home python test.py
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
I0417 03:32:33.312393 143 program_interpreter.cc:212] New Executor is Running.
W0417 03:32:33.312819 143 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0417 03:32:33.313778 143 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0417 03:32:35.520290 143 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
λ 179ff181fd82 /home export CUDA_VISIBLE_DEVICES=0,1
λ 179ff181fd82 /home python test.py
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
I0417 03:33:09.172655 217 program_interpreter.cc:212] New Executor is Running.
W0417 03:33:09.173103 217 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0417 03:33:09.174027 217 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0417 03:33:11.487715 217 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
grep: grep: warning: GREP_OPTIONS is deprecated; please use an alias or scriptwarning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
Running verify PaddlePaddle program ...
I0417 03:33:15.548132 292 program_interpreter.cc:212] New Executor is Running.
W0417 03:33:15.548656 292 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0417 03:33:15.549595 292 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0417 03:33:15.611618 293 program_interpreter.cc:212] New Executor is Running.
W0417 03:33:15.612135 293 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0417 03:33:15.613081 293 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
I0417 03:33:17.837189 292 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
[2024-04-17 03:33:17,847] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:
I can share my test machine for debugging.
Could you try adding the --shm-size=128G parameter to docker run?
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            38G  1.6M   38G   1% /run
/dev/vda1        49G   48G     0 100% /
tmpfs           189G     0  189G   0% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs            38G  4.0K   38G   1% /run/user/0
The disk space here is fine, right?
Could you check whether one of the GPUs is occupied? Could you post screenshots of nvidia-smi run on the physical machine and inside the docker container?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          On   | 00000000:00:0D.0 Off |                    0 |
| N/A   38C    P0    28W / 165W |      0MiB / 24258MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A30          On   | 00000000:00:0E.0 Off |                    0 |
| N/A   39C    P0    33W / 165W |      0MiB / 24258MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(Host)
This time I moved the docker image storage location so the disk would not fill up, downloaded the 12.0 image again, and added the --shm-size=128G parameter to the run command. The problem is still the same. nvidia-smi looks normal, and no process is occupying the GPUs on either the host or in docker.
Driver Version: 470.182.03. I am not sure whether this is the cause; could the driver version be too old? Please try a lower-version image (registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0) and see whether paddle.utils.run_check() works.
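Driver Version: 470.182.03 is the driver version on the host, outside the container. How would that be related to this?
It may be related. The image contains only CUDA, not the driver. Please test the lower-version image and see whether it works.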
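The hypothesis above can be checked mechanically: the `CUDA Version` field in the nvidia-smi header is the highest CUDA version the installed driver supports, and the image tag names the CUDA version shipped in the container. The sketch below compares the two. The function names are hypothetical, and the comparison is only a heuristic, since NVIDIA's compatibility packages can allow a newer CUDA runtime on an older driver, so a `False` result is a hint rather than a diagnosis.

```python
import re


def parse_smi_header(text: str):
    """Extract (driver_version, max_cuda_version) from nvidia-smi header text."""
    driver = re.search(r"Driver Version:\s*(\d+(?:\.\d+)*)", text)
    cuda = re.search(r"CUDA Version:\s*(\d+(?:\.\d+)*)", text)
    if not driver or not cuda:
        raise ValueError("nvidia-smi header not found")
    return driver.group(1), tuple(int(x) for x in cuda.group(1).split("."))


def image_cuda_version(tag: str):
    """Extract the CUDA version from an image tag such as '...-cuda12.0-...'."""
    match = re.search(r"cuda(\d+)\.(\d+)", tag)
    if not match:
        raise ValueError("no CUDA version in image tag")
    return int(match.group(1)), int(match.group(2))


def driver_covers_image(smi_text: str, image_tag: str) -> bool:
    """True if the driver's reported CUDA version is at least the image's."""
    _, max_cuda = parse_smi_header(smi_text)
    return max_cuda >= image_cuda_version(image_tag)


if __name__ == "__main__":
    header = "| NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.4 |"
    for tag in ("2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6", "2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0"):
        print(tag, "->", driver_covers_image(header, tag))
```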
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          On   | 00000000:00:0D.0 Off |                    0 |
| N/A   38C    P0    28W / 165W |      0MiB / 24258MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A30          On   | 00000000:00:0E.0 Off |                    0 |
| N/A   39C    P0    33W / 165W |      0MiB / 24258MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I switched to the 11.2 image and it still does not work; the error is the same.
root@ecs-1cf4:~# nvidia-docker run --name paddle -it -v /dev/shm/paddle:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0 /bin/bash
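It still does not work.
Would you like to use my test server to take a look?
How do I log in?
I can give you the SSH address, account, and password, but it does not seem appropriate to post them here. Could we connect on WeChat instead?
Hello, are you busy?
The project on our side is still waiting on this. Please, I would really appreciate the help!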
Calling it directly at the top level of a Python script fails:
import paddle
paddle.utils.run_check()
Correct usage:
if __name__ == '__main__':
    import paddle
    paddle.utils.run_check()
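The traceback in this issue passes through `multiprocessing/popen_spawn_posix.py`, so the worker processes are started with the "spawn" method: each child re-imports the main script, and any top-level call that starts processes is reached again inside the child. The sketch below reproduces that behaviour without Paddle. It is an illustration under assumptions: the standard-library `multiprocessing` module stands in for `paddle.distributed.spawn`, and the names `UNGUARDED`, `GUARDED`, and `run_script` are hypothetical.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Stand-in for an unguarded test.py: process creation happens at import time.
UNGUARDED = textwrap.dedent("""
    import multiprocessing as mp
    import sys

    def work():
        pass

    ctx = mp.get_context("spawn")
    p = ctx.Process(target=work)
    p.start()   # the spawned child re-imports this file and reaches this line again
    p.join()
    sys.exit(p.exitcode)
""")

# Same script, but process creation only happens when run as the main program.
GUARDED = textwrap.dedent("""
    import multiprocessing as mp
    import sys

    def work():
        pass

    if __name__ == "__main__":   # skipped when the child re-imports the file
        ctx = mp.get_context("spawn")
        p = ctx.Process(target=work)
        p.start()
        p.join()
        sys.exit(p.exitcode)
""")


def run_script(source: str) -> subprocess.CompletedProcess:
    """Write `source` to a temporary script and run it with the current interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "demo_script.py"
        script.write_text(source)
        return subprocess.run(
            [sys.executable, str(script)],
            capture_output=True, text=True, timeout=120,
        )


if __name__ == "__main__":
    for name, src in (("unguarded", UNGUARDED), ("guarded", GUARDED)):
        result = run_script(src)
        hit = "bootstrapping phase" in result.stderr
        print(f"{name}: exit code {result.returncode}, bootstrapping error: {hit}")
```

The same reasoning applies to `test.py` in this issue: once `paddle.utils.run_check()` sits under the `if __name__ == '__main__':` guard, the child processes no longer re-run it while they are being set up.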
Describe the Bug
The GPU driver, CUDA, cuDNN, and NCCL are all the versions that correspond to 12.0.
Running verify PaddlePaddle program ...
I0409 16:04:35.709347 7944 interpretercore.cc:237] New Executor is Running.
W0409 16:04:35.710853 7944 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0409 16:04:35.711491 7944 gpu_resources.cc:149] device: 0, cuDNN Version: 90.0.
I0409 16:04:36.996335 7944 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
Running verify PaddlePaddle program ...
Running verify PaddlePaddle program ...
I0409 16:04:41.629110 8023 interpretercore.cc:237] New Executor is Running.
W0409 16:04:41.630621 8023 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0409 16:04:41.631137 8023 gpu_resources.cc:149] device: 0, cuDNN Version: 90.0.
I0409 16:04:41.951776 8024 interpretercore.cc:237] New Executor is Running.
W0409 16:04:41.953104 8024 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 12.0
W0409 16:04:41.953649 8024 gpu_resources.cc:149] device: 0, cuDNN Version: 90.0.
I0409 16:04:42.958331 8023 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
[2024-04-09 16:04:42,967] [ WARNING] install_check.py:265 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:
NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-04-09 16:04:42,967] [ WARNING] install_check.py:275 - Original Error is: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/data/test.py", line 2, in <module>
paddle.utils.run_check()
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 282, in run_check
raise e
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 255, in run_check
_run_parallel(device_list)
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 206, in _run_parallel
paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 585, in spawn
process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
I0409 16:04:43.284149 8024 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
[2024-04-09 16:04:43,293] [ WARNING] install_check.py:265 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:
NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-04-09 16:04:43,293] [ WARNING] install_check.py:275 - Original Error is: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/data/test.py", line 2, in <module>
paddle.utils.run_check()
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 282, in run_check
raise e
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 255, in run_check
_run_parallel(device_list)
File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 206, in _run_parallel
paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 585, in spawn
process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
C++ Traceback (most recent call last):
No stack trace in paddle, may be caused by external reasons.
Error Message Summary:
FatalError:
Termination signal
is detected by the operating system.
[TimeInfo: Aborted at 1712649883 (unix time) try "date -d @1712649883" if you are using GNU date]
[SignalInfo: ** SIGTERM (@0x1f08) received by PID 8024 (TID 0x7f27c9971480) from PID 7944 ]
[2024-04-09 16:04:43,870] [ WARNING] install_check.py:265 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:
Additional Supplementary Information
refer https://github.com/PaddlePaddle/PaddleOCR/issues/11901