PaddlePaddle / Paddle

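PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)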
http://www.paddlepaddle.org/
Apache License 2.0

paddle.utils.run_check() raises an error #63872

Closed wikithink closed 1 week ago

wikithink commented 1 week ago

Issue Description

Environment: Docker 24.0, Ubuntu 20.04, 4 x 3060 GPUs, CUDA 11.8, cuDNN 8.9, Python 3.8.19, paddle_gpu_2.6.1, NCCL 2.16.5

After installation, running this in a terminal succeeds:

    python -c "import paddle; paddle.utils.run_check()"

It prints: PaddlePaddle works well on 4 GPUs. PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

Entering the same two lines in a Jupyter cell also succeeds:

    import paddle
    paddle.utils.run_check()

It prints: PaddlePaddle works well on 4 GPUs. PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

However, when those two lines are saved to a file test.py (containing nothing else) and run in the same terminal with python test.py, the following error is raised:

RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

C++ Traceback (most recent call last):

No stack trace in paddle, may be caused by external reasons.


Error Message Summary:

FatalError: Termination signal is detected by the operating system. [TimeInfo: Aborted at 1714037430 (unix time) try "date -d @1714037430" if you are using GNU date ] [SignalInfo: SIGTERM (@0xacdb) received by PID 44332 (TID 0x7f0920465740) from PID 44251 ]

WARNING:root:PaddlePaddle meets some problem with 4 GPUs. This may be caused by:

  1. There is not enough GPUs visible on your system
  2. Some GPUs are occupied by other process now
  3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html

WARNING:root: Original Error is: Process 1 terminated with exit code 1.

PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.

    Traceback (most recent call last):
      File "/root/paddle/root/dsti/paddle_env_test.py", line 5, in <module>
        paddle.utils.run_check()
      File "/usr/local/lib/python3.8/site-packages/paddle/utils/install_check.py", line 302, in run_check
        raise e
      File "/usr/local/lib/python3.8/site-packages/paddle/utils/install_check.py", line 283, in run_check
        _run_parallel(device_list)
      File "/usr/local/lib/python3.8/site-packages/paddle/utils/install_check.py", line 210, in _run_parallel
        paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
      File "/usr/local/lib/python3.8/site-packages/paddle/distributed/spawn.py", line 614, in spawn
        while not context.join():
      File "/usr/local/lib/python3.8/site-packages/paddle/distributed/spawn.py", line 423, in join
        self._throw_exception(error_index)
      File "/usr/local/lib/python3.8/site-packages/paddle/distributed/spawn.py", line 435, in _throw_exception
        raise Exception(
    Exception: Process 1 terminated with exit code 1.
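The RuntimeError above is Python's standard message for starting a child process from unguarded top-level code in a script, and the traceback passes through paddle.distributed.spawn. A likely cause, though not confirmed here, is therefore the missing `if __name__ == '__main__':` guard in the script rather than NCCL. The sketch below shows the idiom with the standard library only; `worker` and `main` are hypothetical stand-ins, and in test.py the guarded call would presumably be `paddle.utils.run_check()`.

```python
import multiprocessing as mp


def worker(rank):
    # Hypothetical stand-in for a per-GPU check; it only reports its rank.
    print(f"worker {rank} started")


def main(nprocs=2):
    # The "spawn" start method launches a fresh interpreter per child,
    # and each child re-imports this file before running its target.
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=worker, args=(rank,)) for rank in range(nprocs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]


if __name__ == "__main__":
    # Without this guard, every child would re-run the process-starting code
    # during import and fail with the "bootstrapping phase" RuntimeError.
    print(main())
```

If the guard is the cause, moving the `paddle.utils.run_check()` call under `if __name__ == "__main__":` in the script should be enough; the import itself can stay at the top of the file.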

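What I have tried:

  1. Reinstalled NCCL and symlinked libnccl.so:

         ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2.16.5 /usr/lib64/libnccl.so
         ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2.16.5 /usr/local/bin/libnccl.so

  2. Checked the environment variable:

         export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin:/usr/lib64

  3. Asked ChatGPT, which did not solve it.

None of this helped and the error persists. Any help would be appreciated, thanks!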
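To cross-check the NCCL-related attempts, one option is to ask Python whether the dynamic loader can actually resolve and open the library. This is only a rough sketch using the standard `ctypes` module. The helper names `locate_nccl` and `can_load` and the candidate file names are illustrative assumptions, and a successful load does not prove that Paddle uses the same copy.

```python
import ctypes
import ctypes.util


def locate_nccl():
    """Return the name the system resolves for the NCCL library, or None."""
    return ctypes.util.find_library("nccl")


def can_load(name):
    """Try to open the shared library; return True if the loader accepts it."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    name = locate_nccl()
    print("find_library:", name)
    # Candidate names are assumptions; adjust to the files present on the system.
    for candidate in filter(None, [name, "libnccl.so", "libnccl.so.2"]):
        print(candidate, "loadable:", can_load(candidate))
```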

Version & Environment Information

Environment: Docker 24.0, Ubuntu 20.04, 4 x 3060 GPUs, CUDA 11.8, cuDNN 8.9, Python 3.8.19, paddle_gpu_2.6.1, NCCL 2.16.5