PaddlePaddle / PaddleYOLO

🚀🚀🚀 YOLO series of PaddlePaddle implementation, PP-YOLOE+, RT-DETR, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOX, YOLOv5u, YOLOv7u, YOLOv6Lite, RTMDet and so on. 🚀🚀🚀
https://github.com/PaddlePaddle/PaddleYOLO
GNU General Public License v3.0
534 stars, 132 forks

Single machine, multiple GPUs: after pulling the GPU Docker image, the installation check can only use a single GPU; NCCL 2.8.3 is installed #198

Closed overjjjj closed 4 months ago

overjjjj commented 8 months ago

Docker container created with:

docker run --gpus all -it --ipc=host --name it_paddle -h it_paddle_host --shm-size 160G -v /home/guest/workspace/ --workdir=/workspace registry.baidubce.com/paddlepaddle/paddle:2.5.1-gpu-cuda11.2-cudnn8.2-trt8.0

NCCL 2.8.3 was installed by extracting the txz archive. After entering the Python interpreter, the error output is as follows:

paddle.utils.run_check()
Running verify PaddlePaddle program ...
W1030 07:52:57.083344 1992 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.2, Runtime API Version: 10.2
W1030 07:52:57.086133 1992 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
PaddlePaddle works well on 1 GPU.
W1030 07:52:57.775635 1992 parallel_executor.cc:642] Cannot enable P2P access from 0 to 1
W1030 07:52:57.775660 1992 parallel_executor.cc:642] Cannot enable P2P access from 0 to 2
W1030 07:52:57.775666 1992 parallel_executor.cc:642] Cannot enable P2P access from 0 to 3
W1030 07:52:57.775669 1992 parallel_executor.cc:642] Cannot enable P2P access from 0 to 4
W1030 07:52:57.775676 1992 parallel_executor.cc:642] Cannot enable P2P access from 0 to 5
W1030 07:52:57.775677 1992 parallel_executor.cc:642] Cannot enable P2P access from 0 to 6
W1030 07:52:57.775682 1992 parallel_executor.cc:642] Cannot enable P2P access from 0 to 7
W1030 07:52:57.775687 1992 parallel_executor.cc:642] Cannot enable P2P access from 1 to 0
W1030 07:52:57.775692 1992 parallel_executor.cc:642] Cannot enable P2P access from 1 to 2
W1030 07:52:57.775696 1992 parallel_executor.cc:642] Cannot enable P2P access from 1 to 3
W1030 07:52:57.775698 1992 parallel_executor.cc:642] Cannot enable P2P access from 1 to 4
W1030 07:52:57.775703 1992 parallel_executor.cc:642] Cannot enable P2P access from 1 to 5
W1030 07:52:57.775709 1992 parallel_executor.cc:642] Cannot enable P2P access from 1 to 6
W1030 07:52:57.775717 1992 parallel_executor.cc:642] Cannot enable P2P access from 1 to 7
W1030 07:52:57.775719 1992 parallel_executor.cc:642] Cannot enable P2P access from 2 to 0
W1030 07:52:57.775723 1992 parallel_executor.cc:642] Cannot enable P2P access from 2 to 1
W1030 07:52:57.775727 1992 parallel_executor.cc:642] Cannot enable P2P access from 2 to 3
W1030 07:52:57.775733 1992 parallel_executor.cc:642] Cannot enable P2P access from 2 to 4
W1030 07:52:57.775738 1992 parallel_executor.cc:642] Cannot enable P2P access from 2 to 5
W1030 07:52:57.775741 1992 parallel_executor.cc:642] Cannot enable P2P access from 2 to 6
W1030 07:52:57.775744 1992 parallel_executor.cc:642] Cannot enable P2P access from 2 to 7
W1030 07:52:57.775748 1992 parallel_executor.cc:642] Cannot enable P2P access from 3 to 0
W1030 07:52:57.775753 1992 parallel_executor.cc:642] Cannot enable P2P access from 3 to 1
W1030 07:52:57.775758 1992 parallel_executor.cc:642] Cannot enable P2P access from 3 to 2
W1030 07:52:57.775763 1992 parallel_executor.cc:642] Cannot enable P2P access from 3 to 4
W1030 07:52:57.775768 1992 parallel_executor.cc:642] Cannot enable P2P access from 3 to 5
W1030 07:52:57.775774 1992 parallel_executor.cc:642] Cannot enable P2P access from 3 to 6
W1030 07:52:57.775779 1992 parallel_executor.cc:642] Cannot enable P2P access from 3 to 7
W1030 07:52:57.775782 1992 parallel_executor.cc:642] Cannot enable P2P access from 4 to 0
W1030 07:52:57.775789 1992 parallel_executor.cc:642] Cannot enable P2P access from 4 to 1
W1030 07:52:57.775792 1992 parallel_executor.cc:642] Cannot enable P2P access from 4 to 2
W1030 07:52:57.775796 1992 parallel_executor.cc:642] Cannot enable P2P access from 4 to 3
W1030 07:52:57.775801 1992 parallel_executor.cc:642] Cannot enable P2P access from 4 to 5
W1030 07:52:57.775806 1992 parallel_executor.cc:642] Cannot enable P2P access from 4 to 6
W1030 07:52:57.775810 1992 parallel_executor.cc:642] Cannot enable P2P access from 4 to 7
W1030 07:52:57.775816 1992 parallel_executor.cc:642] Cannot enable P2P access from 5 to 0
W1030 07:52:57.775821 1992 parallel_executor.cc:642] Cannot enable P2P access from 5 to 1
W1030 07:52:57.775825 1992 parallel_executor.cc:642] Cannot enable P2P access from 5 to 2
W1030 07:52:57.775830 1992 parallel_executor.cc:642] Cannot enable P2P access from 5 to 3
W1030 07:52:57.775835 1992 parallel_executor.cc:642] Cannot enable P2P access from 5 to 4
W1030 07:52:57.775836 1992 parallel_executor.cc:642] Cannot enable P2P access from 5 to 6
W1030 07:52:57.775844 1992 parallel_executor.cc:642] Cannot enable P2P access from 5 to 7
W1030 07:52:57.775849 1992 parallel_executor.cc:642] Cannot enable P2P access from 6 to 0
W1030 07:52:57.775856 1992 parallel_executor.cc:642] Cannot enable P2P access from 6 to 1
W1030 07:52:57.775861 1992 parallel_executor.cc:642] Cannot enable P2P access from 6 to 2
W1030 07:52:57.775864 1992 parallel_executor.cc:642] Cannot enable P2P access from 6 to 3
W1030 07:52:57.775868 1992 parallel_executor.cc:642] Cannot enable P2P access from 6 to 4
W1030 07:52:57.775871 1992 parallel_executor.cc:642] Cannot enable P2P access from 6 to 5
W1030 07:52:57.775875 1992 parallel_executor.cc:642] Cannot enable P2P access from 6 to 7
W1030 07:52:57.775878 1992 parallel_executor.cc:642] Cannot enable P2P access from 7 to 0
W1030 07:52:57.775883 1992 parallel_executor.cc:642] Cannot enable P2P access from 7 to 1
W1030 07:52:57.775888 1992 parallel_executor.cc:642] Cannot enable P2P access from 7 to 2
W1030 07:52:57.775893 1992 parallel_executor.cc:642] Cannot enable P2P access from 7 to 3
W1030 07:52:57.775897 1992 parallel_executor.cc:642] Cannot enable P2P access from 7 to 4
W1030 07:52:57.775902 1992 parallel_executor.cc:642] Cannot enable P2P access from 7 to 5
W1030 07:52:57.775907 1992 parallel_executor.cc:642] Cannot enable P2P access from 7 to 6
W1030 07:53:00.806622 1992 fuse_all_reduce_op_pass.cc:76] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 2.
WARNING:root:PaddlePaddle meets some problem with 8 GPUs. This may be caused by:

  1. There is not enough GPUs visible on your system
  2. Some GPUs are occupied by other process now
  3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html

WARNING:root: Original Error is: (External) CUDA error(209), no kernel image is available for execution on the device.
[Hint: 'cudaErrorNoKernelImageForDevice'. This indicates that there is no kernel image available that is suitable for the device. This can occur when a user specifiescode generation options for a particular CUDA source file that do not include the corresponding device configuration.] (at /paddle/paddle/fluid/framework/details/all_reduce_op_handle.cc:299)

PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
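
One thing that stands out in the log above is that the GPU reports Compute Capability 8.9 while the Runtime API Version is 10.2. The sketch below pulls those numbers out of the gpu_resources.cc line and compares the runtime against the oldest CUDA toolkit that can target that capability natively. The lookup table is an assumption based on NVIDIA's CUDA release notes, and check_runtime is a hypothetical helper name, not part of Paddle. A positive result is only a hint about where to look for the cause of the "no kernel image" error, not a confirmed diagnosis.

```python
import re

# Assumption: oldest CUDA toolkit release able to generate native code for
# each compute capability, per NVIDIA's CUDA release notes. Extend as needed.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),
    (7, 5): (10, 0),
    (8, 0): (11, 0),
    (8, 6): (11, 1),
    (8, 9): (11, 8),
    (9, 0): (11, 8),
}

# Matches the version summary that Paddle prints from gpu_resources.cc.
LOG_PATTERN = re.compile(
    r"GPU Compute Capability: (\d+)\.(\d+), "
    r"Driver API Version: (\d+)\.(\d+), "
    r"Runtime API Version: (\d+)\.(\d+)"
)


def check_runtime(log_line):
    """Return a summary dict, or None if the line cannot be interpreted."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        return None
    cc_major, cc_minor, _drv_major, _drv_minor, rt_major, rt_minor = map(
        int, match.groups()
    )
    capability = (cc_major, cc_minor)
    runtime = (rt_major, rt_minor)
    required = MIN_CUDA_FOR_CAPABILITY.get(capability)
    if required is None:
        # Capability not in the table: no opinion.
        return None
    return {
        "capability": capability,
        "runtime": runtime,
        "required": required,
        # Tuples compare element by element, so (10, 2) < (11, 8).
        "runtime_too_old": runtime < required,
    }


if __name__ == "__main__":
    line = (
        "W1030 07:52:57.083344 1992 gpu_resources.cc:61] Please NOTE: "
        "device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.2, "
        "Runtime API Version: 10.2"
    )
    print(check_runtime(line))
```

If runtime_too_old is true, it may be worth checking which CUDA build of Paddle and NCCL is actually loaded inside the container before investigating P2P or shared-memory settings.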

nemonameless commented 8 months ago

I suggest pulling a slightly older Paddle Docker image. Reinstalling a different version of Paddle inside the container also works.

rocketbear commented 8 months ago

I happened to run into this problem myself in the last couple of days. The solution was docker run --ipc=host

nemonameless commented 4 months ago

Thanks for the suggestion.