Multi-GPU training hangs and does not continue

Search before asking

[X] I have searched the issues and found no similar bug report.

Bug Component

No response

Describe the Bug

During multi-GPU training, the run hangs after the line "W0913 14:56:13.162582 18935 gpu_resources.cc:149] device: 0, cuDNN Version: 8.8." is printed. No further log output appears and training does not proceed.

Training command:

(paddleseg) kb@gpu01:/data-r10/kb/Projects/PaddleYOLO$ python -m paddle.distributed.launch --log_dir=./log_vima_dir --gpus 0,1,2,3,4,5,6,7 tools/train.py -c ./configs/yolov5/yolov5_s_80e_ssod_finetune_vima_coco.yml --eval --amp --use_vdl=True --vdl_log_dir=vdl_vima_dir/scalar

Log output:

LAUNCH INFO 2023-09-13 14:56:09,577 -----------  Configuration  ----------------------
LAUNCH INFO 2023-09-13 14:56:09,577 auto_parallel_config: None
LAUNCH INFO 2023-09-13 14:56:09,577 devices: 0,1,2,3,4,5,6,7
LAUNCH INFO 2023-09-13 14:56:09,577 elastic_level: -1
LAUNCH INFO 2023-09-13 14:56:09,577 elastic_timeout: 30
LAUNCH INFO 2023-09-13 14:56:09,577 gloo_port: 6767
LAUNCH INFO 2023-09-13 14:56:09,577 host: None
LAUNCH INFO 2023-09-13 14:56:09,577 ips: None
LAUNCH INFO 2023-09-13 14:56:09,577 job_id: default
LAUNCH INFO 2023-09-13 14:56:09,577 legacy: False
LAUNCH INFO 2023-09-13 14:56:09,577 log_dir: ./log_vima_dir
LAUNCH INFO 2023-09-13 14:56:09,577 log_level: INFO
LAUNCH INFO 2023-09-13 14:56:09,577 log_overwrite: False
LAUNCH INFO 2023-09-13 14:56:09,577 master: None
LAUNCH INFO 2023-09-13 14:56:09,577 max_restart: 3
LAUNCH INFO 2023-09-13 14:56:09,577 nnodes: 1
LAUNCH INFO 2023-09-13 14:56:09,577 nproc_per_node: None
LAUNCH INFO 2023-09-13 14:56:09,577 rank: -1
LAUNCH INFO 2023-09-13 14:56:09,577 run_mode: collective
LAUNCH INFO 2023-09-13 14:56:09,577 server_num: None
LAUNCH INFO 2023-09-13 14:56:09,577 servers: 
LAUNCH INFO 2023-09-13 14:56:09,578 start_port: 6070
LAUNCH INFO 2023-09-13 14:56:09,578 trainer_num: None
LAUNCH INFO 2023-09-13 14:56:09,578 trainers: 
LAUNCH INFO 2023-09-13 14:56:09,578 training_script: tools/train.py
LAUNCH INFO 2023-09-13 14:56:09,578 training_script_args: ['-c', './configs/yolov5/yolov5_s_80e_ssod_finetune_vima_coco.yml', '--eval', '--amp', '--use_vdl=True', '--vdl_log_dir=vdl_vima_dir/scalar']
LAUNCH INFO 2023-09-13 14:56:09,578 with_gloo: 1
LAUNCH INFO 2023-09-13 14:56:09,578 --------------------------------------------------
LAUNCH INFO 2023-09-13 14:56:09,579 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2023-09-13 14:56:09,590 Run Pod: rqvwxn, replicas 8, status ready
LAUNCH INFO 2023-09-13 14:56:09,744 Watching Pod: rqvwxn, replicas 8, status running
Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0913 14:56:11.313481 18935 tcp_utils.cc:181] The server starts to listen on IP_ANY:41751
I0913 14:56:11.313777 18935 tcp_utils.cc:130] Successfully connected to 10.3.15.202:41751
W0913 14:56:13.161374 18935 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0913 14:56:13.161420 18935 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.7
W0913 14:56:13.162582 18935 gpu_resources.cc:149] device: 0, cuDNN Version: 8.8.

GPU information:

Wed Sep 13 15:01:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1A:00.0 Off |                  N/A |
| 34%   60C    P2    79W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 36%   63C    P2    75W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:3D:00.0 Off |                  N/A |
| 29%   52C    P2    73W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:3E:00.0 Off |                  N/A |
| 33%   59C    P2    79W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:88:00.0 Off |                  N/A |
| 31%   55C    P2    82W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:89:00.0 Off |                  N/A |
| 37%   64C    P2    79W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:B1:00.0 Off |                  N/A |
| 31%   55C    P2    77W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:B2:00.0 Off |                  N/A |
| 32%   57C    P2    75W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     18935      C   ...envs/paddleseg/bin/python      203MiB |
|    1   N/A  N/A     18937      C   ...envs/paddleseg/bin/python      203MiB |
|    2   N/A  N/A     18939      C   ...envs/paddleseg/bin/python      203MiB |
|    3   N/A  N/A     18941      C   ...envs/paddleseg/bin/python      203MiB |
|    4   N/A  N/A     18946      C   ...envs/paddleseg/bin/python      203MiB |
|    5   N/A  N/A     18948      C   ...envs/paddleseg/bin/python      203MiB |
|    6   N/A  N/A     18954      C   ...envs/paddleseg/bin/python      203MiB |
|    7   N/A  N/A     18957      C   ...envs/paddleseg/bin/python      203MiB |
+-----------------------------------------------------------------------------+

Checking CPU usage with htop shows that a fixed set of cores is pinned at 100% and never drops.
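The log above also contains a warning that the GPU architecture does not match the architectures the installed Paddle wheel was built for. The following is a minimal sketch that checks this mechanically from the two warning lines, assuming the line format shown in the log. The helper names (`compiled_archs`, `device_arch`, `is_compatible`) are hypothetical and not part of Paddle, and whether this mismatch is related to the hang is not established here.

```python
import re

# Both strings are copied verbatim from the log output in this report.
ARCH_WARNING = (
    "The GPU architecture in your current machine is Pascal, which is not "
    "compatible with Paddle installation with arch: 70 75 80 86 , it is "
    "recommended to install the corresponding wheel package according to the "
    "installation information on the official Paddle website."
)
CAPABILITY_LINE = (
    "Please NOTE: device: 0, GPU Compute Capability: 6.1, "
    "Driver API Version: 11.7, Runtime API Version: 11.7"
)


def compiled_archs(warning):
    """Return the arch codes listed after 'with arch:' (hypothetical helper)."""
    match = re.search(r"with arch:\s*((?:\d+\s+)+)", warning)
    return set(match.group(1).split()) if match else set()


def device_arch(line):
    """Turn 'GPU Compute Capability: 6.1' into the arch code '61'."""
    match = re.search(r"GPU Compute Capability:\s*(\d+)\.(\d+)", line)
    return match.group(1) + match.group(2) if match else None


def is_compatible(warning, capability_line):
    """True only if the device arch code is among the wheel's listed archs."""
    return device_arch(capability_line) in compiled_archs(warning)


if __name__ == "__main__":
    print("device arch:", device_arch(CAPABILITY_LINE))
    print("wheel archs:", sorted(compiled_archs(ARCH_WARNING)))
    print("compatible:", is_compatible(ARCH_WARNING, CAPABILITY_LINE))
```

For the lines in this report, the device arch code (61) is not in the wheel's list (70 75 80 86), which agrees with the warning text itself.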

Environment

paddlepaddle-gpu 2.5.1.post117 pypi_0 pypi
cudatoolkit 11.7.0 hd8887f6_10 nvidia

Bug description confirmation

[X] I confirm that the bug replication steps, code change instructions, and environment information have been provided, and the problem can be reproduced.

Are you willing to submit a PR?

[ ] I'd like to help by submitting a PR!

PaddlePaddle / PaddleYOLO

Multi-GPU training hangs and does not continue #188