🚀🚀🚀 YOLO series of PaddlePaddle implementation, PP-YOLOE+, RT-DETR, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv10, YOLOX, YOLOv5u, YOLOv7u, YOLOv6Lite, RTMDet and so on. 🚀🚀🚀
LAUNCH INFO 2023-09-13 14:56:09,577 ----------- Configuration ----------------------
LAUNCH INFO 2023-09-13 14:56:09,577 auto_parallel_config: None
LAUNCH INFO 2023-09-13 14:56:09,577 devices: 0,1,2,3,4,5,6,7
LAUNCH INFO 2023-09-13 14:56:09,577 elastic_level: -1
LAUNCH INFO 2023-09-13 14:56:09,577 elastic_timeout: 30
LAUNCH INFO 2023-09-13 14:56:09,577 gloo_port: 6767
LAUNCH INFO 2023-09-13 14:56:09,577 host: None
LAUNCH INFO 2023-09-13 14:56:09,577 ips: None
LAUNCH INFO 2023-09-13 14:56:09,577 job_id: default
LAUNCH INFO 2023-09-13 14:56:09,577 legacy: False
LAUNCH INFO 2023-09-13 14:56:09,577 log_dir: ./log_vima_dir
LAUNCH INFO 2023-09-13 14:56:09,577 log_level: INFO
LAUNCH INFO 2023-09-13 14:56:09,577 log_overwrite: False
LAUNCH INFO 2023-09-13 14:56:09,577 master: None
LAUNCH INFO 2023-09-13 14:56:09,577 max_restart: 3
LAUNCH INFO 2023-09-13 14:56:09,577 nnodes: 1
LAUNCH INFO 2023-09-13 14:56:09,577 nproc_per_node: None
LAUNCH INFO 2023-09-13 14:56:09,577 rank: -1
LAUNCH INFO 2023-09-13 14:56:09,577 run_mode: collective
LAUNCH INFO 2023-09-13 14:56:09,577 server_num: None
LAUNCH INFO 2023-09-13 14:56:09,577 servers:
LAUNCH INFO 2023-09-13 14:56:09,578 start_port: 6070
LAUNCH INFO 2023-09-13 14:56:09,578 trainer_num: None
LAUNCH INFO 2023-09-13 14:56:09,578 trainers:
LAUNCH INFO 2023-09-13 14:56:09,578 training_script: tools/train.py
LAUNCH INFO 2023-09-13 14:56:09,578 training_script_args: ['-c', './configs/yolov5/yolov5_s_80e_ssod_finetune_vima_coco.yml', '--eval', '--amp', '--use_vdl=True', '--vdl_log_dir=vdl_vima_dir/scalar']
LAUNCH INFO 2023-09-13 14:56:09,578 with_gloo: 1
LAUNCH INFO 2023-09-13 14:56:09,578 --------------------------------------------------
LAUNCH INFO 2023-09-13 14:56:09,579 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2023-09-13 14:56:09,590 Run Pod: rqvwxn, replicas 8, status ready
LAUNCH INFO 2023-09-13 14:56:09,744 Watching Pod: rqvwxn, replicas 8, status running
Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0913 14:56:11.313481 18935 tcp_utils.cc:181] The server starts to listen on IP_ANY:41751
I0913 14:56:11.313777 18935 tcp_utils.cc:130] Successfully connected to 10.3.15.202:41751
W0913 14:56:13.161374 18935 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0913 14:56:13.161420 18935 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.7
W0913 14:56:13.162582 18935 gpu_resources.cc:149] device: 0, cuDNN Version: 8.8.
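The two `gpu_resources.cc` warnings above point to one thing worth ruling out: the installed wheel reports being built for arch `70 75 80 86`, while device 0 reports Compute Capability 6.1 (Pascal). The sketch below is a hypothetical triage helper, not part of Paddle or ppdet. It parses those two warning lines from a saved log and checks whether any compiled arch shares the device's major version with an equal or lower minor version. The function name, the regular expressions, and that compatibility rule are assumptions made for illustration. The rule ignores PTX forward compatibility, and a mismatch is only a lead, not a confirmed cause of the hang.

```python
import re

# Hypothetical patterns, tailored to the two warning lines quoted in this issue.
ARCH_RE = re.compile(r"Paddle installation with arch: ([\d ]+)")
CC_RE = re.compile(r"GPU Compute Capability: (\d+)\.(\d+)")


def check_arch(log_text):
    """Parse a Paddle log and report whether the device arch is covered.

    Returns (device_arch, compiled_archs, covered), or None when either
    warning line is missing from the log.
    """
    arch_match = ARCH_RE.search(log_text)
    cc_match = CC_RE.search(log_text)
    if not (arch_match and cc_match):
        return None
    compiled = [int(a) for a in arch_match.group(1).split()]
    device = int(cc_match.group(1)) * 10 + int(cc_match.group(2))
    # Assumed rule: binary code built for X.Y runs on devices X.Z with Z >= Y.
    covered = any(
        a // 10 == device // 10 and a % 10 <= device % 10 for a in compiled
    )
    return device, compiled, covered


if __name__ == "__main__":
    sample = (
        "The GPU architecture in your current machine is Pascal, which is not "
        "compatible with Paddle installation with arch: 70 75 80 86 , it is\n"
        "Please NOTE: device: 0, GPU Compute Capability: 6.1, "
        "Driver API Version: 11.7, Runtime API Version: 11.7\n"
    )
    print(check_arch(sample))  # (61, [70, 75, 80, 86], False)
```

If `covered` is `False`, as it is for the log in this issue, installing a wheel built for the device's architecture (which the warning itself recommends) is a reasonable first step before debugging the multi-GPU hang further.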
[X] I confirm that the bug reproduction steps, a description of any code changes, and environment information have been provided, and that the problem can be reproduced.
Search before asking
Bug Component
No response
Describe the Bug
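During multi-GPU training, the process hangs at
W0913 14:56:13.162582 18935 gpu_resources.cc:149] device: 0, cuDNN Version: 8.8.
After this line, no new output is printed and training does not proceed. Training command: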
Log output:
GPU information:
Checking CPU usage with htop shows that the same few cores spike to 100% and stay there.
Environment
paddlepaddle-gpu 2.5.1.post117 pypi_0 pypi
cudatoolkit 11.7.0 hd8887f6_10 nvidia
Bug description confirmation
Are you willing to submit a PR?