PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0
12.65k stars 2.87k forks

Training on a COCO dataset hangs: stuck at `if isinstance(item, collections.Sequence) and len(item) == 0:` #3006

Open xupengao opened 3 years ago

xupengao commented 3 years ago
  if isinstance(item, collections.Sequence) and len(item) == 0:
/data/xupengao/PaddleDetection/static/ppdet/data/reader.py:89: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
  if isinstance(item, collections.Sequence) and len(item) == 0:
/data/xupengao/PaddleDetection/static/ppdet/data/reader.py:89: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
  if isinstance(item, collections.Sequence) and len(item) == 0:
/data/xupengao/PaddleDetection/static/ppdet/data/reader.py:89: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
  if isinstance(item, collections.Sequence) and len(item) == 0:

It stays stuck here, and GPU memory is completely full.
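The DeprecationWarning in the log is raised because `collections.Sequence` is an alias for the abstract base class that lives in `collections.abc`. The following is a minimal sketch of the same emptiness check written against `collections.abc`, assuming the logic at `reader.py:89` is otherwise unchanged; the function name `is_empty_sequence` is hypothetical and not part of PaddleDetection. This only silences the warning and is not known to be related to the hang.

```python
from collections.abc import Sequence


def is_empty_sequence(item):
    """Return True only for a Sequence (list, tuple, str, ...) of length 0.

    Hypothetical helper mirroring the condition shown in the log:
    `isinstance(item, collections.Sequence) and len(item) == 0`.
    """
    return isinstance(item, Sequence) and len(item) == 0


if __name__ == "__main__":
    print(is_empty_sequence([]))      # empty list
    print(is_empty_sequence([1, 2]))  # non-empty list
    print(is_empty_sequence(0))       # not a sequence at all
```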

qingqing01 commented 3 years ago

@xupengao Are there any other error messages? Are you using the official COCO dataset? What is your machine environment?

heavengate commented 3 years ago

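Hello, which model are you training? Is it a custom dataset, and does the dataset have any particular characteristics? It looks as though the reader never gets a valid sample out of the dataset.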
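One way to test the "no valid samples" hypothesis is to inspect the converted annotation file directly. The sketch below uses only the standard COCO JSON layout (`images`, `annotations`, `categories`, `bbox` as `[x, y, width, height]`). The function names and the validity rules (known image id, known category id, positive width and height) are assumptions made for illustration; they are not PaddleDetection's own filtering logic.

```python
import json
from collections import Counter


def summarize_coco(coco):
    """Report images without usable boxes in a COCO-style dict.

    Assumed validity rule (illustrative only): an annotation counts as
    valid if it points to a known image and category and its bbox is
    [x, y, w, h] with w > 0 and h > 0.
    """
    image_ids = {img["id"] for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}

    valid_per_image = Counter()
    invalid_annotation_ids = []
    for ann in coco.get("annotations", []):
        bbox = ann.get("bbox") or []
        ok = (
            ann.get("image_id") in image_ids
            and ann.get("category_id") in category_ids
            and len(bbox) == 4
            and bbox[2] > 0
            and bbox[3] > 0
        )
        if ok:
            valid_per_image[ann["image_id"]] += 1
        else:
            invalid_annotation_ids.append(ann.get("id"))

    empty_images = sorted(i for i in image_ids if valid_per_image[i] == 0)
    return {
        "num_images": len(image_ids),
        "images_without_valid_boxes": empty_images,
        "invalid_annotation_ids": invalid_annotation_ids,
    }


def summarize_coco_file(path):
    """Load a COCO annotation JSON from `path` and summarize it."""
    with open(path, "r", encoding="utf-8") as f:
        return summarize_coco(json.load(f))
```

If `images_without_valid_boxes` covers most of the dataset, the conversion script is the first thing to check.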

xupengao commented 3 years ago

@xupengao Are there any other error messages? Are you using the official COCO dataset? What is your machine environment?

There are no errors. It is a custom dataset converted to COCO format; I am using CUDA 11 and the GPU is a 3090.

xupengao commented 3 years ago

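Hello, which model are you training? Is it a custom dataset, and does the dataset have any particular characteristics? It looks as though the reader never gets a valid sample out of the dataset.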

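I am training PPYOLOV2 on a custom dataset converted to COCO format. There is nothing special about the dataset. GPU memory usage is very high. If I kill the hung processes to free the GPU memory, training can run, but it still hangs occasionally.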

heavengate commented 3 years ago

The default PP-YOLOv2 configuration was trained on 32G V100 GPUs. If your GPU has less than 32G of memory, you can reduce the batch size accordingly.
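When the batch size is reduced, a commonly used heuristic (the linear scaling rule) is to reduce the learning rate by the same factor. The sketch below shows only the arithmetic. The function name is hypothetical and the numbers in the usage line are placeholders, not the values from the PP-YOLOv2 default configuration.

```python
def scale_learning_rate(base_lr, base_total_batch, new_total_batch):
    """Linear scaling rule: keep lr / total_batch_size constant.

    total batch size = per-GPU batch size * number of GPUs.
    This is a heuristic, not a guarantee of stable training.
    """
    if base_total_batch <= 0 or new_total_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_total_batch / base_total_batch


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(scale_learning_rate(base_lr=0.01, base_total_batch=64, new_total_batch=16))
```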

xupengao commented 3 years ago

The default PP-YOLOv2 configuration was trained on 32G V100 GPUs. If your GPU has less than 32G of memory, you can reduce the batch size accordingly.

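I have already set it very small. Now, for some reason and without changing anything, training works with the static graph, but after a little over a thousand iterations all the losses suddenly became NaN.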

[INFO 2021-05-17 08:31:51,932 train.py:302] iter: 1100, lr: 0.005000, 'loss_xy': '1.190825', 'loss_wh': '2.728956', 'loss_obj': '11.797029', 'loss_cls': '0.496593', 'loss_iou': '4.351842', 'loss_iou_aware': '0.794992', 'loss': '21.993271', eta: 14:52:15, batch_cost: 0.54131 sec, ips: 3.69473 images/sec
2021-05-17 08:32:02,620 - INFO - iter: 1120, lr: 0.005000, 'loss_xy': '1.227939', 'loss_wh': '2.824287', 'loss_obj': '12.253795', 'loss_cls': '0.534837', 'loss_iou': '4.673498', 'loss_iou_aware': '0.792506', 'loss': '21.761158', eta: 14:45:19, batch_cost: 0.53721 sec, ips: 3.72291 images/sec
[INFO 2021-05-17 08:32:02,620 train.py:302] iter: 1120, lr: 0.005000, 'loss_xy': '1.227939', 'loss_wh': '2.824287', 'loss_obj': '12.253795', 'loss_cls': '0.534837', 'loss_iou': '4.673498', 'loss_iou_aware': '0.792506', 'loss': '21.761158', eta: 14:45:19, batch_cost: 0.53721 sec, ips: 3.72291 images/sec
2021-05-17 08:32:13,359 - INFO - iter: 1140, lr: 0.005000, 'loss_xy': 'nan', 'loss_wh': 'nan', 'loss_obj': 'nan', 'loss_cls': 'nan', 'loss_iou': 'nan', 'loss_iou_aware': 'nan', 'loss': 'nan', eta: 14:35:01, batch_cost: 0.53107 sec, ips: 3.76597 images/sec
[INFO 2021-05-17 08:32:13,359 train.py:302] iter: 1140, lr: 0.005000, 'loss_xy': 'nan', 'loss_wh': 'nan', 'loss_obj': 'nan', 'loss_cls': 'nan', 'loss_iou': 'nan', 'loss_iou_aware': 'nan', 'loss': 'nan', eta: 14:35:01, batch_cost: 0.53107 sec, ips: 3.76597 images/sec
2021-05-17 08:32:23,837 - INFO - iter: 1160, lr: 0.005000, 'loss_xy': 'nan', 'loss_wh': 'nan', 'loss_obj': 'nan', 'loss_cls': 'nan', 'loss_iou': 'nan', 'loss_iou_aware': 'nan', 'loss': 'nan', eta: 14:34:53, batch_cost: 0.53110 sec, ips: 3.76578 images/sec
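To pin down where the loss first diverges, the training log can be scanned for the first iteration whose total `'loss'` is NaN. This is a minimal sketch that assumes the log format shown above (`iter: N, ... 'loss': 'value'`). The function name `first_nan_iter` is hypothetical and not part of PaddleDetection.

```python
import math
import re

# Matches one log entry: the iteration number, then (non-greedily) the
# total loss value. `'loss': '` does not match `'loss_xy': '` and similar.
ENTRY = re.compile(r"iter: (\d+),.*?'loss': '\s*([^']+)'", re.DOTALL)


def first_nan_iter(log_text):
    """Return the first iteration whose total loss is NaN, or None."""
    for match in ENTRY.finditer(log_text):
        iteration = int(match.group(1))
        loss = float(match.group(2))  # float('nan') parses as NaN
        if math.isnan(loss):
            return iteration
    return None


if __name__ == "__main__":
    sample = (
        "iter: 1120, lr: 0.005000, 'loss_xy': '1.227939', 'loss': '21.761158', eta: 14:45:19\n"
        "iter: 1140, lr: 0.005000, 'loss_xy': 'nan', 'loss': 'nan', eta: 14:35:01\n"
    )
    print(first_nan_iter(sample))
```

In the log above, iteration 1120 still reports finite losses and iteration 1140 is the first logged entry with NaN, so the divergence happened somewhere in that 20-iteration window.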