PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

PPYOLOE+_S code fails partway through training on PaddleCloud #7968

Open upupbo opened 1 year ago

upupbo commented 1 year ago

Search before asking

Bug Component

No response

Describe the Bug

PaddleDetection 2.5, Paddle 2.3.2

Error message:

[03/20 08:35:03] ppdet.engine INFO: Epoch: [33] [ 80/694] learning_rate: 0.003458 loss: 1.395932 loss_cls: 0.707818 loss_iou: 0.139823 loss_dfl: 0.689462 loss_l1: 0.248278 eta: 49 days, 21:30:41 batch_cost: 1.4332 data_cost: 1.0170 ips: 50.2374 images/s
INFO 2023-03-20 08:36:49,535 launch_utils.py:343] terminate all the procs
INFO 2023-03-20 08:36:49,535 launch_utils.py:343] terminate all the procs
ERROR 2023-03-20 08:36:49,536 launch_utils.py:642] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
ERROR 2023-03-20 08:36:49,536 launch_utils.py:642] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2023-03-20 08:36:53,540 launch_utils.py:343] terminate all the procs
INFO 2023-03-20 08:36:53,540 launch_utils.py:343] terminate all the procs
INFO 2023-03-20 08:36:53,541 launch.py:402] Local processes completed.
INFO 2023-03-20 08:36:53,541 launch.py:402] Local processes completed.
[03/20 08:35:14] ppdet.engine INFO: Epoch: [33] [ 90/694] learning_rate: 0.003456 loss: 1.405060 loss_cls: 0.713611 loss_iou: 0.136420 loss_dfl: 0.685583 loss_l1: 0.254494 eta: 49 days, 21:26:37 batch_cost: 1.0965 data_cost: 0.6532 ips: 65.6656 images/s
[03/20 08:35:25] ppdet.engine INFO: Epoch: [33] [100/694] learning_rate: 0.003455 loss: 1.362793 loss_cls: 0.701447 loss_iou: 0.128416 loss_dfl: 0.681726 loss_l1: 0.237298 eta: 49 days, 21:21:07 batch_cost: 1.0383 data_cost: 0.5575 ips: 69.3449 images/s
[03/20 08:35:35] ppdet.engine INFO: Epoch: [33] [110/694] learning_rate: 0.003453 loss: 1.440801 loss_cls: 0.733980 loss_iou: 0.142318 loss_dfl: 0.692908 loss_l1: 0.267232 eta: 49 days, 21:12:15 batch_cost: 0.9041 data_cost: 0.4652 ips: 79.6366 images/s
[03/20 08:35:52] ppdet.engine INFO: Epoch: [33] [120/694] learning_rate: 0.003452 loss: 1.365357 loss_cls: 0.698634 loss_iou: 0.129954 loss_dfl: 0.699078 loss_l1: 0.247817 eta: 49 days, 21:22:19 batch_cost: 1.6618 data_cost: 1.1347 ips: 43.3277 images/s
[03/20 08:36:02] ppdet.engine INFO: Epoch: [33] [130/694] learning_rate: 0.003451 loss: 1.401229 loss_cls: 0.714009 loss_iou: 0.138751 loss_dfl: 0.700308 loss_l1: 0.267538 eta: 49 days, 21:13:48 batch_cost: 0.9173 data_cost: 0.4892 ips: 78.4870 images/s
[03/20 08:36:12] ppdet.engine INFO: Epoch: [33] [140/694] learning_rate: 0.003449 loss: 1.397750 loss_cls: 0.713928 loss_iou: 0.134276 loss_dfl: 0.687129 loss_l1: 0.262884 eta: 49 days, 21:06:56 batch_cost: 0.9833 data_cost: 0.5175 ips: 73.2260 images/s
[03/20 08:36:25] ppdet.engine INFO: Epoch: [33] [150/694] learning_rate: 0.003448 loss: 1.415287 loss_cls: 0.716078 loss_iou: 0.139067 loss_dfl: 0.693288 loss_l1: 0.263905 eta: 49 days, 21:05:37 batch_cost: 1.2060 data_cost: 0.8072 ips: 59.6991 images/s
Traceback (most recent call last):
  File "tools/train.py", line 172, in <module>
    main()
  File "tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 132, in run
    trainer.train(FLAGS.eval)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/engine/trainer.py", line 485, in train
    outputs = model(data)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/architectures/yolo.py", line 124, in get_loss
    return self._forward()
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/architectures/yolo.py", line 88, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/heads/ppyoloe_head.py", line 219, in forward
    return self.forward_train(feats, targets)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/heads/ppyoloe_head.py", line 164, in forward_train
    ], targets)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/heads/ppyoloe_head.py", line 333, in get_loss
    bg_index=self.num_classes)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "", line 2, in forward
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/base.py", line 354, in _decorate_function
    return func(*args, **kwargs)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/assigners/task_aligned_assigner.py", line 115, in forward
    alignment_metrics * is_in_gts, self.topk, topk_mask=pad_gt_mask)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/assigners/utils.py", line 105, in gather_topk_anchors
    return is_in_topk * topk_mask
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/math_op_patch.py", line 299, in __impl__
    return math_op(self, other_var, 'axis', axis)
OSError: (External) CUDA error(700), an illegal memory access was encountered. [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:258) [operator < elementwise_mul > error]
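A note for anyone reproducing this: CUDA kernels are launched asynchronously, so the operator named in the error (`elementwise_mul`) is not necessarily the one that performed the illegal access. A common first step is to rerun with `CUDA_LAUNCH_BLOCKING=1` so that the error surfaces at the kernel that actually failed. The sketch below only builds the environment and command line; the config path is a hypothetical placeholder, and it assumes a single-process run of `tools/train.py` with the `-c` and `--eval` flags rather than the distributed launcher seen in the log.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Make every kernel launch block until it finishes, so the Python
    # traceback points at the operator that really triggered the fault.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def build_train_command(config_path):
    """Build the training command; config_path is a hypothetical placeholder."""
    return [sys.executable, "tools/train.py", "-c", config_path, "--eval"]


def run_debug_training(config_path):
    """Run training once with blocking launches and return the exit code."""
    result = subprocess.run(
        build_train_command(config_path),
        env=build_debug_env(),
        check=False,  # a non-zero exit is expected when the crash reproduces
    )
    return result.returncode
```

Training is noticeably slower in this mode, so it is best combined with resuming from the checkpoint closest to the failing epoch.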

Environment

PaddleDetection 2.5, Paddle 2.3.2, CUDA 11.2

Bug description confirmation

Are you willing to submit a PR?

upupbo commented 1 year ago

With the same network, I reduced the batch size, and training still stopped at epoch 31.
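Since a smaller batch size does not avoid the crash, memory pressure alone seems unlikely to be the cause. One hypothesis worth ruling out (an assumption, not something confirmed in this thread) is a malformed label, for example a class id outside the configured categories or a degenerate box, reaching the assigner. The sketch below scans an annotation dictionary for such entries. It assumes COCO-style annotations with `[x, y, width, height]` boxes; the function name and the checks are illustrative and not part of PaddleDetection.

```python
def find_suspect_annotations(coco):
    """Return (annotation_id, reasons) pairs for annotations that look invalid.

    `coco` is a dict in COCO layout with "images", "annotations" and
    "categories" lists. This is an assumed format for illustration.
    """
    valid_ids = {cat["id"] for cat in coco.get("categories", [])}
    sizes = {
        img["id"]: (img["width"], img["height"])
        for img in coco.get("images", [])
    }
    suspects = []
    for ann in coco.get("annotations", []):
        x, y, w, h = ann["bbox"]
        reasons = []
        if ann["category_id"] not in valid_ids:
            reasons.append("unknown category_id")
        if w <= 0 or h <= 0:
            reasons.append("non-positive box size")
        size = sizes.get(ann["image_id"])
        if size is None:
            reasons.append("unknown image_id")
        elif x < 0 or y < 0 or x + w > size[0] or y + h > size[1]:
            reasons.append("box outside image")
        if reasons:
            suspects.append((ann["id"], reasons))
    return suspects
```

To use it, load the training annotation file with `json.load` and pass the resulting dictionary in. An empty result does not prove the data is clean; it only rules out these particular problems.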

upupbo commented 1 year ago

The run above used the CosineDecay learning-rate schedule. I found that PiecewiseDecay also fails, again at epoch 33.

[03/20 08:36:25] ppdet.engine INFO: Epoch: [33] [150/694] learning_rate: 0.003448 loss: 1.415287 loss_cls: 0.716078 loss_iou: 0.139067 loss_dfl: 0.693288 loss_l1: 0.263905 eta: 49 days, 21:05:37 batch_cost: 1.2060 data_cost: 0.8072 ips: 59.6991 images/s
Traceback (most recent call last):
  File "tools/train.py", line 172, in <module>
    main()
  File "tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 132, in run
    trainer.train(FLAGS.eval)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/engine/trainer.py", line 485, in train
    outputs = model(data)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/architectures/yolo.py", line 124, in get_loss
    return self._forward()
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/architectures/yolo.py", line 88, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/heads/ppyoloe_head.py", line 219, in forward
    return self.forward_train(feats, targets)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/heads/ppyoloe_head.py", line 164, in forward_train
    ], targets)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/heads/ppyoloe_head.py", line 333, in get_loss
    bg_index=self.num_classes)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "", line 2, in forward
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/base.py", line 354, in _decorate_function
    return func(*args, **kwargs)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/assigners/task_aligned_assigner.py", line 115, in forward
    alignment_metrics * is_in_gts, self.topk, topk_mask=pad_gt_mask)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/assigners/utils.py", line 105, in gather_topk_anchors
    return is_in_topk * topk_mask
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/math_op_patch.py", line 299, in __impl__
    return math_op(self, other_var, 'axis', axis)
OSError: (External) CUDA error(700), an illegal memory access was encountered. [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:258) [operator < elementwise_mul > error]

upupbo commented 1 year ago

I tried resuming from the existing weights, but it still fails.

[03/19 22:31:04] ppdet.engine INFO: Epoch: [33] [410/694] learning_rate: 0.003717 loss: 1.403906 loss_cls: 0.710496 loss_iou: 0.138635 loss_dfl: 0.691216 loss_l1: 0.242012 eta: 54 days, 23:46:19 batch_cost: 1.2843 data_cost: 0.7744 ips: 56.0627 images/s
Traceback (most recent call last):
  File "tools/train.py", line 172, in <module>
    main()
  File "tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 132, in run
    trainer.train(FLAGS.eval)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/engine/trainer.py", line 485, in train
    outputs = model(data)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/architectures/yolo.py", line 124, in get_loss
    return self._forward()
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/architectures/yolo.py", line 88, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/heads/ppyoloe_head.py", line 219, in forward
    return self.forward_train(feats, targets)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/heads/ppyoloe_head.py", line 164, in forward_train
    ], targets)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/heads/ppyoloe_head.py", line 333, in get_loss
    bg_index=self.num_classes)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "", line 2, in forward
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/base.py", line 354, in _decorate_function
    return func(*args, **kwargs)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/assigners/task_aligned_assigner.py", line 115, in forward
    alignment_metrics * is_in_gts, self.topk, topk_mask=pad_gt_mask)
  File "/root/paddlejob/workspace/env_run/PaddleDetection_cloud/ppdet/modeling/assigners/utils.py", line 105, in gather_topk_anchors
    return is_in_topk * topk_mask
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/math_op_patch.py", line 299, in __impl__
    return math_op(self, other_var, 'axis', axis)
OSError: (External) CUDA error(700), an illegal memory access was encountered. [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:258) [operator < elementwise_mul > error]
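To compare where each run died, the last `ppdet.engine` line before each traceback can be parsed into epoch, iteration and metrics. The sketch below is a small helper written for this thread (not a PaddleDetection utility); the regular expressions are assumptions based only on the log lines quoted above.

```python
import re

# Matches e.g. "Epoch: [33] [ 80/694]" including the padded iteration index.
STEP_RE = re.compile(r"Epoch: \[(\d+)\] \[\s*(\d+)/(\d+)\]")
# Matches "name: 1.2345" pairs; "eta: 49 days, ..." is skipped (no decimal).
METRIC_RE = re.compile(r"(\w+): ([-+]?\d+\.\d+)")


def parse_step(line):
    """Parse one training log line; return None if it is not a step line."""
    match = STEP_RE.search(line)
    if match is None:
        return None
    epoch, iteration, total = (int(g) for g in match.groups())
    metrics = {name: float(value) for name, value in METRIC_RE.findall(line)}
    return {"epoch": epoch, "iter": iteration, "total": total, **metrics}


first_run = parse_step(
    "[03/20 08:36:25] ppdet.engine INFO: Epoch: [33] [150/694] "
    "learning_rate: 0.003448 loss: 1.415287 loss_cls: 0.716078 "
    "eta: 49 days, 21:05:37 batch_cost: 1.2060 ips: 59.6991 images/s"
)
resumed_run = parse_step(
    "[03/19 22:31:04] ppdet.engine INFO: Epoch: [33] [410/694] "
    "learning_rate: 0.003717 loss: 1.403906 loss_cls: 0.710496 "
    "eta: 54 days, 23:46:19 batch_cost: 1.2843 ips: 56.0627 images/s"
)
print(first_run["epoch"], first_run["iter"])      # 33 150
print(resumed_run["epoch"], resumed_run["iter"])  # 33 410
```

Both runs stop in epoch 33 but at different iterations (150 and 410 of 694), and the loss values just before the crash are ordinary. That suggests the failure is not tied to a fixed step count or to a diverging loss, though the sample order within an epoch may differ between runs.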