Gaojun211 opened this issue 2 years ago
Set allow_empty: True; this parameter defaults to False. For reference:
https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.2/configs/datasets/coco_detection.yml
Have you set this on your side?
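For illustration, a minimal sketch of the dataset keys in question, assuming the COCODataSet fields referenced by the linked coco_detection.yml; the paths are placeholders and the real file nests these under a !COCODataSet entry:

```python
# Sketch of a TrainDataset block with empty (background-only) images enabled.
# Field names follow the linked coco_detection.yml; paths are placeholders.
import yaml

train_dataset = {
    "TrainDataset": {
        "dataset_dir": "dataset/coco",                        # placeholder
        "image_dir": "train2017",                             # placeholder
        "anno_path": "annotations/instances_train2017.json",  # placeholder
        "allow_empty": True,   # include images without annotations (defaults to False)
        "empty_ratio": 0.1,    # limit the share of empty images that gets sampled
    }
}
print(yaml.safe_dump(train_dataset, sort_keys=False))
```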
1. It is already set in the dataset yml:
allow_empty: True
empty_ratio: 0.1
2. The learning rate follows the FAQ: 0.00125 × 2 GPUs × batch size 4 = 0.01. I also referred to https://github.com/PaddlePaddle/PaddleDetection/issues/3326#issuecomment-856654034 and lowered it, but the loss still becomes NaN and training crashes.
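As a quick sanity check on that arithmetic, a small sketch of the linear-scaling rule described in the FAQ, using the numbers quoted above (0.00125 per image per step is the reference value from the comment):

```python
# Linear scaling of the learning rate with total batch size (FAQ rule of thumb).
base_lr_per_image = 0.00125   # reference value from the FAQ, per image per step
num_gpus = 2
batch_size_per_gpu = 4

learning_rate = base_lr_per_image * num_gpus * batch_size_per_gpu
print(learning_rate)  # -> 0.01
```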
For the NaN issue, we suggest loading COCO pretrained weights (the default is ImageNet pretrained) and tuning the hyperparameters. For the FPN crash, does the error still occur when you train without unlabeled data, i.e. with empty_ratio set to 0? And do you have data that can reproduce it?
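If it helps, a hedged sketch of that suggestion: the PaddleDetection configs expose a pretrain_weights field, so switching from the default ImageNet backbone weights to COCO pretrained detector weights is a one-key override (the .pdparams path below is a placeholder, not a real URL):

```python
# Hypothetical override: point pretrain_weights at COCO-pretrained detector
# weights instead of the default ImageNet backbone weights. Path is a placeholder.
import yaml

override = {"pretrain_weights": "path/to/faster_rcnn_r50_coco.pdparams"}
print(yaml.safe_dump(override))
```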
Training is normal without unlabeled data. After adding unlabeled images to an open-source dataset, the same problems appear: the loss becomes NaN and multi-GPU training crashes. Dataset link: https://aistudio.baidu.com/aistudio/datasetdetail/103628/0
On my side, training Faster R-CNN on the develop branch converges normally at bs=1 and bs=2, with empty_ratio also set to 0.1. Please try the develop branch; a problem with training on unlabeled data at bs=2 was fixed earlier in https://github.com/PaddlePaddle/PaddleDetection/pull/3890 One more thing to note: your dataset has 4 classes, so num_class needs to be changed to 4 in the dataset config file.
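A hedged sketch of that last point (the comment above writes num_class; in the release/2.2 dataset configs the field appears as num_classes, so verify the spelling your branch expects):

```python
# Hypothetical override for a 4-class dataset; check the exact field name
# (num_classes vs num_class) against the dataset yml on your branch.
import yaml

print(yaml.safe_dump({"num_classes": 4}))
```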
Same here: training converges normally at first and only goes NaN after another 2 epochs. I traced it to https://github.com/PaddlePaddle/PaddleDetection/blob/5b949596ea7603cd79e3fc9067766bbc79a3e93d/ppdet/modeling/proposal_generator/target.py#L130-L131 When the error occurs, the positive-sample boxes produced there are empty; my guess is that the whole batch consists of background-only images, so no positive samples can be matched and the subsequent data access goes out of bounds (a toy sketch of this failure mode is appended after the traceback below). After git checkout develop the problem still exists:
bs=2:
[10/11 05:17:05] ppdet.engine INFO: Epoch: [0] [ 0/220] learning_rate: 0.000500 loss_rpn_cls: 0.694932 loss_rpn_reg: 0.027445 loss_bbox_cls: 1.705214 loss_bbox_reg: 0.000087 loss: 2.427677 eta: 4:23:22 batch_cost: 5.9857 data_cost: 0.0006 ips: 0.3341 images/s
[10/11 05:17:20] ppdet.engine INFO: Epoch: [0] [ 20/220] learning_rate: 0.000590 loss_rpn_cls: 0.672942 loss_rpn_reg: 0.009165 loss_bbox_cls: 0.154369 loss_bbox_reg: 0.000958 loss: 0.845150 eta: 0:45:27 batch_cost: 0.7937 data_cost: 0.0002 ips: 2.5200 images/s
[10/11 05:17:38] ppdet.engine INFO: Epoch: [0] [ 40/220] learning_rate: 0.000680 loss_rpn_cls: 0.146619 loss_rpn_reg: 0.009198 loss_bbox_cls: 0.081120 loss_bbox_reg: 0.004745 loss: 0.345999 eta: 0:41:25 batch_cost: 0.8664 data_cost: 0.0002 ips: 2.3084 images/s
[10/11 05:17:54] ppdet.engine INFO: Epoch: [0] [ 60/220] learning_rate: 0.000770 loss_rpn_cls: 0.067352 loss_rpn_reg: 0.008538 loss_bbox_cls: 0.150343 loss_bbox_reg: 0.097611 loss: 0.344274 eta: 0:39:13 batch_cost: 0.8230 data_cost: 0.0002 ips: 2.4302 images/s
[10/11 05:18:11] ppdet.engine INFO: Epoch: [0] [ 80/220] learning_rate: 0.000860 loss_rpn_cls: 0.037713 loss_rpn_reg: 0.005362 loss_bbox_cls: 0.122740 loss_bbox_reg: 0.107580 loss: 0.275064 eta: 0:38:17 batch_cost: 0.8527 data_cost: 0.0002 ips: 2.3455 images/s
INFO 2021-10-11 05:18:33,687 launch_utils.py:327] terminate all the procs
ERROR 2021-10-11 05:18:33,688 launch_utils.py:584] ABORT!!! Out of all 2 trainers, the trainer process with rank=[1] was aborted. Please check its log.
INFO 2021-10-11 05:18:36,691 launch_utils.py:327] terminate all the procs
With bs=4 the error shows up even sooner:
[10/11 05:13:47] ppdet.engine INFO: Epoch: [0] [ 0/110] learning_rate: 0.000500 loss_rpn_cls: 0.699712 loss_rpn_reg: 0.010741 loss_bbox_cls: 1.648625 loss_bbox_reg: 0.012263 loss: 2.371341 eta: 1:07:41 batch_cost: 3.0769 data_cost: 0.0007 ips: 1.3000 images/s
Traceback (most recent call last):
File "tools/train.py", line 140, in <module>
main()
File "tools/train.py", line 136, in main
run(FLAGS, cfg)
File "tools/train.py", line 109, in run
trainer.train(FLAGS.eval)
File "/data/gaojun/nolabel/PaddleDetection/ppdet/engine/trainer.py", line 369, in train
outputs = model(data)
File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/parallel.py", line 578, in forward
outputs = self._layers(*inputs, **kwargs)
File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 54, in forward
out = self.get_loss()
File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/architectures/faster_rcnn.py", line 95, in get_loss
rpn_loss, bbox_loss = self._forward()
File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/architectures/faster_rcnn.py", line 78, in _forward
self.inputs)
File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/heads/bbox_head.py", line 237, in forward
rois, rois_num, targets = self.bbox_assigner(rois, rois_num, inputs)
File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/proposal_generator/target_layer.py", line 169, in __call__
self.cascade_iou[stage], self.assign_on_cpu)
File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/proposal_generator/target.py", line 202, in generate_proposal_target
gt_class = paddle.squeeze(gt_classes[i], axis=-1)
File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/tensor/manipulation.py", line 613, in squeeze
return layers.squeeze(x, axis, name)
File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/layers/nn.py", line 6266, in squeeze
out, _ = core.ops.squeeze2(input, 'axes', axes)
OSError: (External) Cuda error(700), an illegal memory access was encountered.
[Advise: Please search for the error code(700) on website( https://docs.nvidia.com/cuda/archive/9.0/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038 ) to get Nvidia's official solution about CUDA Error.] (at /paddle/paddle/fluid/platform/gpu_info.cc:394)
[operator < squeeze2 > error]
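To make the guess above concrete, here is a minimal numpy sketch (not the actual ppdet target-assignment code) of how an all-background image yields an empty positive-sample set, and the kind of guard that avoids indexing with it:

```python
# Toy illustration (numpy, not ppdet): when an image has no ground-truth boxes,
# no positives can be matched, and an unguarded gather on the empty index set is
# exactly the kind of access that can turn into an illegal memory read on GPU.
import numpy as np

def sample_positive_indices(gt_boxes, proposals):
    """Return proposal indices matched to ground truth (IoU matching elided)."""
    if gt_boxes.shape[0] == 0:
        # Background-only image: return an explicitly empty index array instead
        # of letting later code index with garbage values.
        return np.zeros((0,), dtype=np.int64)
    # ... a real assigner would compute IoU(proposals, gt_boxes) and threshold ...
    return np.arange(min(len(proposals), len(gt_boxes)), dtype=np.int64)

proposals = np.random.rand(100, 4)
for gt_boxes in (np.random.rand(3, 4), np.zeros((0, 4))):  # labeled, then empty
    pos_idx = sample_positive_indices(gt_boxes, proposals)
    if pos_idx.size == 0:
        print("all-background image: skip the positive-sample part of the loss")
    else:
        print("matched positives:", proposals[pos_idx].shape)
```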
Could you try a larger bs on your side and see whether it errors out?
I have reproduced the NaN problem on my side; bs=1 works fine.
Yes, before the fix bs=1 could also run. https://github.com/PaddlePaddle/PaddleDetection/issues/3790#issuecomment-887224793
@jerrywgz Has the problem been resolved on your side?
The PaddleDetection team appreciates any suggestion or problem you report~
Describe the problem
With bs=2, single-GPU training of faster_rcnn_r50_1x_coco.yml and faster_rcnn_r50_fpn_1x_coco.yml ends up with NaN loss. With bs=4, two-GPU training of faster_rcnn_r50_1x_coco.yml also goes NaN; lowering the lr does not help, and adding FPN makes it crash.
Reproduction
Error message
C++ Traceback (most recent call last):
0   paddle::memory::allocation::CUDADeviceContextAllocatorPool::~CUDADeviceContextAllocatorPool()
1   std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
2   std::_Sp_counted_ptr<paddle::memory::allocation::CUDADeviceContextAllocator*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
3   paddle::platform::build_nvidia_error_msg[abi:cxx11]
4   paddle::platform::proto::cudaerrorDesc::ByteSizeLong() const
5   paddle::platform::proto::AllMessageDesc::ByteSizeLong() const
6   paddle::platform::proto::MessageDesc::ByteSizeLong() const
7   google::protobuf::internal::WireFormat::ComputeUnknownFieldsSize(google::protobuf::UnknownFieldSet const&)
8   paddle::framework::SignalHandle(char const*, int)
9   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
Error Message Summary:
FatalError: `Segmentation fault` is detected by the operating system.
[TimeInfo: Aborted at 1633655040 (unix time) try "date -d @1633655040" if you are using GNU date ]
[SignalInfo: SIGSEGV (@0x2c0) received by PID 51756 (TID 0x7f82cb388740) from PID 704 ]