PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

[BUG] Loss becomes NaN and training errors out when using unlabeled data #4272

Open Gaojun211 opened 2 years ago

Gaojun211 commented 2 years ago

The PaddleDetection team appreciates any suggestions or problems you report.

Problem description

With bs=2, single-GPU training of both faster_rcnn_r50_1x_coco.yml and faster_rcnn_r50_fpn_1x_coco.yml ends with the loss becoming NaN. With bs=4, dual-GPU training of faster_rcnn_r50_1x_coco.yml also turns NaN, and lowering the LR does not help; adding FPN makes training error out.

Reproduction

  1. What command or script did you run?
python tools/train.py -c configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml
python -m paddle.distributed.launch --gpus 1,2 tools/train.py -c configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml
  2. Did you modify any code or configuration files? Do you understand what you changed? If so, please provide the changed parts.
  3. What dataset did you use? 39,506 normally annotated images plus 35,406 unlabeled images, with empty_ratio set to 0.1.
  4. Please provide the error message and related logs. bs=2, 4-GPU training of faster_rcnn_r50_1x_coco.yml; the LR was already lowered (LearningRate: base_lr: 0.0001).
    [10/08 00:43:11] ppdet.engine INFO: Epoch: [0] [  560/21947] learning_rate: 0.000060 loss_rpn_cls: 0.677536 loss_rpn_reg: 0.017564 loss_bbox_cls: 0.030547 loss_bbox_reg: 0.006676 loss: 0.740015 eta: 1 day, 10:42:49 batch_cost: 0.4942 data_cost: 0.0003 ips: 4.0471 images/s
    [10/08 00:43:21] ppdet.engine INFO: Epoch: [0] [  580/21947] learning_rate: 0.000062 loss_rpn_cls: 0.675153 loss_rpn_reg: 0.007527 loss_bbox_cls: 0.026712 loss_bbox_reg: 0.006683 loss: 0.716233 eta: 1 day, 10:44:35 batch_cost: 0.4883 data_cost: 0.0002 ips: 4.0960 images/s
    [10/08 00:43:30] ppdet.engine INFO: Epoch: [0] [  600/21947] learning_rate: 0.000064 loss_rpn_cls: 0.673870 loss_rpn_reg: 0.013568 loss_bbox_cls: 0.029132 loss_bbox_reg: 0.004522 loss: 0.727936 eta: 1 day, 10:44:04 batch_cost: 0.4734 data_cost: 0.0002 ips: 4.2245 images/s
    [10/08 00:43:40] ppdet.engine INFO: Epoch: [0] [  620/21947] learning_rate: 0.000066 loss_rpn_cls: 0.673325 loss_rpn_reg: 0.010504 loss_bbox_cls: 0.028857 loss_bbox_reg: 0.003622 loss: 0.716996 eta: 1 day, 10:44:18 batch_cost: 0.4787 data_cost: 0.0002 ips: 4.1784 images/s
    [10/08 00:43:49] ppdet.engine INFO: Epoch: [0] [  640/21947] learning_rate: 0.000068 loss_rpn_cls: 0.672138 loss_rpn_reg: 0.014652 loss_bbox_cls: 0.027188 loss_bbox_reg: 0.005363 loss: 0.743213 eta: 1 day, 10:44:30 batch_cost: 0.4787 data_cost: 0.0002 ips: 4.1783 images/s
    [10/08 00:43:59] ppdet.engine INFO: Epoch: [0] [  660/21947] learning_rate: 0.000069 loss_rpn_cls: 0.670443 loss_rpn_reg: 0.013540 loss_bbox_cls: 0.027636 loss_bbox_reg: 0.005605 loss: 0.720972 eta: 1 day, 10:44:59 batch_cost: 0.4809 data_cost: 0.0002 ips: 4.1593 images/s
    [10/08 00:44:09] ppdet.engine INFO: Epoch: [0] [  680/21947] learning_rate: 0.000071 loss_rpn_cls: 0.668971 loss_rpn_reg: 0.025812 loss_bbox_cls: 0.034054 loss_bbox_reg: 0.008743 loss: 0.744311 eta: 1 day, 10:44:34 batch_cost: 0.4742 data_cost: 0.0002 ips: 4.2173 images/s
    [10/08 00:44:18] ppdet.engine INFO: Epoch: [0] [  700/21947] learning_rate: 0.000073 loss_rpn_cls: 0.666791 loss_rpn_reg: 0.015758 loss_bbox_cls: 0.024558 loss_bbox_reg: 0.002665 loss: 0.716927 eta: 1 day, 10:44:53 batch_cost: 0.4800 data_cost: 0.0002 ips: 4.1668 images/s
    [10/08 00:44:22] ppdet.engine INFO: Epoch: [0] [  720/21947] learning_rate: 0.000075 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 10:10:39 batch_cost: 0.1956 data_cost: 0.0002 ips: 10.2237 images/s
    [10/08 00:44:25] ppdet.engine INFO: Epoch: [0] [  740/21947] learning_rate: 0.000077 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 9:33:45 batch_cost: 0.1574 data_cost: 0.0002 ips: 12.7053 images/s
    [10/08 00:44:28] ppdet.engine INFO: Epoch: [0] [  760/21947] learning_rate: 0.000078 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 8:58:32 batch_cost: 0.1552 data_cost: 0.0002 ips: 12.8863 images/s
    [10/08 00:44:31] ppdet.engine INFO: Epoch: [0] [  780/21947] learning_rate: 0.000080 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 8:24:24 batch_cost: 0.1489 data_cost: 0.0002 ips: 13.4343 images/s
    [10/08 00:44:35] ppdet.engine INFO: Epoch: [0] [  800/21947] learning_rate: 0.000082 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 7:53:59 batch_cost: 0.1673 data_cost: 0.0002 ips: 11.9563 images/s
    [10/08 00:44:38] ppdet.engine INFO: Epoch: [0] [  820/21947] learning_rate: 0.000084 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 7:24:07 batch_cost: 0.1585 data_cost: 0.0002 ips: 12.6156 images/s
    [10/08 00:44:41] ppdet.engine INFO: Epoch: [0] [  840/21947] learning_rate: 0.000086 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 6:55:06 batch_cost: 0.1530 data_cost: 0.0002 ips: 13.0689 images/s
    [10/08 00:44:44] ppdet.engine INFO: Epoch: [0] [  860/21947] learning_rate: 0.000087 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 6:27:26 batch_cost: 0.1533 data_cost: 0.0002 ips: 13.0505 images/s
    [10/08 00:44:47] ppdet.engine INFO: Epoch: [0] [  880/21947] learning_rate: 0.000089 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 6:02:04 batch_cost: 0.1637 data_cost: 0.0002 ips: 12.2178 images/s
    [10/08 00:44:51] ppdet.engine INFO: Epoch: [0] [  900/21947] learning_rate: 0.000091 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 5:39:07 batch_cost: 0.1769 data_cost: 0.0002 ips: 11.3047 images/s
    [10/08 00:44:54] ppdet.engine INFO: Epoch: [0] [  920/21947] learning_rate: 0.000093 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 5:14:47 batch_cost: 0.1520 data_cost: 0.0002 ips: 13.1615 images/s
    [10/08 00:44:57] ppdet.engine INFO: Epoch: [0] [  940/21947] learning_rate: 0.000095 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 4:51:33 batch_cost: 0.1527 data_cost: 0.0002 ips: 13.1014 images/s
    [10/08 00:45:01] ppdet.engine INFO: Epoch: [0] [  960/21947] learning_rate: 0.000096 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 4:30:18 batch_cost: 0.1640 data_cost: 0.0002 ips: 12.1948 images/s
    [10/08 00:45:04] ppdet.engine INFO: Epoch: [0] [  980/21947] learning_rate: 0.000098 loss_rpn_cls: nan loss_rpn_reg: nan loss_bbox_cls: nan loss_bbox_reg: nan loss: nan eta: 1 day, 4:08:50 batch_cost: 0.1517 data_cost: 0.0002 ips: 13.1846 images/s
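To pinpoint exactly when the loss diverges in logs like the above, a small filter helps. A minimal sketch: the regex assumes the `ppdet.engine` log format shown here, and `first_nan_iter` is a hypothetical helper, not part of PaddleDetection:

```python
import re

def first_nan_iter(log_lines):
    """Return the iteration index of the first log line whose total
    loss is reported as NaN, or None if every loss is finite."""
    # "[  720/21947]" gives the iteration; "loss: <value>" is the total loss.
    pattern = re.compile(r"\[\s*(\d+)/\d+\].*?loss: (\S+)")
    for line in log_lines:
        m = pattern.search(line)
        if m and m.group(2) == "nan":
            return int(m.group(1))
    return None

lines = [
    "[10/08 00:44:18] ppdet.engine INFO: Epoch: [0] [  700/21947] loss: 0.716927 eta: ...",
    "[10/08 00:44:22] ppdet.engine INFO: Epoch: [0] [  720/21947] loss: nan eta: ...",
]
print(first_nan_iter(lines))  # → 720
```

Running this over the full log above would report iteration 720 as the first NaN, matching where batch_cost also drops sharply (a common sign that the loss computation is short-circuiting).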

Error message

    
    [10/08 01:03:01] ppdet.engine INFO: Epoch: [0] [   0/5487] learning_rate: 0.000000 loss_rpn_cls: 0.690131 loss_rpn_reg: 0.068640 loss_bbox_cls: 1.132736 loss_bbox_reg: 0.000039 loss: 1.891546 eta: 1 day, 17:38:11 batch_cost: 2.2765 data_cost: 0.0006 ips: 1.7571 images/s
    [10/08 01:03:14] ppdet.engine INFO: Epoch: [0] [  20/5487] learning_rate: 0.000000 loss_rpn_cls: 0.688798 loss_rpn_reg: 0.021382 loss_bbox_cls: 1.146138 loss_bbox_reg: 0.000087 loss: 1.857482 eta: 13:00:34 batch_cost: 0.6333 data_cost: 0.0027 ips: 6.3165 images/s
    [10/08 01:03:27] ppdet.engine INFO: Epoch: [0] [  40/5487] learning_rate: 0.000000 loss_rpn_cls: 0.688631 loss_rpn_reg: 0.020095 loss_bbox_cls: 1.139235 loss_bbox_reg: 0.000073 loss: 1.851724 eta: 12:15:30 batch_cost: 0.6277 data_cost: 0.0003 ips: 6.3723 images/s
    [10/08 01:03:39] ppdet.engine INFO: Epoch: [0] [  60/5487] learning_rate: 0.000001 loss_rpn_cls: 0.688777 loss_rpn_reg: 0.014815 loss_bbox_cls: 1.119644 loss_bbox_reg: 0.000068 loss: 1.831939 eta: 11:56:34 batch_cost: 0.6186 data_cost: 0.0003 ips: 6.4661 images/s
    [10/08 01:03:52] ppdet.engine INFO: Epoch: [0] [  80/5487] learning_rate: 0.000001 loss_rpn_cls: 0.688067 loss_rpn_reg: 0.013072 loss_bbox_cls: 1.094978 loss_bbox_reg: 0.000858 loss: 1.803280 eta: 11:51:27 batch_cost: 0.6355 data_cost: 0.0003 ips: 6.2945 images/s
    Traceback (most recent call last):
    File "tools/train.py", line 138, in <module>
    main()
    File "tools/train.py", line 134, in main
    run(FLAGS, cfg)
    File "tools/train.py", line 109, in run
    trainer.train(FLAGS.eval)
    File "/data/gaojun/nolabel/PaddleDetection/ppdet/engine/trainer.py", line 358, in train
    outputs = model(data)
    File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
    outputs = self.forward(*inputs, **kwargs)
    File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/parallel.py", line 578, in forward
    outputs = self._layers(*inputs, **kwargs)
    File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
    outputs = self.forward(*inputs, **kwargs)
    File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 26, in forward
    out = self.get_loss()
    File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/architectures/faster_rcnn.py", line 95, in get_loss
    rpn_loss, bbox_loss = self._forward()
    File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/architectures/faster_rcnn.py", line 78, in _forward
    self.inputs)
    File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
    outputs = self.forward(*inputs, **kwargs)
    File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/heads/bbox_head.py", line 237, in forward
    rois, rois_num, targets = self.bbox_assigner(rois, rois_num, inputs)
    File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/proposal_generator/target_layer.py", line 152, in __call__
    self.cascade_iou[stage])
    File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/proposal_generator/target.py", line 194, in generate_proposal_target
    gt_class = paddle.squeeze(gt_classes[i], axis=-1)
    File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/tensor/manipulation.py", line 613, in squeeze
    return layers.squeeze(x, axis, name)
    File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/layers/nn.py", line 6266, in squeeze
    out, _ = core.ops.squeeze2(input, 'axes', axes)
    OSError: (External)  Cuda error(700), an illegal memory access was encountered.
    [Advise: Please search for the error code(700) on website( https://docs.nvidia.com/cuda/archive/9.0/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038 ) to get Nvidia's official solution about CUDA Error.] (at /paddle/paddle/fluid/platform/gpu_info.cc:394)
    [operator < squeeze2 > error]

C++ Traceback (most recent call last):

0  paddle::memory::allocation::CUDADeviceContextAllocatorPool::~CUDADeviceContextAllocatorPool()
1  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
2  std::_Sp_counted_ptr<paddle::memory::allocation::CUDADeviceContextAllocator*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
3  paddle::platform::build_nvidia_error_msg[abi:cxx11]
4  paddle::platform::proto::cudaerrorDesc::ByteSizeLong() const
5  paddle::platform::proto::AllMessageDesc::ByteSizeLong() const
6  paddle::platform::proto::MessageDesc::ByteSizeLong() const
7  google::protobuf::internal::WireFormat::ComputeUnknownFieldsSize(google::protobuf::UnknownFieldSet const&)
8  paddle::framework::SignalHandle(char const*, int)
9  paddle::platform::GetCurrentTraceBackString[abi:cxx11]()


Error Message Summary:

FatalError: Segmentation fault is detected by the operating system. [TimeInfo: Aborted at 1633655040 (unix time) try "date -d @1633655040" if you are using GNU date ] [SignalInfo: SIGSEGV (@0x2c0) received by PID 51756 (TID 0x7f82cb388740) from PID 704 ]


## Environment
1. Please provide the version of Paddle and PaddleDetection you use:
paddlepaddle-gpu==2.1.3, PaddleDetection release/2.2
2. If you are also using other products alongside PaddleDetection, such as PaddleServing or PaddleInference, please provide their version numbers:
3. Please provide your operating system information:
Linux version 4.15.0-144-generic (buildd@lgw01-amd64-031) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04))
4. Please provide the version of Python you used:
Python 3.6.12
5. Please provide the version of CUDA/cuDNN you used:
CUDA 11.2, cuDNN 8.0
qingqing01 commented 2 years ago
  1. When adding unlabeled data, note that in the YAML config:

https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.2/configs/datasets/coco_detection.yml

you need to set allow_empty: True; this parameter defaults to False. See:

https://github.com/PaddlePaddle/PaddleDetection/blob/5b949596ea7603cd79e3fc9067766bbc79a3e93d/ppdet/data/source/coco.py#L51

Have you set this?

  2. For adjusting the LR, refer to the FAQ: https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.2/docs/tutorials/FAQ/FAQ%E7%AC%AC%E9%9B%B6%E6%9C%9F.md#faq%E7%AC%AC%E9%9B%B6%E6%9C%9F
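For reference, a sketch of what the dataset YAML might look like with the unlabeled-data options enabled. Paths and layout mimic the stock configs/datasets/coco_detection.yml; verify the field names against your local release/2.2 copy:

```yaml
TrainDataset:
  !COCODataSet
    image_dir: train2017
    anno_path: annotations/instances_train2017.json
    dataset_dir: dataset/coco
    data_fields: ['image', 'gt_bbox', 'gt_class', 'is_crowd']
    allow_empty: true   # include images without annotations (default: false)
    empty_ratio: 0.1    # cap on the share of empty (unlabeled) records sampled
```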
Gaojun211 commented 2 years ago

1. The dataset YAML already sets:

    allow_empty: True
    empty_ratio: 0.1

2. Following the FAQ, the LR would be 0.00125 × 2 GPUs × bs 4 = 0.01. I also referred to https://github.com/PaddlePaddle/PaddleDetection/issues/3326#issuecomment-856654034; even after lowering the LR, the loss still becomes NaN and the error still occurs.
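The linear-scaling rule used in that calculation can be spelled out as plain arithmetic. A sketch: 0.00125 per image is the anchor value quoted from the FAQ, and `scaled_lr` is just an illustrative helper, not a PaddleDetection API:

```python
def scaled_lr(lr_per_image, num_gpus, batch_size_per_gpu):
    """Linear-scaling rule: base LR grows proportionally with the
    total number of images processed per optimizer step."""
    return lr_per_image * num_gpus * batch_size_per_gpu

print(scaled_lr(0.00125, 2, 4))  # 2 GPUs × bs 4 → 0.01
```

Lowering the LR below this value, as done here, normally only trades convergence speed for stability, which is why a persistent NaN at a much smaller LR points at the data or the sampling logic rather than the schedule.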

qingqing01 commented 2 years ago

For the NaN issue, we suggest loading COCO pretrained weights (the default is ImageNet pretraining) and tuning the hyperparameters. For the FPN error: does it still occur without unlabeled data, i.e. with empty_ratio at 0? And is there data we can use to reproduce it?

Gaojun211 commented 2 years ago

Training without unlabeled data works fine. Testing with an open dataset plus unlabeled data, the same problems remain: the loss becomes NaN and multi-GPU training errors out. Dataset link: https://aistudio.baidu.com/aistudio/datasetdetail/103628/0

jerrywgz commented 2 years ago

On my side, training Faster R-CNN on the develop branch converges normally with bs=1 and bs=2, also with empty_ratio set to 0.1. Please try the develop branch; we previously fixed a problem with unlabeled-data training at bs=2: https://github.com/PaddlePaddle/PaddleDetection/pull/3890 One more thing to note: your dataset has 4 classes, so num_classes in the dataset config file needs to be changed to 4.
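As a sketch, the corresponding dataset-config change would look like this (num_classes is a top-level field in the stock dataset YAMLs; adjust to your file's layout):

```yaml
metric: COCO
num_classes: 4   # stock COCO configs use 80; this dataset has 4 classes
```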

Gaojun211 commented 2 years ago

It converged normally at first for my colleague too; the loss only became NaN after about 2 epochs. I traced it to https://github.com/PaddlePaddle/PaddleDetection/blob/5b949596ea7603cd79e3fc9067766bbc79a3e93d/ppdet/modeling/proposal_generator/target.py#L130-L131 When the error occurs, the sampled positive boxes come out empty. My guess is that the whole batch consists of background images, so no positive samples can be matched, which leads to an invalid data access. After git checkout develop the problem persists. bs=2:

[10/11 05:17:05] ppdet.engine INFO: Epoch: [0] [  0/220] learning_rate: 0.000500 loss_rpn_cls: 0.694932 loss_rpn_reg: 0.027445 loss_bbox_cls: 1.705214 loss_bbox_reg: 0.000087 loss: 2.427677 eta: 4:23:22 batch_cost: 5.9857 data_cost: 0.0006 ips: 0.3341 images/s
[10/11 05:17:20] ppdet.engine INFO: Epoch: [0] [ 20/220] learning_rate: 0.000590 loss_rpn_cls: 0.672942 loss_rpn_reg: 0.009165 loss_bbox_cls: 0.154369 loss_bbox_reg: 0.000958 loss: 0.845150 eta: 0:45:27 batch_cost: 0.7937 data_cost: 0.0002 ips: 2.5200 images/s
[10/11 05:17:38] ppdet.engine INFO: Epoch: [0] [ 40/220] learning_rate: 0.000680 loss_rpn_cls: 0.146619 loss_rpn_reg: 0.009198 loss_bbox_cls: 0.081120 loss_bbox_reg: 0.004745 loss: 0.345999 eta: 0:41:25 batch_cost: 0.8664 data_cost: 0.0002 ips: 2.3084 images/s
[10/11 05:17:54] ppdet.engine INFO: Epoch: [0] [ 60/220] learning_rate: 0.000770 loss_rpn_cls: 0.067352 loss_rpn_reg: 0.008538 loss_bbox_cls: 0.150343 loss_bbox_reg: 0.097611 loss: 0.344274 eta: 0:39:13 batch_cost: 0.8230 data_cost: 0.0002 ips: 2.4302 images/s
[10/11 05:18:11] ppdet.engine INFO: Epoch: [0] [ 80/220] learning_rate: 0.000860 loss_rpn_cls: 0.037713 loss_rpn_reg: 0.005362 loss_bbox_cls: 0.122740 loss_bbox_reg: 0.107580 loss: 0.275064 eta: 0:38:17 batch_cost: 0.8527 data_cost: 0.0002 ips: 2.3455 images/s
INFO 2021-10-11 05:18:33,687 launch_utils.py:327] terminate all the procs
ERROR 2021-10-11 05:18:33,688 launch_utils.py:584] ABORT!!! Out of all 2 trainers, the trainer process with rank=[1] was aborted. Please check its log.
INFO 2021-10-11 05:18:36,691 launch_utils.py:327] terminate all the procs

With bs=4 the error occurs even sooner:

[10/11 05:13:47] ppdet.engine INFO: Epoch: [0] [  0/110] learning_rate: 0.000500 loss_rpn_cls: 0.699712 loss_rpn_reg: 0.010741 loss_bbox_cls: 1.648625 loss_bbox_reg: 0.012263 loss: 2.371341 eta: 1:07:41 batch_cost: 3.0769 data_cost: 0.0007 ips: 1.3000 images/s
Traceback (most recent call last):
  File "tools/train.py", line 140, in <module>
    main()
  File "tools/train.py", line 136, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 109, in run
    trainer.train(FLAGS.eval)
  File "/data/gaojun/nolabel/PaddleDetection/ppdet/engine/trainer.py", line 369, in train
    outputs = model(data)
  File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/parallel.py", line 578, in forward
    outputs = self._layers(*inputs, **kwargs)
  File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 54, in forward
    out = self.get_loss()
  File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/architectures/faster_rcnn.py", line 95, in get_loss
    rpn_loss, bbox_loss = self._forward()
  File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/architectures/faster_rcnn.py", line 78, in _forward
    self.inputs)
  File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/heads/bbox_head.py", line 237, in forward
    rois, rois_num, targets = self.bbox_assigner(rois, rois_num, inputs)
  File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/proposal_generator/target_layer.py", line 169, in __call__
    self.cascade_iou[stage], self.assign_on_cpu)
  File "/data/gaojun/nolabel/PaddleDetection/ppdet/modeling/proposal_generator/target.py", line 202, in generate_proposal_target
    gt_class = paddle.squeeze(gt_classes[i], axis=-1)
  File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/tensor/manipulation.py", line 613, in squeeze
    return layers.squeeze(x, axis, name)
  File "/data/gaojun/.miniconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/layers/nn.py", line 6266, in squeeze
    out, _ = core.ops.squeeze2(input, 'axes', axes)
OSError: (External)  Cuda error(700), an illegal memory access was encountered.
  [Advise: Please search for the error code(700) on website( https://docs.nvidia.com/cuda/archive/9.0/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038 ) to get Nvidia's official solution about CUDA Error.] (at /paddle/paddle/fluid/platform/gpu_info.cc:394)
  [operator < squeeze2 > error]

Could you try a larger bs on your side and see whether it errors out?
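The all-background-batch hypothesis above can be illustrated in isolation. This is a toy numpy sketch, not PaddleDetection code; `sample_positive_inds` is a made-up stand-in for the positive-sampling step in target.py:

```python
import numpy as np

def sample_positive_inds(max_overlaps, pos_thresh=0.5):
    """Return indices of proposals whose best ground-truth overlap
    clears the positive threshold."""
    return np.flatnonzero(max_overlaps >= pos_thresh)

# Normal image: some proposals overlap a ground-truth box.
print(sample_positive_inds(np.array([0.9, 0.2, 0.7])))  # [0 2]

# All-background image: no ground truth, so every overlap is 0 and the
# positive index set is empty. Downstream code that assumes at least one
# positive (e.g. gather + squeeze on GPU) then operates on a 0-sized
# tensor, which is consistent with the illegal-memory-access error seen
# in the traceback above.
print(sample_positive_inds(np.zeros(3)))  # []
```

The larger the batch size is relative to the number of labeled images, the less likely every image in a batch is background, which would explain why bs=1 and small empty_ratio values appear more stable.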

jerrywgz commented 2 years ago

I reproduced the NaN issue on my side; bs=1 works fine.

Gaojun211 commented 2 years ago

Yes, before the earlier fix bs=1 also worked. https://github.com/PaddlePaddle/PaddleDetection/issues/3790#issuecomment-887224793

Gaojun211 commented 2 years ago

@jerrywgz Has the problem been resolved on your side?