Closed ichbing1 closed 6 months ago
After some investigation, I found that the trained model is producing NaN as output, which results in invalid index values that in turn lead to the above CUDA error.
However, the training looked perfectly normal. There was no Inf or NaN issue during the training, and the loss was decreasing gradually.
I don't see how a normally trained model can output NaN during inference.
Edit: The error does not occur when I train a smaller model, rtdetr_r34vd.
I also encountered the same issue. It seems to be caused by a problem with paddle.gather_nd: the error is at line 542 in post_process.py, where the `index` tensor contains invalid values:
```python
bbox_pred = paddle.gather_nd(bbox_pred, index)
```
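A hypothetical illustration of the failure mode (my own sketch in numpy, not PaddleDetection's actual code): if the model emits NaN scores, any index tensor derived from them can be garbage, and `paddle.gather_nd` then reads out of bounds, which surfaces as the CUDA device-side assert. A cheap guard is to check the scores before gathering:

```python
import numpy as np

# Sketch: NaN in the top-k scores means indices derived from them
# are undefined, so a subsequent gather can go out of bounds.
topk_score = np.array([0.91, np.nan, 0.40], dtype=np.float32)

# Check for NaN before building gather indices from these scores.
has_nan = bool(np.isnan(topk_score).any())
print(has_nan)  # True -> inspect the model output instead of gathering
```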
I tried setting snapshot_epoch to 20, and rtdetr_r50vd is normal now.
> I try set snapshot_epoch to 20, rtdetr_r50vd is normal now
Hi @HansenLYX0708, I also encountered the same problem when training rtdetr_r50vd on COCO. Did you mean that after training for some epochs, rtdetr_r50vd produces normal performance?
> I try set snapshot_epoch to 20, rtdetr_r50vd is normal now
I had tried the same strategy (training without evaluation for more epochs), but that didn't solve the problem for me. I had stopped training just before 20 epochs, though.
@ichbing1 I found this error occurs because the top-k score is NaN when the DETRPostProcess class is called. In my debugging it is only called during evaluation, though, so do you see the same error during training as well?
@nijkah I only have one GPU, but the README suggests using 4 GPUs, so I haven't measured the exact performance yet. I can only say that setting a larger evaluation interval avoids hitting NaN during evaluation. Here is my initial result on custom data; it seems a little hard to train with one GPU. After 72 epochs the loss is 20.8:
```
[07/05 16:26:49] ppdet.engine INFO: Epoch: [70] [  0/159] learning_rate: 0.000100 loss_class: 0.362942 loss_bbox: 0.154411 loss_giou: 0.858718 loss_class_aux: 4.252550 loss_bbox_aux: 1.008297 loss_giou_aux: 5.454723 loss_class_dn: 0.057903 loss_bbox_dn: 0.154584 loss_giou_dn: 0.852306 loss_class_aux_dn: 2.114557 loss_bbox_aux_dn: 0.806046 loss_giou_aux_dn: 4.419468 loss: 20.900740 eta: 0:05:17 batch_cost: 0.9740 data_cost: 0.4453 ips: 4.1069 images/s
[07/05 16:28:40] ppdet.engine INFO: Epoch: [70] [100/159] learning_rate: 0.000100 loss_class: 0.368415 loss_bbox: 0.140788 loss_giou: 0.888129 loss_class_aux: 4.165645 loss_bbox_aux: 0.894875 loss_giou_aux: 5.477043 loss_class_dn: 0.050430 loss_bbox_dn: 0.143358 loss_giou_dn: 0.856257 loss_class_aux_dn: 2.078147 loss_bbox_aux_dn: 0.738201 loss_giou_aux_dn: 4.405769 loss: 20.772293 eta: 0:03:37 batch_cost: 0.9739 data_cost: 0.4412 ips: 4.1073 images/s
[07/05 16:29:46] ppdet.engine INFO: Epoch: [71] [  0/159] learning_rate: 0.000100 loss_class: 0.378894 loss_bbox: 0.139767 loss_giou: 0.860358 loss_class_aux: 4.339354 loss_bbox_aux: 0.895106 loss_giou_aux: 5.409535 loss_class_dn: 0.051929 loss_bbox_dn: 0.157092 loss_giou_dn: 0.853588 loss_class_aux_dn: 2.108598 loss_bbox_aux_dn: 0.801123 loss_giou_aux_dn: 4.420537 loss: 20.815054 eta: 0:02:38 batch_cost: 1.0244 data_cost: 0.4896 ips: 3.9046 images/s
[07/05 16:31:45] ppdet.engine INFO: Epoch: [71] [100/159] learning_rate: 0.000100 loss_class: 0.365334 loss_bbox: 0.150335 loss_giou: 0.881706 loss_class_aux: 4.108303 loss_bbox_aux: 0.951658 loss_giou_aux: 5.464084 loss_class_dn: 0.060056 loss_bbox_dn: 0.140181 loss_giou_dn: 0.848054 loss_class_aux_dn: 2.094025 loss_bbox_aux_dn: 0.729748 loss_giou_aux_dn: 4.378035 loss: 20.830103 eta: 0:00:58 batch_cost: 1.0608 data_cost: 0.5213 ips: 3.7707 images/s
```
I think you should adapt the lr according to the total batch size, e.g. `lr * 0.1`.
> I think you should adapt lr according to total batch size, eg. `lr * 0.1`
@lyuwenyu In my case, I train RT-DETR r50vd on COCO with 2 GPUs (batch size 8 each) in PPDet, so I didn't change the LR. However, it throws the error even when evaluation is done with the model trained for 6 epochs. :( The R18vd version shows normal convergence and performance. Can I ask you to check it?
> @ichbing1 I found this error is due to top-k score is NaN when call DETRPostProcess class, but in my debugging, it's called only when Evaluation, so is same error when you training?
Yes, I'm also seeing the error only during evaluation. And I think I found a solution.
I backtraced the NaN output of the network to see where the values start to diverge. It seems that conv1 in the ResNet50 backbone (https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/resnet.py#L580) produces larger values in evaluation mode (in the range of 0-10) than in training mode (in the range of 0-2; training is resumed from the same checkpoint). These values get amplified through the subsequent res_layers into very large numbers and finally become NaN.
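A toy sketch of the suspected mechanism (my own illustration, not the backbone code): a frozen norm layer always normalizes with its stored running statistics, while train-mode normalization uses the batch's own mean/std. If the stored stats don't match the real activation distribution, the "normalized" output stays far from unit scale, and each subsequent layer amplifies it further:

```python
import numpy as np

# Activations drawn from a distribution the stored stats don't match.
np.random.seed(0)
x = np.random.randn(10_000).astype(np.float32) * 5.0 + 10.0

train_mode = (x - x.mean()) / x.std()   # batch stats -> std close to 1
frozen = (x - 0.0) / np.sqrt(1.0)       # stale running mean=0, var=1

# The frozen path leaves the activations ~5x larger than unit scale;
# stacked layers compound this until values overflow to Inf/NaN.
print(float(train_mode.std()), float(frozen.std()))
```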
Then I noticed that the freeze_norm option is turned on only for ResNet50 (it's turned off for ResNet34 & ResNet18), so I tried turning it off by adding freeze_norm: false in configs/rtdetr/_base_/rtdetr_r50vd.yaml, thinking that it might help prevent the values from getting extremely large. I also turned off the parameter-freeze option for conv1 by changing freeze_at: 0 to freeze_at: -1.
Now when I train rtdetr_r50vd with the modified config, I don't see the CUDA error during evaluation any more.
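For reference, a sketch of the two config changes described above (key names as in PaddleDetection's ResNet backbone config; surrounding fields are omitted and may differ by version):

```yaml
# configs/rtdetr/_base_/rtdetr_r50vd.yaml (excerpt, sketch)
ResNet:
  depth: 50
  freeze_at: -1       # was 0: stop freezing conv1's parameters
  freeze_norm: false  # added: let norm layers use live statistics
```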
ichbing1's answer perfectly solved my problem.
Search before asking
Please ask your question
While training RT-DETR on my custom dataset in COCO format, I'm getting the following errors after the first training epoch when it tries to evaluate the model.
I searched for similar questions and found some answers saying it's caused by indices exceeding the number of categories, but I double-checked and my dataset has no such problem. I also specified the new number of classes in the dataset configuration file configs/datasets/my_dataset.yml. Just to make sure, I also tested training with the validation data (which caused errors during evaluation) and confirmed that training proceeds without errors. Likewise, I tested evaluating with the training data and got the same errors. I also get the same error when running inference.
Am I missing something here? Any suggestions or comments are appreciated. Thanks!
I'm using the PaddleDetection docker image paddlecloud/paddledetection:2.4-gpu-cuda11.2-cudnn8-e9a542 with paddlepaddle-gpu==2.4.2.post112 installed (2.4.1 required for RT-DETR). The same issue is observed on a Tesla V100 and an RTX 4080. The model is rtdetr_r50vd.yml.