Closed ichbing1 closed 6 months ago
After some investigation, I found that the trained model is producing NaN as output, which results in invalid index values that in turn lead to the above CUDA error.
However, the training looked perfectly normal. There was no Inf or NaN issue during the training, and the loss was decreasing gradually.
I don't see how a normally trained model can output NaN during inference.
Edit: The error does not occur when I train a smaller model, rtdetr_r34vd.
I also encountered the same issue. It seems to be caused by a problem with paddle.gather_nd: the error is at line 542 in post_process.py, where the `index` tensor contains invalid values:
```python
bbox_pred = paddle.gather_nd(bbox_pred, index)
```
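A hypothetical illustration of the failure mode (my own sketch in numpy, not PaddleDetection's actual code): if the model emits NaN scores, any index tensor derived from them can be garbage, and `paddle.gather_nd` then reads out of bounds, which surfaces as the CUDA device-side assert. A cheap guard is to check the scores before gathering:

```python
import numpy as np

# Sketch: NaN in the top-k scores means indices derived from them
# are undefined, so a subsequent gather can go out of bounds.
topk_score = np.array([0.91, np.nan, 0.40], dtype=np.float32)

# Check for NaN before building gather indices from these scores.
has_nan = bool(np.isnan(topk_score).any())
print(has_nan)  # True -> inspect the model output instead of gathering
```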
I tried setting snapshot_epoch to 20, and rtdetr_r50vd is normal now.
> I try set snapshot_epoch to 20, rtdetr_r50vd is normal now
Hi @HansenLYX0708, I also encountered the same problem when training rtdetr_r50vd on COCO. Did you mean that after training for some epochs, rtdetr_r50vd produces normal performance?
> I try set snapshot_epoch to 20, rtdetr_r50vd is normal now
I had tried the same strategy (training without evaluation for more epochs), but that didn't solve the problem for me. I had stopped training just before 20 epochs, though.
@ichbing1 I found this error occurs because the top-k score is NaN when the DETRPostProcess class is called. In my debugging it is only called during evaluation, though, so do you see the same error during training as well?
@nijkah I only have one GPU, but the README suggests using 4 GPUs, so I haven't measured the exact performance yet. I can only say that setting a larger evaluation interval avoids hitting NaN during evaluation. Here is my initial result on custom data; it seems a little hard to train with one GPU. After 72 epochs the loss is 20.8:
```
[07/05 16:26:49] ppdet.engine INFO: Epoch: [70] [  0/159] learning_rate: 0.000100 loss_class: 0.362942 loss_bbox: 0.154411 loss_giou: 0.858718 loss_class_aux: 4.252550 loss_bbox_aux: 1.008297 loss_giou_aux: 5.454723 loss_class_dn: 0.057903 loss_bbox_dn: 0.154584 loss_giou_dn: 0.852306 loss_class_aux_dn: 2.114557 loss_bbox_aux_dn: 0.806046 loss_giou_aux_dn: 4.419468 loss: 20.900740 eta: 0:05:17 batch_cost: 0.9740 data_cost: 0.4453 ips: 4.1069 images/s
[07/05 16:28:40] ppdet.engine INFO: Epoch: [70] [100/159] learning_rate: 0.000100 loss_class: 0.368415 loss_bbox: 0.140788 loss_giou: 0.888129 loss_class_aux: 4.165645 loss_bbox_aux: 0.894875 loss_giou_aux: 5.477043 loss_class_dn: 0.050430 loss_bbox_dn: 0.143358 loss_giou_dn: 0.856257 loss_class_aux_dn: 2.078147 loss_bbox_aux_dn: 0.738201 loss_giou_aux_dn: 4.405769 loss: 20.772293 eta: 0:03:37 batch_cost: 0.9739 data_cost: 0.4412 ips: 4.1073 images/s
[07/05 16:29:46] ppdet.engine INFO: Epoch: [71] [  0/159] learning_rate: 0.000100 loss_class: 0.378894 loss_bbox: 0.139767 loss_giou: 0.860358 loss_class_aux: 4.339354 loss_bbox_aux: 0.895106 loss_giou_aux: 5.409535 loss_class_dn: 0.051929 loss_bbox_dn: 0.157092 loss_giou_dn: 0.853588 loss_class_aux_dn: 2.108598 loss_bbox_aux_dn: 0.801123 loss_giou_aux_dn: 4.420537 loss: 20.815054 eta: 0:02:38 batch_cost: 1.0244 data_cost: 0.4896 ips: 3.9046 images/s
[07/05 16:31:45] ppdet.engine INFO: Epoch: [71] [100/159] learning_rate: 0.000100 loss_class: 0.365334 loss_bbox: 0.150335 loss_giou: 0.881706 loss_class_aux: 4.108303 loss_bbox_aux: 0.951658 loss_giou_aux: 5.464084 loss_class_dn: 0.060056 loss_bbox_dn: 0.140181 loss_giou_dn: 0.848054 loss_class_aux_dn: 2.094025 loss_bbox_aux_dn: 0.729748 loss_giou_aux_dn: 4.378035 loss: 20.830103 eta: 0:00:58 batch_cost: 1.0608 data_cost: 0.5213 ips: 3.7707 images/s
```
I think you should adapt the lr according to the total batch size, e.g. `lr * 0.1`.
> I think you should adapt lr according to total batch size, eg. `lr * 0.1`
@lyuwenyu In my case, I train RT-DETR r50vd on COCO with 2 GPUs (batch size 8 each) in PPDet, so I didn't change the LR. However, it throws the error even when evaluation is done with the model trained for 6 epochs. :( The R18vd version shows normal convergence and performance. Can I ask you to check it?
> @ichbing1 I found this error is due to top-k score is NaN when call DETRPostProcess class, but in my debugging, it's called only when Evaluation, so is same error when you training?
Yes, I'm also seeing the error only during evaluation. And I think I found a solution.
I backtraced the NaN output of the network to see where the values start to diverge. It seems that conv1 in the ResNet50 backbone (https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/resnet.py#L580) produces larger values in evaluation mode (in the range of 0-10) than in training mode (in the range of 0-2; training is resumed from the same checkpoint). These values get amplified through the subsequent res_layers into very large numbers and finally become NaN.
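A toy sketch of the suspected mechanism (my own illustration, not the backbone code): a frozen norm layer always normalizes with its stored running statistics, while train-mode normalization uses the batch's own mean/std. If the stored stats don't match the real activation distribution, the "normalized" output stays far from unit scale, and each subsequent layer amplifies it further:

```python
import numpy as np

# Activations drawn from a distribution the stored stats don't match.
np.random.seed(0)
x = np.random.randn(10_000).astype(np.float32) * 5.0 + 10.0

train_mode = (x - x.mean()) / x.std()   # batch stats -> std close to 1
frozen = (x - 0.0) / np.sqrt(1.0)       # stale running mean=0, var=1

# The frozen path leaves the activations ~5x larger than unit scale;
# stacked layers compound this until values overflow to Inf/NaN.
print(float(train_mode.std()), float(frozen.std()))
```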
Then I noticed that the freeze_norm option is turned on only for ResNet50 (it's turned off for ResNet34 & ResNet18), so I tried turning it off by adding freeze_norm: false in configs/rtdetr/_base_/rtdetr_r50vd.yaml, thinking that it might help prevent the values from getting extremely large. I also turned off the parameter-freeze option for conv1 by changing freeze_at: 0 to freeze_at: -1.
Now when I train rtdetr_r50vd with the modified config, I don't see the CUDA error during evaluation any more.
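For reference, a sketch of the two config changes described above (key names as in PaddleDetection's ResNet backbone config; surrounding fields are omitted and may differ by version):

```yaml
# configs/rtdetr/_base_/rtdetr_r50vd.yaml (excerpt, sketch)
ResNet:
  depth: 50
  freeze_at: -1       # was 0: stop freezing conv1's parameters
  freeze_norm: false  # added: let norm layers use live statistics
```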
ichbing1's answer perfectly solved my problem.
Search before asking
Please ask your question
While training RT-DETR on my custom dataset in COCO format, I'm getting the following errors after the first training epoch when it tries to evaluate the model.
I searched for similar questions and found some answers saying it's caused by indices exceeding the number of categories, but I double-checked and my dataset has no such problem. I also specified the new number of classes in the dataset configuration file configs/datasets/my_dataset.yml. Just to make sure, I also tested training with the validation data (which caused errors during evaluation) and confirmed that training proceeds without errors. Likewise, I tested evaluating with the training data and got the same errors. I also get the same error when running inference.
Am I missing something here? Any suggestions or comments are appreciated. Thanks!
I'm using the PaddleDetection docker image paddlecloud/paddledetection:2.4-gpu-cuda11.2-cudnn8-e9a542 with paddlepaddle-gpu==2.4.2.post112 installed (2.4.1 required for RT-DETR). The same issue is observed on a Tesla V100 and an RTX 4080. The model is rtdetr_r50vd.yml.