duanzhiihao / RAPiD

RAPiD: Rotation-Aware People Detection in Overhead Fisheye Images (CVPR 2020 Workshops)
http://vip.bu.edu/rapid/

Confused about the format of your labeled dataset CEPDOF #8

Open cs-heibao opened 4 years ago

cs-heibao commented 4 years ago

The GT bounding-box format is cx, cy, w, h, angle. Are all five values the parameters corresponding to the rotated object? And is the angle defined as in the figure below? [figure]

duanzhiihao commented 4 years ago

I can't fully understand your question, but I guess you are asking about the definition of the angle in the ground-truth bboxes. The angle is the number of degrees by which the bounding box is rotated clockwise. For example, in the following figure the angle is around 60 (or, equivalently, -120) degrees; angle=60 and angle=-120 describe the same box. [figure illustrating the rotated box]
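For reference, here is a quick illustration (my own sketch, not code from this repo) of why two angles that differ by 180 degrees describe the same box; the helper folds any angle into (-90, 90], though the canonical range used by the dataset may differ:

```python
# A rectangle is unchanged by a 180-degree rotation, so angles that differ
# by 180 describe the same rotated box. This folds an angle (in degrees)
# into (-90, 90]; the dataset's own canonical range may differ.
def normalize_angle(angle_deg: float) -> float:
    a = angle_deg % 180.0          # fold into [0, 180)
    return a if a <= 90.0 else a - 180.0

print(normalize_angle(60))    # 60.0
print(normalize_angle(-120))  # 60.0 -> equivalent to 60
```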

cs-heibao commented 4 years ago

@duanzhiihao Hi, I mean: the angle is between which side (bw or bh) and which axis (x or y)? And do cx, cy, w, h belong to the rotated object or the original object?

duanzhiihao commented 4 years ago

To be explicit, a rotated bounding box is described by cx, cy, w, h, and angle. h is defined as the longer side of the bounding box; in other words, h is always greater than or equal to w.

angle is between h and the y-axis (clockwise). Equivalently, it is also between w and the x-axis (clockwise).

cx, cy, w, h belong to the rotated object.
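As an illustration of this convention, here is a minimal sketch (not code from this repo) that converts one annotation into its four corner points, assuming image coordinates with x to the right, y pointing down, and the angle in degrees clockwise as described above:

```python
import numpy as np

def rotbbox_to_corners(cx, cy, w, h, angle_deg):
    """Return the 4 corners (4x2 array) of a rotated box given cx, cy, w, h, angle."""
    theta = np.deg2rad(angle_deg)
    # w is aligned with the x-axis rotated clockwise by `angle`,
    # h with the y-axis rotated clockwise by `angle` (h >= w by convention).
    u_w = np.array([np.cos(theta), np.sin(theta)])   # direction of the w side
    u_h = np.array([-np.sin(theta), np.cos(theta)])  # direction of the h side
    center = np.array([cx, cy])
    half_w, half_h = 0.5 * w * u_w, 0.5 * h * u_h
    return np.stack([center - half_w - half_h,
                     center + half_w - half_h,
                     center + half_w + half_h,
                     center - half_w + half_h])

# Example: a box centered at (100, 100), 40 wide, 80 tall, rotated 60 degrees clockwise
print(rotbbox_to_corners(100, 100, 40, 80, 60))
```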

cs-heibao commented 4 years ago

@duanzhiihao Thanks, I got it, and I have also visualized the labeled images. Another question I hope you can give some guidance on: I downloaded the CEPDOF dataset and the pretrained model and tried training with the script train.py, but the loss seems unusual. Should I change some parameters?

Total time: 1:30:47.420144, iter: 0:00:10.916674, epoch: 2:48:08.225082
[Iteration 497] [learning rate 9.96e-05] [Total loss 86.83] [img size 512]
16_total1e-16objects_loss_xy: 0.0, loss_wh: 0.0, loss_angle: 0.0, conf: 0.12520018219947815
32_totalTrueobjects_loss_xy: 12.013134002685547, loss_wh: 0.28178492188453674, loss_angle: 2.6057467460632324, conf: 13.017786026000977
64_totalTrueobjects_loss_xy: 19.2332820892334, loss_wh: 0.3951633870601654, loss_angle: 4.4206318855285645, conf: 35.078147888183594
Max GPU memory usage: 3.3762731552124023 GigaBytes

Total time: 1:30:56.391230, iter: 0:00:10.912782, epoch: 2:48:04.676472
[Iteration 498] [learning rate 0.0001] [Total loss 84.99] [img size 512]
16_total1e-16objects_loss_xy: 0.0, loss_wh: 0.0, loss_angle: 0.0, conf: 0.037787601351737976
32_totalTrueobjects_loss_xy: 11.480892181396484, loss_wh: 0.27258607745170593, loss_angle: 1.4185596704483032, conf: 7.912797927856445
64_totalTrueobjects_loss_xy: 25.00128936767578, loss_wh: 0.3746016025543213, loss_angle: 5.575854301452637, conf: 33.23590087890625
Max GPU memory usage: 3.3762731552124023 GigaBytes

Total time: 1:31:05.636933, iter: 0:00:10.909455, epoch: 2:48:01.601010
[Iteration 499] [learning rate 0.0001] [Total loss 92.20] [img size 512]
16_totalTrueobjects_loss_xy: 1.1926631927490234, loss_wh: 0.007322967518121004, loss_angle: 0.009164094924926758, conf: 2.004310131072998
32_totalTrueobjects_loss_xy: 6.600027084350586, loss_wh: 0.037587691098451614, loss_angle: 1.4872658252716064, conf: 6.406645774841309
64_totalTrueobjects_loss_xy: 26.779525756835938, loss_wh: 0.48900607228279114, loss_angle: 6.459133625030518, conf: 40.99555969238281
Max GPU memory usage: 3.3762736320495605 GigaBytes

Total time: 1:31:14.235448, iter: 0:00:10.904851, epoch: 2:47:57.342678
[Iteration 500] [learning rate 0.0001] [Total loss 113.01] [img size 512]
16_total1e-16objects_loss_xy: 0.0, loss_wh: 0.0, loss_angle: 0.0, conf: 1.008183479309082
32_totalTrueobjects_loss_xy: 9.287293434143066, loss_wh: 0.08602897822856903, loss_angle: 1.5510506629943848, conf: 16.09341812133789
64_totalTrueobjects_loss_xy: 25.234947204589844, loss_wh: 0.4412153363227844, loss_angle: 8.447026252746582, conf: 51.121826171875
Max GPU memory usage: 3.3762731552124023 GigaBytes
duanzhiihao commented 4 years ago

The format of the training log does seem unusual. Did you use the latest version of this repository? Or did you modify the following line? https://github.com/duanzhiihao/RAPiD/blob/0a9440f89b9bf9d17a7b66ba5acabc7cd3c9eb7f/models/rapid.py#L309

However, the numbers look fine to me. Can you try to test and visualize the trained model on some CEPDOF images and check the results?

cs-heibao commented 4 years ago

@duanzhiihao Yes, testing with the pretrained model pL1_MWHB1024_Mar11_4000.ckpt works fine.

cs-heibao commented 4 years ago

@duanzhiihao I modified the code to match yours and got the following result, but it is the same. Also, why is the learning rate so small? What does your training loss log look like?

Total time: 0:04:52.641023, iter: 0:00:22.510848, epoch: 5:46:42.661542
[Iteration 11] [learning rate 6.76e-08] [Total loss 168.46] [img size 544]
level_17 total 1 objects: xy/gt 1.428, wh/gt 0.002, angle/gt 0.154, conf 0.384
level_34 total 1 objects: xy/gt 10.540, wh/gt 0.131, angle/gt 2.384, conf 19.865
level_68 total 1 objects: xy/gt 27.822, wh/gt 0.842, angle/gt 11.726, conf 93.668
Max GPU memory usage: 3.7739853858947754 GigaBytes

Total time: 0:05:01.486867, iter: 0:00:21.534776, epoch: 5:31:40.604880
[Iteration 12] [learning rate 7.84e-08] [Total loss 160.44] [img size 544]
level_17 total 0 objects: xy/gt 0.000, wh/gt 0.000, angle/gt 0.000, conf 0.144
level_34 total 1 objects: xy/gt 15.835, wh/gt 0.200, angle/gt 2.414, conf 9.016
level_68 total 1 objects: xy/gt 27.361, wh/gt 0.845, angle/gt 12.563, conf 92.584
Max GPU memory usage: 3.773984909057617 GigaBytes

Total time: 0:05:10.213815, iter: 0:00:20.680921, epoch: 5:18:31.630590
[Iteration 13] [learning rate 9e-08] [Total loss 121.26] [img size 544]
level_17 total 0 objects: xy/gt 0.000, wh/gt 0.000, angle/gt 0.000, conf 10.947
level_34 total 1 objects: xy/gt 16.386, wh/gt 0.183, angle/gt 1.434, conf 14.134
level_68 total 1 objects: xy/gt 16.564, wh/gt 0.470, angle/gt 3.947, conf 57.517
Max GPU memory usage: 3.773984909057617 GigaBytes

Total time: 0:05:18.908381, iter: 0:00:19.931774, epoch: 5:06:59.296779
[Iteration 14] [learning rate 1.02e-07] [Total loss 218.47] [img size 544]
level_17 total 0 objects: xy/gt 0.000, wh/gt 0.000, angle/gt 0.000, conf 0.022
level_34 total 1 objects: xy/gt 9.118, wh/gt 0.101, angle/gt 2.331, conf 29.773
level_68 total 1 objects: xy/gt 37.067, wh/gt 0.961, angle/gt 10.717, conf 128.906
Max GPU memory usage: 3.773984909057617 GigaBytes

Total time: 0:05:27.616841, iter: 0:00:19.271579, epoch: 4:56:49.172433
[Iteration 15] [learning rate 1.16e-07] [Total loss 153.43] [img size 544]
level_17 total 1 objects: xy/gt 3.641, wh/gt 0.050, angle/gt 1.801, conf 7.798
level_34 total 1 objects: xy/gt 6.590, wh/gt 0.022, angle/gt 0.680, conf 11.009
level_68 total 1 objects: xy/gt 34.764, wh/gt 0.561, angle/gt 15.035, conf 71.796
Max GPU memory usage: 3.7739853858947754 GigaBytes

Total time: 0:05:36.311363, iter: 0:00:18.683965, epoch: 4:47:46.116816
[Iteration 16] [learning rate 1.3e-07] [Total loss 135.15] [img size 544]
level_17 total 1 objects: xy/gt 1.351, wh/gt 0.036, angle/gt 0.093, conf 0.208
level_34 total 1 objects: xy/gt 9.287, wh/gt 0.108, angle/gt 1.949, conf 8.007
level_68 total 1 objects: xy/gt 26.259, wh/gt 0.666, angle/gt 8.405, conf 79.191
Max GPU memory usage: 3.7739853858947754 GigaBytes
duanzhiihao commented 4 years ago

Unfortunately, I can't find my training log. To me, the loss numbers that you showed look reasonable. I guess you think the loss should be around 0, but that is not the case here. For example, although x=0.5 is the minimizer of binary_cross_entropy(x, 0.5), binary_cross_entropy(0.5, 0.5) is not 0. Also, our loss uses 'sum' reduction (not 'mean'), so it will typically be relatively large. https://github.com/duanzhiihao/RAPiD/blob/0a9440f89b9bf9d17a7b66ba5acabc7cd3c9eb7f/models/rapid.py#L113
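A quick check of both points (my own illustration, not code from the repo), using PyTorch's functional BCE:

```python
import torch
import torch.nn.functional as F

# BCE attains its minimum at input == target, but that minimum is not 0
# unless the target is exactly 0 or 1.
t = torch.tensor([0.5])
print(F.binary_cross_entropy(torch.tensor([0.5]), t))  # ~0.6931, not 0
print(F.binary_cross_entropy(torch.tensor([0.4]), t))  # larger than above

# With reduction='sum' the loss scales with the number of predictions,
# so large absolute values in the training log are expected.
x = torch.full((1000,), 0.5)
print(F.binary_cross_entropy(x, x, reduction='sum'))   # ~693
print(F.binary_cross_entropy(x, x, reduction='mean'))  # ~0.693
```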

What I mean is: train for some time, then test on CEPDOF images using your trained model. If the detection results look reasonable, then you are fine.

The learning rate is small at the beginning of training and increases over time. You can modify the code here if you prefer a larger learning rate. https://github.com/duanzhiihao/RAPiD/blob/0a9440f89b9bf9d17a7b66ba5acabc7cd3c9eb7f/train.py#L153 https://github.com/duanzhiihao/RAPiD/blob/0a9440f89b9bf9d17a7b66ba5acabc7cd3c9eb7f/train.py#L218
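The exact schedule is in the linked lines of train.py; as a rough sketch of the idea (my own illustration, not the repo's exact formula), a burn-in/warm-up schedule ramps the learning rate from near 0 up to the base value over the first iterations:

```python
def warmup_lr(iteration: int, base_lr: float = 1e-4, burn_in: int = 1000) -> float:
    """Quadratic warm-up: lr starts near 0 and reaches base_lr at `burn_in`."""
    if iteration < burn_in:
        return base_lr * (iteration / burn_in) ** 2
    return base_lr

for i in (10, 100, 500, 1000, 5000):
    print(i, warmup_lr(i))
```

In such a scheme, lowering base_lr or lengthening the burn-in is how the learning rate would be reduced further.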

cs-heibao commented 4 years ago

@duanzhiihao Yes, I actually thought the loss should be close to 0, as in my other projects. I appreciate your guidance; I will train on my own dataset and check the model. Thanks for this great project.

duanzhiihao commented 4 years ago

You are welcome! Please tell me if you have other questions.

cs-heibao commented 4 years ago

@duanzhiihao Another problem: during training, the following error occurs at some iterations:

Traceback (most recent call last):
  File "/*****/RAPiD-master/train.py", line 263, in <module>
    loss = model(imgs, targets, labels_cats=cats)
  File "/*****/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/*****/RAPiD-master/models/rapid.py", line 78, in forward
    boxes_S, loss_S = self.pred_S(detect_S, self.img_size, labels)
  File "/*****/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/*****/RAPiD-master/models/rapid.py", line 285, in forward
    target[b,best_n,truth_j,truth_i,0] = tx_all[b,:n][valid_mask] - tx_all[b,:n][valid_mask].floor()
RuntimeError: copy_if failed to synchronize: device-side assert triggered
duanzhiihao commented 4 years ago

According to https://github.com/facebookresearch/maskrcnn-benchmark/issues/658#issuecomment-481923633, it's because the learning rate is too large.

cs-heibao commented 4 years ago

@duanzhiihao Hi, with the CEPDOF dataset it runs well with a batch size of 4 and the learning rate following the scheduler; this error only occurs with my own dataset. But I built my dataset in the same format as CEPDOF, and at the beginning of training the learning rate is actually small. By the way, how can I reduce the learning rate further? Thanks.

cs-heibao commented 4 years ago

@duanzhiihao I have found the problem. It is probably because my own data differs from the CEPDOF dataset: after debugging, it seems that the horizontal flip, vertical flip, and augUtils.rotate operations in augment_PIL can push bounding boxes outside the image. I commented out the augmentation and now it runs well.
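If disabling augmentation entirely is too restrictive, one possible workaround (my own sketch, not code from the repo; the field layout [cx, cy, w, h, angle] is assumed) is to drop labels whose centers leave the image after flips/rotation:

```python
import numpy as np

def drop_outside_labels(labels: np.ndarray, img_w: int, img_h: int) -> np.ndarray:
    """Keep only labels whose centers remain inside the image after augmentation.

    labels: (N, 5) array of [cx, cy, w, h, angle].
    """
    cx, cy = labels[:, 0], labels[:, 1]
    keep = (cx >= 0) & (cx < img_w) & (cy >= 0) & (cy < img_h)
    return labels[keep]
```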

twmht commented 2 years ago

@duanzhiihao

Why is the angle 60? Why not -60?