
Multi-scale training #16

Closed gurkirt closed 5 years ago

gurkirt commented 5 years ago

Thank you for making the code available online. Really solid work!

Where do you specify multi-scale training?

In the caption of figure one of the paper (https://arxiv.org/pdf/1901.03353.pdf), it is mentioned that you do not use multi-scale training.

Looking at https://github.com/chengyangfu/retinamask/blob/master/maskrcnn_benchmark/data/transforms/build.py#L10.

It seems you always use the multi-scale resizing option, which depends on the number of min sizes. If we were to specify multiple min sizes in the config file, we would get multi-scale training. Is that right?

Can you point me to a config file where you do that? Or do you not do that at all?

How is the batch handled, given that the images can have an arbitrary second dimension, resulting in arbitrary feature map sizes and a different number of predictions depending on the input image size?

Many thanks, Gurkirt

chengyangfu commented 5 years ago

Hi @gurkirt, sorry for the late reply. I was really busy last week. First, for multi-scale training: if you only put one number there, it becomes single-scale training. For example, in this case (https://github.com/chengyangfu/retinamask/blob/master/configs/retina/retinanet_R-50-FPN_1x.yaml#L30), the short side will be resized to 800. If you change MIN_SIZE_TRAIN to (600, 800,), then during training the short side will be randomly resized to 600 or 800.
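For concreteness, here is a simplified sketch (not the exact maskrcnn_benchmark code; the function name and defaults are illustrative) of how picking a random entry from MIN_SIZE_TRAIN turns into a resize target:

```python
import random

def get_train_size(image_w, image_h, min_sizes=(600, 800), max_size=1333):
    """Pick one MIN_SIZE_TRAIN entry at random and return the resized (w, h),
    keeping the aspect ratio and capping the long side at max_size."""
    min_size = random.choice(min_sizes)           # multi-scale iff len(min_sizes) > 1
    scale = min_size / min(image_w, image_h)      # short side -> min_size
    if max(image_w, image_h) * scale > max_size:  # don't let the long side exceed max_size
        scale = max_size / max(image_w, image_h)
    return round(image_w * scale), round(image_h * scale)

print(get_train_size(640, 480))   # e.g. (800, 600) or (1067, 800)
```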

For the second dimension, you can check the details here (https://github.com/chengyangfu/retinamask/blob/master/maskrcnn_benchmark/structures/image_list.py#L48). The idea is to pad the second dimension so that all the images in the same batch have the same size. The new second dimension is the maximum over the second dimensions of all images in the batch.
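A minimal sketch of that padding step (illustrative only, not the exact to_image_list implementation):

```python
import torch

def pad_to_batch(tensors, size_divisible=32):
    """Zero-pad a list of CxHxW image tensors to the largest H and W in the batch."""
    channels = tensors[0].shape[0]
    max_h = max(t.shape[1] for t in tensors)
    max_w = max(t.shape[2] for t in tensors)
    # Round up so strided feature maps (FPN levels) divide evenly.
    max_h = (max_h + size_divisible - 1) // size_divisible * size_divisible
    max_w = (max_w + size_divisible - 1) // size_divisible * size_divisible
    batch = tensors[0].new_zeros(len(tensors), channels, max_h, max_w)
    for img, padded in zip(tensors, batch):
        padded[:, : img.shape[1], : img.shape[2]].copy_(img)
    return batch
```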

gurkirt commented 5 years ago

Hi @chengyangfu,

Thank you for your answer, it helps a lot. At the moment I am simply resizing images to a fixed size on both sides, e.g. 600x600. Should this have a big impact on performance?

I am trying to compare different loss functions for single-stage object detectors. I am using FPN as the base model and adding different loss functions:

  1. FPN + OHEM (SSD style)
  2. Focal loss
  3. YOLO loss

Many thanks, Gurkirt

chengyangfu commented 5 years ago

Hi @gurkirt
If you change the image aspect ratio, you may need more complicated data augmentation to compensate for it. That's why data augmentation is so important to the performance of YOLO and SSD.

chengyangfu commented 5 years ago

For the anchor boxes: if you follow the RetinaNet style, the prediction head is shared across all prediction layers. If you choose the k-means approach instead, it definitely fits the data distribution better, but you need a different prediction head for each prediction layer.
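A rough sketch of what a shared classification head looks like in PyTorch (simplified; the real RetinaNet head also has a box-regression tower and specific weight initialization):

```python
import torch.nn as nn

class SharedClsHead(nn.Module):
    """One head whose weights are reused on every FPN level (P3..P7)."""
    def __init__(self, in_ch=256, num_anchors=9, num_classes=80):
        super().__init__()
        tower = []
        for _ in range(4):
            tower += [nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True)]
        self.tower = nn.Sequential(*tower)
        self.cls_logits = nn.Conv2d(in_ch, num_anchors * num_classes, 3, padding=1)

    def forward(self, fpn_features):
        # Same weights applied to every level; per-level k-means anchors would
        # instead need one head module per level.
        return [self.cls_logits(self.tower(f)) for f in fpn_features]
```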

For the 600x600 resolution, I think focal loss may not be that important. The motivation of focal loss is to handle the extreme ratio of negative to positive examples, and when you use a lower resolution, that ratio also changes.
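For reference, a minimal sketch of the sigmoid focal loss from the RetinaNet paper, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) (function name and reduction are illustrative):

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss; `targets` is a 0/1 tensor the same shape as `logits`."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)           # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()      # down-weights easy negatives
```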

On Wed, Jun 5, 2019 at 8:50 AM Gurkirt Singh notifications@github.com wrote:

Hi @chengyangfu,

Thank you for your answer, it helps a lot. At the moment I am simply resizing images to a fixed size on both sides, e.g. 600x600. Should this have a big impact on performance?

I am trying to compare different loss functions for single-stage object detectors. I am using FPN as the base model and adding different loss functions:

  1. FPN + OHEM (SSD style)
  2. Focal loss
  3. YOLO loss

For regression, I am using smooth_l1_loss with beta = 0.11 (https://github.com/gurkirt/RetinaNet/blob/master/modules/detection_loss.py#L12).
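For reference, a minimal sketch of a beta-parameterized smooth L1 (quadratic below beta, linear above); this is not the exact code in the file linked above:

```python
import torch

def smooth_l1(pred, target, beta=0.11):
    diff = (pred - target).abs()
    loss = torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.sum()
```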

I freeze the first two layers of the backbone network (https://github.com/gurkirt/RetinaNet/blob/master/modules/solver.py#L25).

ResNet is the backbone for the FPN (https://github.com/gurkirt/RetinaNet/blob/master/models/resnetFPN.py).

My code is based on my previous FPN implementation (https://github.com/gurkirt/FPN.pytorch1.0).

The loss functions are implemented here: https://github.com/gurkirt/RetinaNet/blob/master/modules/detection_loss.py.

My experiments with 600x600 images as input indicate that multi-box loss is better than focal loss. I am sure I am making some mistake. I follow all the guidelines for freezing the model, initializing each layer with std 0.01, and setting the bias of the last classification layer. Both multi-box loss and focal loss converge with the https://github.com/chengyangfu/retinamask/blob/master/configs/retina/retinanet_R-50-FPN_1x.yaml settings.

My anchors are not fixed; I computed them using k-means on the COCO dataset, and the k-means anchors are better than the standard anchors (see https://github.com/gurkirt/FPN.pytorch1.0#performance). The original anchors were generated similarly to the SSD anchor generation process.

The problem I am having is that multi-box loss is better than focal loss while keeping everything else constant.

Input size 3x600x600
learning rate = 0.01
batch size = 16
max_iter=90000
step=60000,80000
step_gamma=0.1
backbone=resnet50

Focal loss results are:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.291
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.476
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.303
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.133
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.324
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.400

Multibox loss results are:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.309
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.502
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.324
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.137
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.335
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.438

Of course, my evaluation is done jointly over all scales; I apply NMS per class after a 0.05 score threshold and keep the top 2000 detections. After NMS, the top 100 are picked for the final evaluation (see https://github.com/gurkirt/RetinaNet/blob/master/evaluate.py#L312).

I have tried everything; any help would be much appreciated.

Cheers, Gurkirt


gurkirt commented 5 years ago

Hi @chengyangfu

Thank you very much for your insightful comments.

At the moment I use shared heads across the different pyramid levels, and the regression and classification towers are independent of each other, the same as RetinaNet.

You can see here that k-means based anchors do help even with shared heads. But here there is heavy data augmentation, and in that case the whole network is trained.
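For reference, a minimal sketch of how such anchors can be computed with k-means over the ground-truth (w, h) pairs using a 1 - IoU distance, YOLOv2-style (illustrative, not the exact script behind the numbers above):

```python
import numpy as np

def wh_iou(wh, centers):
    """IoU between (N, 2) box shapes and (k, 2) cluster centers, both anchored at the origin."""
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = wh[:, 0:1] * wh[:, 1:2] + (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(wh, k=3, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        assign = wh_iou(wh, centers).argmax(axis=1)      # nearest center = highest IoU
        centers = np.stack([wh[assign == i].mean(axis=0) if (assign == i).any()
                            else centers[i] for i in range(k)])
    return centers
```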

My focal loss implementation was wrong; I think I have it right now, here.

Focal loss does help at 600x600 resolution, but not by a lot. There is no data augmentation in this case, and k-means anchors are used: 3 anchors at each cell of the feature map grid. I will try with 3 pre-defined anchors to see if focal loss helps more. As you said, focal loss should shine where there are a lot more negatives than positives.

Input size 3x600x600
learning rate = 0.01
batch size = 16
max_iter=90000
step=60000,80000
step_gamma=0.1
backbone=resnet50

Multibox loss results are:

Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.309
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.502
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.324
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.137
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.335
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.438

Focal loss results are

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.312
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.510
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.324
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.145
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.345
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.432
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.287
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.467
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.514
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.323
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.552
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.673
gurkirt commented 5 years ago

Of course, my evaluation is done jointly over all scales; I apply NMS per class after a 0.05 score threshold and keep the top 2000 detections. After NMS, the top 100 are picked for the final evaluation (see https://github.com/gurkirt/RetinaNet/blob/master/evaluate.py#L312).

How much difference does it make to the evaluation when NMS is applied on each pyramid level independently?
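For concreteness, a rough sketch of the two post-processing variants being compared, using torchvision's batched_nms (function names and thresholds are illustrative):

```python
import torch
from torchvision.ops import batched_nms

def joint_nms(boxes, scores, labels, iou_thr=0.5, topk=100):
    # NMS over the detections of all pyramid levels pooled together.
    keep = batched_nms(boxes, scores, labels, iou_thr)[:topk]
    return boxes[keep], scores[keep], labels[keep]

def per_level_nms(boxes_per_level, scores_per_level, labels_per_level,
                  iou_thr=0.5, topk=100):
    # NMS run independently on each pyramid level, then the survivors are merged.
    kept = [batched_nms(b, s, l, iou_thr) for b, s, l
            in zip(boxes_per_level, scores_per_level, labels_per_level)]
    boxes = torch.cat([b[k] for b, k in zip(boxes_per_level, kept)])
    scores = torch.cat([s[k] for s, k in zip(scores_per_level, kept)])
    labels = torch.cat([l[k] for l, k in zip(labels_per_level, kept)])
    order = scores.argsort(descending=True)[:topk]
    return boxes[order], scores[order], labels[order]
```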

gurkirt commented 5 years ago

Another question: are the cell anchors independent of the feature map size and the image size? Shouldn't they be correlated with the input image size and, as a result, the feature map size?

chengyangfu commented 5 years ago

Applying the evaluation on each pyramid level independently is important if your input resolution is high. In your case, 600x600, I think the difference is minor.

I follow RetinaNet to set up all the anchors; you can check the details in their paper. Anchor sizes are different on different feature layers, but they are independent of the input image.
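For reference, a sketch of RetinaNet-style anchor shapes per FPN level, following the paper (base sizes 32-512 doubling from P3 to P7, three octave scales and three aspect ratios, all in input-image pixels):

```python
import itertools

def retinanet_anchor_shapes():
    """Return {level: [(w, h), ...]} with 9 anchor shapes per level, in pixels."""
    shapes = {}
    for level, base in zip(range(3, 8), (32, 64, 128, 256, 512)):
        per_level = []
        for octave, ratio in itertools.product((0, 1, 2), (0.5, 1.0, 2.0)):
            size = base * 2 ** (octave / 3)   # 2^0, 2^(1/3), 2^(2/3) of the base size
            w = size * ratio ** 0.5           # keep the area at size**2
            h = size / ratio ** 0.5
            per_level.append((w, h))
        shapes[f"P{level}"] = per_level
    return shapes
```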

gurkirt commented 5 years ago

@chengyangfu, sorry to bother you again, I promise this is the last question. Thank you very much for your clear explanation.

It is clear that you use the same anchors as the original RetinaNet paper. Target bounding boxes are resized or scaled according to the input image dimensions here (https://github.com/chengyangfu/retinamask/blob/master/maskrcnn_benchmark/structures/bounding_box.py#L91), called from here (https://github.com/chengyangfu/retinamask/blob/master/maskrcnn_benchmark/data/transforms/transforms.py#L58). Then why not scale the anchors according to the image size as well, like in SSD?

chengyangfu commented 5 years ago

Currently, I just prefer RetinaNet's method, which does not scale the anchor boxes. You can treat the prediction layers at P3 (detecting sizes 32, 32·2^(1/3), and 32·2^(2/3)) as general detectors that focus on those sizes, no matter what your input resolution is. This also makes multi-scale training/inference very simple, again because you don't need to resize the anchors.
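A tiny, purely illustrative contrast between the two conventions (hypothetical numbers):

```python
def ssd_style_anchor(scale_fraction, image_size):
    # SSD anchors are a fraction of the input size, so they change with resolution.
    return scale_fraction * image_size

def retinanet_style_anchor(base_size_px, image_size):
    # RetinaNet anchors are fixed pixel sizes, independent of the input resolution.
    return base_size_px

print(ssd_style_anchor(0.1, 600), ssd_style_anchor(0.1, 800))            # 60.0 80.0
print(retinanet_style_anchor(32, 600), retinanet_style_anchor(32, 800))  # 32 32
```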
