facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0
26.24k stars 5.45k forks

retinanet (or focal loss) accuracy issue #93

Closed jwnsu closed 6 years ago

jwnsu commented 6 years ago

Has anyone been able to reproduce RetinaNet (FPN) accuracy on any dataset?

Tried it on a new dataset and got VOC2012-style accuracy of around 0.52 mAP (ResNet-50). The same dataset with the Mask/Faster R-CNN FPN (ResNet-50) from this repository gets >0.88 mAP (dataset settings are identical in both, SCALES/NMS too). Adjusting the learning rate does not help. I plan to vary more parameters (alpha, gamma, etc.)

Further inspection shows that 6 to 7 object classes get 0 AP in the RetinaNet case, which is very unusual.

Debug suggestions are welcome.

rbgirshick commented 6 years ago

The fact that some classes have 0 AP is very unusual indeed and makes it sound like there's a bug somewhere (we have actually never tried training and testing on VOC). Random guess: I would look into places that depend on the number of classes in the dataset (e.g., where class scores are computed for each anchor). Perhaps something is not handled properly when going from the 81 classes in COCO to the 21 in VOC?

tylin commented 6 years ago

From the description, the mAP of the non-zero categories is about 0.8, given that 7 categories have 0 AP. It looks like there is a category-id mapping error for those 7 categories.
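(A quick sanity check of this arithmetic, assuming VOC's 20 foreground classes: if the overall mAP is 0.52 and 7 classes score 0, the remaining 13 must average roughly 0.8.)

```python
# Back-of-the-envelope check of the "non-zero categories average ~0.8" claim
num_classes = 20        # VOC foreground classes (assumption; background excluded)
overall_map = 0.52      # reported overall mAP
zero_ap_classes = 7     # classes reported with 0 AP

# mAP is the mean over all classes, so the non-zero classes must average:
avg_nonzero = overall_map * num_classes / (num_classes - zero_ap_classes)
print(round(avg_nonzero, 2))  # -> 0.8
```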

jwnsu commented 6 years ago

Thanks for the info. We tried the same COCO-format instance files with the Mask R-CNN FPN in this repository, and its accuracy is very good (good accuracy for all object classes). It seems the RetinaNet code path uses its own data preparation (roi_data/retinanet.py); is it possible there are bugs there?

wgting96 commented 6 years ago

I am new to the Caffe2 framework. I am not sure, but does the RetinaNet head fail to set the number of classes?

https://github.com/facebookresearch/Detectron/blob/021685d42f7e8ac097e2bcf79fecb645f211378e/lib/modeling/retinanet_heads.py#L280-L304

Should it be something like the following code if I want to train RetinaNet on a custom dataset?

    cls_focal_loss, gated_prob = model.net.SoftmaxFocalLoss(
        [
            cls_lvl_logits, 'retnet_cls_labels_' + suffix,
            'retnet_fg_num'
        ],
        ['fl_{}'.format(suffix), 'retnet_prob_{}'.format(suffix)],
        gamma=cfg.RETINANET.LOSS_GAMMA,
        alpha=cfg.RETINANET.LOSS_ALPHA,
        num_classes=model.num_classes,
        scale=model.GetLossScale(),
    )
kampelmuehler commented 6 years ago

Any news on this? I'm struggling with the same behavior on ILSVRC.

@wgting96 thanks for pointing this out: num_classes is indeed used in caffe2/modules/detectron/sigmoid_focal_loss_op.cu and, according to the docs, defaults to 80, which might be causing the strange behavior. Testing this right now.

It is indeed the bug that prevents RetinaNet from learning with num_classes != 81. Tested with sigmoid focal loss.
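To see why a wrong num_classes default is so destructive, here is a toy illustration (not the actual kernel indexing in sigmoid_focal_loss_op.cu, which also folds in anchor and spatial dimensions): if per-anchor class scores are packed into one channel axis, the loss op needs num_classes to decode which (anchor, class) pair a channel belongs to, and a mismatched value misattributes almost every channel.

```python
# Hypothetical simplified layout: channel c of an (anchors * num_classes)
# score map decodes as anchor = c // num_classes, class = c % num_classes.
def decode_channel(c, num_classes):
    return c // num_classes, c % num_classes

# With 20 real classes per anchor, channel 25 belongs to anchor 1, class 5 ...
true_a, true_d = decode_channel(25, num_classes=20)

# ... but an op hard-coded to 80 classes decodes it as anchor 0, class 25,
# so the predicted score never lines up with its training label.
wrong_a, wrong_d = decode_channel(25, num_classes=80)
print((true_a, true_d), (wrong_a, wrong_d))  # -> (1, 5) (0, 25)
```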

twmht commented 6 years ago

@rbgirshick

there is no test for the SigmoidFocalLoss layer. I tried to migrate this implementation to Caffe and wrote a gradient checker for it, but got some large gradient differences (the difference between the analytical and numerical gradients is greater than 0.01).

So I suspect there are bugs in the SigmoidFocalLoss layer.

jwnsu commented 6 years ago

The accuracy is much improved (from ~0.5 to >0.75); however, there are still 4 object classes with 0 AP (down from the 7 to 8 zero-AP classes previously), and those 4 classes all get >0.9 AP with Faster R-CNN. Tried ResNeXt 32x8d; the same 4 classes still have 0 AP. Likely either RetinaNet's training data generation or the focal loss training produces no training signal for those 4 classes. The 4 zero-AP classes have category ids 3, 4, 5, and 8.
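One way to narrow this down is to verify that those category ids actually receive annotations after data loading. A hypothetical helper (the field names follow the COCO JSON format; the checker itself is not part of Detectron) that flags categories with zero annotations:

```python
from collections import Counter

def annotations_per_category(coco_dict):
    """Count annotations per category id and flag empty categories."""
    counts = Counter(ann["category_id"] for ann in coco_dict["annotations"])
    missing = [c["id"] for c in coco_dict["categories"] if counts[c["id"]] == 0]
    return counts, missing

# Toy COCO-style dict: 5 declared categories, annotations only for ids 1 and 2
toy = {
    "categories": [{"id": i, "name": str(i)} for i in range(1, 6)],
    "annotations": [{"category_id": 1}, {"category_id": 2}, {"category_id": 2}],
}
counts, missing = annotations_per_category(toy)
print(missing)  # -> [3, 4, 5]
```

If ids 3, 4, 5, and 8 show up as empty (or with suspiciously low counts) only on the RetinaNet data path, that would implicate roi_data/retinanet.py's target assignment rather than the loss.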