Poor Object Detection Performance with DINOv2 Backbone and Faster R-CNN Head on Cityscapes Dataset

busenuraktilav commented 10 months ago

I am working on an object detection task using the DINOv2 backbone with a Faster R-CNN head. While I have successfully implemented semantic segmentation with a linear head on the Cityscapes dataset and replicated the results from the relevant paper, I am encountering significant challenges in applying the DINOv2 backbone for object detection.

I used the dinov2_vits14_pretrain model and added a Faster R-CNN head as follows:

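from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# Dinov2Backbone and IdentityTransform are custom classes whose definitions are not shown here.

def create_model(num_classes):
    backbone = Dinov2Backbone()
    backbone.out_channels = 384  # Set the number of output channels

    downsampling_factor = 16
    feature_map_size = 630 // downsampling_factor

    # One anchor size for the single feature map, with three aspect ratios
    anchor_sizes = ((feature_map_size,),)
    anchor_generator = AnchorGenerator(sizes=anchor_sizes, aspect_ratios=((0.5, 1.0, 2.0),))
    roi_pooler = MultiScaleRoIAlign(featmap_names=['0'], output_size=7, sampling_ratio=2)

    model = FasterRCNN(backbone, min_size=630, num_classes=num_classes,
                       rpn_anchor_generator=anchor_generator, box_roi_pool=roi_pooler)
    model.transform = IdentityTransform()  # Replaces torchvision's built-in resize/normalize step
    return model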
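
A note on the numbers in this configuration: the ViT-S/14 variant uses 14x14 patches, so a 630x630 input yields a 45x45 token grid rather than the 39 implied by a factor of 16, and torchvision's AnchorGenerator interprets `sizes` in input-image pixels, not feature-map cells. The sketch below only works through that arithmetic. `patch_grid` and `anchor_shapes` are hypothetical helpers (the latter mirrors the usual size-and-ratio derivation of base anchors), and nothing in this thread confirms that these settings are the cause of the poor results.

```python
import math

PATCH_SIZE = 14   # patch size of the ViT-S/14 variant
IMAGE_SIZE = 630  # input side length used in the post

def patch_grid(image_size, patch_size=PATCH_SIZE):
    """Number of patch tokens along one side of a square input."""
    return image_size // patch_size

def anchor_shapes(size, aspect_ratios=(0.5, 1.0, 2.0)):
    """(width, height) in input pixels for one anchor size; ratio = height / width."""
    shapes = []
    for ratio in aspect_ratios:
        h = size * math.sqrt(ratio)
        w = size / math.sqrt(ratio)
        shapes.append((round(w, 1), round(h, 1)))
    return shapes

grid = patch_grid(IMAGE_SIZE)    # side of the token grid for 14-pixel patches
posted_size = IMAGE_SIZE // 16   # the single anchor size used in the post
print("token grid:", grid, "x", grid)
print("anchor shapes for size", posted_size, ":", anchor_shapes(posted_size))
```

With a single size, every anchor covers roughly the area of a 39x39 pixel square. For comparison, torchvision's default Faster R-CNN configuration with an FPN spreads anchor sizes from 32 to 512 pixels across the pyramid levels.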

Dataset and Training: I used the Cityscapes dataset, which includes classes such as 'person', 'rider', 'car', 'truck', 'bus', 'train', 'motorcycle', and 'bicycle'. I preprocessed the images by resizing and padding them to 630x630 while preserving the aspect ratio. Following the training procedure outlined in the Faster R-CNN tutorial, I trained the model for 15 epochs.
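
For reference, a resize-and-pad step of this kind can be sketched as follows. This is not the poster's preprocessing code: `letterbox_params` and `transform_box` are hypothetical helpers, and centered padding is an assumption (the post does not say where the padding goes). The point is that the ground-truth boxes must go through the same scale and offset as the pixels.

```python
def letterbox_params(width, height, target=630):
    """Scale and padding that fit a (width x height) image into a target square."""
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x = (target - new_w) // 2  # assumed: padding split evenly left/right
    pad_y = (target - new_h) // 2  # assumed: padding split evenly top/bottom
    return scale, pad_x, pad_y

def transform_box(box, scale, pad_x, pad_y):
    """Map an (x1, y1, x2, y2) box from original pixels to the padded square."""
    x1, y1, x2, y2 = box
    return (x1 * scale + pad_x, y1 * scale + pad_y,
            x2 * scale + pad_x, y2 * scale + pad_y)

# Cityscapes frames are 2048x1024 pixels.
scale, pad_x, pad_y = letterbox_params(2048, 1024)
box = transform_box((1000, 400, 1040, 500), scale, pad_x, pad_y)
print(scale, pad_x, pad_y, box)
```

At this scale a 2048x1024 frame shrinks by a factor of about 3.25, so a pedestrian that is 40 pixels wide in the original ends up roughly 12 pixels wide, which is less than one 14-pixel patch.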

Issue Encountered: The model's performance is disappointing. It only predicts the 'car', 'person', and 'rider' classes, and even these predictions are poor: it places bounding boxes on unrelated parts of the image. It never predicts the other classes at all (their AP scores are 0, while the AP scores for these three classes are 0.99). These results fall well short of what I expected, given how well the model performs on semantic segmentation.
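
One quick diagnostic for a pattern like this is to count how many ground-truth boxes each class has in the training targets. The sketch below assumes torchvision-style targets (a list of dicts with a `labels` entry) and 1-based labels with 0 reserved for background, which is the torchvision detection convention; `count_instances` is a hypothetical helper, and the toy targets are made up. Whether class imbalance explains the zero APs here is not established by this thread.

```python
from collections import Counter

CLASSES = ["person", "rider", "car", "truck", "bus", "train", "motorcycle", "bicycle"]

def count_instances(targets, class_names=CLASSES):
    """Count boxes per class; labels are assumed 1-based, with 0 reserved for background."""
    counts = Counter()
    for target in targets:
        for label in target["labels"]:
            counts[class_names[int(label) - 1]] += 1
    return {name: counts.get(name, 0) for name in class_names}

# Made-up targets, only to show the expected input shape.
toy_targets = [{"labels": [3, 3, 1]}, {"labels": [3, 2]}, {"labels": [1, 8]}]
print(count_instances(toy_targets))
```

Under the same convention, `num_classes` passed to FasterRCNN has to include the background class, so eight object classes correspond to `num_classes=9`.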

Questions and Assistance Request:

Is there any existing documentation or examples of using DINOv2 for object detection tasks?
Is the DINOv2 backbone suitable for object detection tasks, or is it primarily designed for other purposes like semantic segmentation?
Any suggestions for modifications or alternative approaches to improve object detection results with DINOv2 on the Cityscapes dataset would be greatly appreciated.

Keracles commented 10 months ago

Same problem here

ami-navon commented 10 months ago

+1

hbhflw2000 commented 9 months ago

+1

YHallouard commented 9 months ago

+1

LorenzoFerriniCodes commented 9 months ago

+1

yananielsen commented 9 months ago

I found adjusting learning rate helpful - https://github.com/facebookresearch/dinov2/issues/276#issuecomment-1834232965
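
The linked comment is not reproduced here, so the following is only a generic sketch of one common way to adjust learning rates when fine-tuning a pretrained backbone under a new head: separate optimizer parameter groups. `TinyDetector` is a hypothetical stand-in so the example runs offline, the split on the `backbone.` name prefix is an assumption about how the model is organized, and the learning-rate values are placeholders, not recommendations from the thread.

```python
import torch
from torch import nn

class TinyDetector(nn.Module):
    """Hypothetical stand-in with a 'backbone' and a 'head', so the sketch runs offline."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.head = nn.Linear(8, 4)

def build_optimizer(model, head_lr=1e-3, backbone_lr=1e-5):
    """Give the pretrained backbone a smaller learning rate than the new head."""
    backbone_params = [p for n, p in model.named_parameters()
                       if n.startswith("backbone.") and p.requires_grad]
    head_params = [p for n, p in model.named_parameters()
                   if not n.startswith("backbone.") and p.requires_grad]
    return torch.optim.AdamW([
        {"params": backbone_params, "lr": backbone_lr},  # placeholder value
        {"params": head_params, "lr": head_lr},          # placeholder value
    ], weight_decay=1e-4)

optimizer = build_optimizer(TinyDetector())
print([group["lr"] for group in optimizer.param_groups])
```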

BRAINIAC2677 commented 9 months ago

@busenuraktilav I used your code to configure dinov2_vits14_pretrain with FasterRCNN. How did you adapt the embedding output of the DINOv2 backbone to the RPNHead of FasterRCNN, which expects a spatial feature map? When I run it, I get the following error:

RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 384]

I didn't find any useful resources about this.
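
A shape of [1, 384] suggests the backbone is returning the pooled class-token embedding rather than the per-patch tokens. One possible way to get a spatial map is to take the patch tokens and reshape them into a 2D grid, as sketched below. This is not the original poster's `Dinov2Backbone` (which is not shown in the thread). `tokens_to_map`, `PatchTokenBackbone`, and `_StandInViT` are hypothetical names, and the sketch assumes the wrapped model exposes `forward_features` returning a dict with an `x_norm_patchtokens` entry of shape [B, N, C], as the DINOv2 hub models appear to do.

```python
from collections import OrderedDict

import torch
from torch import nn

PATCH = 14  # patch size of the ViT-S/14 variant

def tokens_to_map(tokens, height, width, patch=PATCH):
    """Reshape patch tokens [B, N, C] into a feature map [B, C, H/patch, W/patch]."""
    b, n, c = tokens.shape
    grid_h, grid_w = height // patch, width // patch
    if n != grid_h * grid_w:
        raise ValueError(f"expected {grid_h * grid_w} tokens, got {n}")
    return tokens.transpose(1, 2).reshape(b, c, grid_h, grid_w)

class PatchTokenBackbone(nn.Module):
    """Wraps a ViT so that it returns a spatial feature map under the key '0'."""
    def __init__(self, vit, out_channels=384):
        super().__init__()
        self.vit = vit
        self.out_channels = out_channels  # attribute read by torchvision's FasterRCNN

    def forward(self, images):
        _, _, h, w = images.shape
        # Assumed output key; check it against the model you actually load.
        tokens = self.vit.forward_features(images)["x_norm_patchtokens"]
        return OrderedDict([("0", tokens_to_map(tokens, h, w))])

class _StandInViT(nn.Module):
    """Hypothetical stand-in that mimics only the output shape, so the sketch runs offline."""
    def forward_features(self, x):
        b, _, h, w = x.shape
        return {"x_norm_patchtokens": torch.zeros(b, (h // PATCH) * (w // PATCH), 384)}

feats = PatchTokenBackbone(_StandInViT())(torch.zeros(1, 3, 630, 630))
print(feats["0"].shape)
```

With the real model, the stand-in would be replaced by something like `torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")`, and the input height and width need to be multiples of 14. The DINOv2 models also seem to offer `get_intermediate_layers(x, n=1, reshape=True)`, which performs a similar reshape internally.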

fsi070614 commented 8 months ago

any updates?

Ankowa commented 8 months ago

Hi, any updates on this issue?

captainfffsama commented 7 months ago

I attempted to add a simple FPN (Feature Pyramid Network) layer to Deformable DETR in mmdetection, following the structure of ViTDet, and then ported it over. I trained and tested it on my own dataset, but the results were also poor. During training, I froze the parameters of the DINOv2 part.
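
For readers unfamiliar with the ViTDet-style simple feature pyramid mentioned here, the sketch below shows the general idea: derive several scales from the single-scale ViT feature map using transposed convolutions for upsampling and max pooling for downsampling. It is not the commenter's mmdetection code. `SimplePyramid` is a hypothetical, simplified module (the normalization layers used in ViTDet are omitted), and the channel counts are assumptions based on ViT-S/14.

```python
from collections import OrderedDict

import torch
from torch import nn

class SimplePyramid(nn.Module):
    """Builds four scales (4x, 2x, 1x, 0.5x) from one single-scale feature map."""
    def __init__(self, in_channels=384, out_channels=256):
        super().__init__()
        half, quarter = in_channels // 2, in_channels // 4
        self.stages = nn.ModuleList([
            nn.Sequential(  # 4x upsampling via two stride-2 transposed convolutions
                nn.ConvTranspose2d(in_channels, half, kernel_size=2, stride=2),
                nn.GELU(),
                nn.ConvTranspose2d(half, quarter, kernel_size=2, stride=2),
            ),
            nn.ConvTranspose2d(in_channels, half, kernel_size=2, stride=2),  # 2x
            nn.Identity(),                                                   # 1x
            nn.MaxPool2d(kernel_size=2, stride=2),                           # 0.5x
        ])
        stage_channels = [quarter, half, in_channels, in_channels]
        # Project every level to the same channel count, as detection heads expect.
        self.project = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(c, out_channels, kernel_size=1),
                nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            )
            for c in stage_channels
        ])
        self.out_channels = out_channels

    def forward(self, feature_map):
        outputs = OrderedDict()
        for i, (stage, proj) in enumerate(zip(self.stages, self.project)):
            outputs[str(i)] = proj(stage(feature_map))
        return outputs

pyramid = SimplePyramid()
levels = pyramid(torch.zeros(1, 384, 45, 45))
print({name: tuple(level.shape) for name, level in levels.items()})
```

If a pyramid like this were placed in front of torchvision's FasterRCNN, the RoI pooler's `featmap_names` and the anchor generator's `sizes` would each need one entry per level.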

dgcnz commented 3 months ago

Not entirely related, but I trained DINOv2 (frozen) + ViTDet + DINO on COCO, and it seems to perform well.

facebookresearch/dinov2, issue #350