Open busenuraktilav opened 10 months ago
Same problem here
+1
I found adjusting learning rate helpful - https://github.com/facebookresearch/dinov2/issues/276#issuecomment-1834232965
@busenuraktilav
I was using your code to configure dinov2_vits14_pretrain with Faster R-CNN. But how did you adapt the embedding output of the DINOv2 backbone to the RPNHead of Faster R-CNN, which expects a spatial feature map?
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 384]
I didn't find any useful resources about this.
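For what it's worth, the `[1, 384]` shape suggests the model was called directly, which returns only the pooled CLS embedding. A sketch of one way to recover a spatial map, assuming the torch.hub DINOv2 `forward_features` API (which returns patch tokens under `"x_norm_patchtokens"`); the helper name is mine:

```python
import torch

def tokens_to_feature_map(patch_tokens, img_h, img_w, patch_size=14):
    """Reshape [B, N, C] DINOv2 patch tokens into a [B, C, H/14, W/14]
    spatial feature map that conv-based heads like RPNHead can consume."""
    b, n, c = patch_tokens.shape
    h, w = img_h // patch_size, img_w // patch_size
    assert n == h * w, "token count must match the patch grid"
    return patch_tokens.permute(0, 2, 1).reshape(b, c, h, w)

# Stand-in for model.forward_features(img)["x_norm_patchtokens"]
# on a 630x630 input (630 / 14 = 45 patches per side):
tokens = torch.randn(1, 45 * 45, 384)
fmap = tokens_to_feature_map(tokens, 630, 630)
print(fmap.shape)  # torch.Size([1, 384, 45, 45])
```

Note that input height and width must be divisible by the 14-pixel patch size for the reshape to line up.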
any updates?
Hi, any updates on this issue?
I attempted to add a simple FPN (Feature Pyramid Network) layer to Deformable DETR in mmdetection, following the structure of ViTDet, and then ported it over. I trained and tested it on my own dataset, but the results were also poor. During training, I froze the parameters of the DINOv2 part.
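For reference, the single-scale-to-pyramid idea from ViTDet can be sketched like this (the module, channel counts, and scale choices are illustrative, not the actual mmdetection config):

```python
import torch
import torch.nn as nn

class SimplePyramid(nn.Module):
    """Build a small multi-scale pyramid from one ViT feature map,
    roughly in the spirit of ViTDet's SimpleFeaturePyramid."""
    def __init__(self, in_channels=384, out_channels=256):
        super().__init__()
        self.up2 = nn.ConvTranspose2d(in_channels, in_channels, 2, stride=2)  # 2x scale
        self.keep = nn.Identity()                                             # 1x scale
        self.down2 = nn.MaxPool2d(2)                                          # 0.5x scale
        # 1x1 lateral convs project each scale to a common channel width
        self.lateral = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, 1) for _ in range(3)
        )

    def forward(self, x):
        feats = [self.up2(x), self.keep(x), self.down2(x)]
        return [lat(f) for lat, f in zip(self.lateral, feats)]

# A 45x45 map (630x630 input / patch 14) becomes three scales:
outs = SimplePyramid()(torch.randn(1, 384, 45, 45))
print([tuple(o.shape) for o in outs])
# [(1, 256, 90, 90), (1, 256, 45, 45), (1, 256, 22, 22)]
```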
Not entirely related, but I trained DINOv2 (frozen) + ViTDet + DINO on COCO and it seems to perform well.
I am working on an object detection task using the DINOv2 backbone with a Faster R-CNN head. While I have successfully implemented semantic segmentation with a linear head on the Cityscapes dataset and replicated the results from the relevant paper, I am encountering significant challenges in applying the DINOv2 backbone for object detection.
I used the dinov2_vits14_pretrain model and added a Faster R-CNN head as follows:
Dataset and Training: I used the Cityscapes dataset, which includes classes such as 'person', 'rider', 'car', 'truck', 'bus', 'train', 'motorcycle', and 'bicycle'. I preprocess the images by resizing and padding while preserving the aspect ratio, transforming them to 630x630. Following the training procedure outlined in the Faster R-CNN tutorial, I trained the model for 15 epochs.
Issue Encountered: The model's performance is disappointing. It only predicts the 'car', 'person', and 'rider' classes, and even these predictions are poor: it places bounding boxes on unrelated parts of the image and never predicts the other classes at all (their AP scores are 0, while these three classes score 0.99). The results do not align with the expected performance, considering the model's capabilities on semantic segmentation tasks.
Questions and Assistance Request: