facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

Segmentation and Depth Heads with Registers? #295

Open legel opened 10 months ago

legel commented 10 months ago

When I try to download the new DINOv2 pre-trained weights with registers (which I'm excited to try, given their better performance at eliminating background noise), it's unclear how to integrate them with the depth or segmentation heads.

Please indicate if this is possible or if we are still waiting on the release of newly trained heads for the models with registers.
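
For reference, the register backbones themselves load fine via torch.hub; it's the matching head weights that seem to be missing. A minimal sketch of what I'm running (entry names per the README; ViT-L/14 assumed):

```python
import torch

# Backbones with registers are published on torch.hub
# (names per the repo README; ViT-L/14 assumed here).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14_reg")
backbone.eval()

x = torch.randn(1, 3, 518, 518)  # H and W must be multiples of the 14px patch size
with torch.no_grad():
    feats = backbone.forward_features(x)

# Patch tokens are what a depth/segmentation head would consume; the
# 4 register tokens are already stripped from "x_norm_patchtokens".
print(feats["x_norm_patchtokens"].shape)  # (1, 37*37, 1024)
```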

qasfb commented 10 months ago

Thanks for your interest! For now, the existing heads are not compatible with the register models. We cannot make any promises at this point, but this is a good signal for us to prioritize this!

legel commented 10 months ago

Thanks for the fast reply and useful guidance.

I'm happy to share some feedback and notes for fellow researchers.

We compared the metric accuracy of the Meta DINOv2 depth head (without registers) against an Apple iPhone LiDAR sensor that we have found to be fairly accurate.

Here is an input image: [image: 000000]

For the distilled ViT-L/14 backbone (300M params) with the DPT head trained on NYU-Depth V2, the following is a depth visualization using the Google Turbo colormap, where dark blue = 0.0 meters (nearest) and dark red = 3.0 meters (farthest).

[image: meta_inference]
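
For reproducibility, both depth visualizations here are rendered the same way; a minimal sketch, assuming a metric depth map in meters and matplotlib >= 3.5:

```python
import numpy as np
from matplotlib import colormaps  # requires matplotlib >= 3.5
from PIL import Image

def render_turbo(depth_m: np.ndarray, near: float = 0.0, far: float = 3.0) -> Image.Image:
    """Map a metric depth map (H, W) in meters onto the Turbo colormap:
    dark blue at `near`, dark red at `far`."""
    t = np.clip((depth_m - near) / (far - near), 0.0, 1.0)
    rgb = (colormaps["turbo"](t)[..., :3] * 255).astype(np.uint8)
    return Image.fromarray(rgb)
```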

Compare the Meta inference above with the same data from the Apple iPhone LiDAR sensor: [image: apple_sensor]

At first glance, you might think the DINOv2 depth estimation is superior, because more details appear to be contrasted. However, if you study the scene, there is clearly a big separation in distance between the plant and the background, which shows up in the Apple depth sensor data but not in the Meta monocular depth inference.

This is a dealbreaker for us. I'm noting it because it's possible that a depth head trained with registers would help resolve this.
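
For reference, here is roughly how we score the two depth maps against each other; a sketch assuming both are aligned (H, W) arrays in meters, with invalid LiDAR returns encoded as 0:

```python
import numpy as np

def depth_metrics(pred_m: np.ndarray, lidar_m: np.ndarray) -> dict:
    """Standard monocular-depth metrics against a LiDAR reference map."""
    valid = lidar_m > 0           # mask out pixels with no LiDAR return
    pred, gt = pred_m[valid], lidar_m[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    delta1 = np.mean(np.maximum(pred / gt, gt / pred) < 1.25)
    return {"abs_rel": abs_rel, "rmse": rmse, "delta<1.25": delta1}
```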

In any case, I'll note one more thing for the authors and developers behind the depth estimation efforts: I was pretty shocked that the benchmarks and training were based on such old and error-ridden datasets (mostly from over a decade ago). Actually, I think you have a fantastic opportunity to demonstrate dramatically superior performance with the underlying DINOv2 backbone by training a depth head on a modern 3D dataset. For example:

- ARKitScenes, released by Apple in 2021 for the same depth sensor visualized above: https://github.com/apple/ARKitScenes
- CO3D, a Meta dataset from 2021 that could also work: https://github.com/facebookresearch/co3d

More generally, I bet you could really blow this problem out of the water by curating a new dataset from even just a few hundred photorealistic 3D reconstructions built with today's state-of-the-art tools.

Good luck and thanks to everyone on the team for their hard work, and thanks (and congrats) to Meta for open-sourcing what is probably the best set of pre-trained weights for vision tasks today.

qasfb commented 10 months ago

Thanks!

Improving depth is one of the items on our list. In the DINOv2 paper we used well-known benchmarks in order to compare against other models (which limits the amount of annotated training data for the task) and to assess the generality of our vision model; depth estimation in itself was not the main focus of that work.

The references you are pointing to are very relevant, and we hope to leverage them in the near future. If by any chance you (or any interested party) happen to train a depth estimator with DINOv2 models, please do let us know!

legel commented 10 months ago

Thanks, will do.

Please also post here if there are updates / progress on your end.
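
For anyone else who wants to try in the meantime, this is the shape of the simplest depth estimator I have in mind: a frozen backbone plus a linear head over patch tokens, loosely following the linear depth evaluation described in the paper (the bin count and depth range below are my own choices, not the paper's exact recipe):

```python
import torch
import torch.nn as nn

# Frozen DINOv2 backbone; only the linear head would be trained.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
backbone.eval().requires_grad_(False)

NUM_BINS, MAX_DEPTH = 256, 10.0          # assumptions, not the paper's values
head = nn.Linear(1024, NUM_BINS)          # 1024 = ViT-L embed dim
bin_centers = (torch.arange(NUM_BINS) + 0.5) * (MAX_DEPTH / NUM_BINS)

def predict_depth(img: torch.Tensor) -> torch.Tensor:
    """img: (B, 3, H, W), H and W multiples of 14 -> (B, H//14, W//14) in meters."""
    B, _, H, W = img.shape
    with torch.no_grad():
        tokens = backbone.forward_features(img)["x_norm_patchtokens"]  # (B, N, 1024)
    probs = head(tokens).softmax(dim=-1)   # per-patch distribution over depth bins
    depth = probs @ bin_centers            # soft-argmax over bin centers
    return depth.reshape(B, H // 14, W // 14)
```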

BrianPulfer commented 9 months ago

I am also very interested in this. Please notify me if/when register-based backbones are supported for segmentation and depth estimation.

It would also be very helpful to add instructions on how to evaluate the segmentation and depth estimation capabilities of the pre-trained backbones as done in the paper. So far, I am deducing that either mmsegmentation/tools/train.py or mmsegmentation/tools/slurm_train.sh is used with configs from, e.g., here, but I still need to figure out the details to make sure I am reproducing the experiments correctly.
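
In the meantime, here is how I am driving inference with the released (non-register) heads through the standard mmsegmentation 0.x APIs; a sketch with placeholder paths, since the actual linear-head config/checkpoint links are in the repo README:

```python
import mmcv
from mmcv.runner import load_checkpoint
from mmseg.apis import inference_segmentor, init_segmentor

# Assumption: importing the repo's eval module registers the DINOv2
# backbone classes with mmseg (run from a dinov2 checkout).
import dinov2.eval.segmentation.models

# Placeholder paths -- see the README for the real config/checkpoint links.
cfg = mmcv.Config.fromfile("dinov2_vitl14_ade20k_linear_config.py")
model = init_segmentor(cfg, device="cuda:0")
load_checkpoint(model, "dinov2_vitl14_ade20k_linear_head.pth", map_location="cpu")
model.eval()

seg_map = inference_segmentor(model, "example.jpg")[0]  # (H, W) array of class ids
```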

xyzhang626 commented 5 months ago

hey @BrianPulfer, have you made any progress since this post? It seems DINOv2 is the only pre-trained model evaluated with a linear probe on segmentation, and no other references could be found. I'm also trying to reproduce the result, so any pointers would be super helpful. Thanks in advance.
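
For what it's worth, this is the linear-probe setup I am assuming from the paper's description; a sketch, not the exact recipe (crop size, class count, and training details are my guesses):

```python
import torch
import torch.nn as nn

# Frozen backbone; only the per-patch linear classifier is trained,
# e.g. with cross-entropy against downsampled ground-truth masks.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
backbone.eval().requires_grad_(False)

num_classes = 150                     # ADE20K
probe = nn.Linear(1024, num_classes)  # 1024 = ViT-L embed dim

def seg_logits(img: torch.Tensor) -> torch.Tensor:
    """img: (B, 3, H, W), H and W multiples of 14 -> (B, C, H, W) logits."""
    B, _, H, W = img.shape
    h, w = H // 14, W // 14
    with torch.no_grad():
        tokens = backbone.forward_features(img)["x_norm_patchtokens"]  # (B, h*w, 1024)
    patch_logits = probe(tokens).permute(0, 2, 1).reshape(B, num_classes, h, w)
    # Upsample per-patch logits back to pixel resolution.
    return nn.functional.interpolate(patch_logits, size=(H, W), mode="bilinear")
```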