facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

Evaluation on Segmentation and Depth estimation for backbones #328

Open BrianPulfer opened 9 months ago

BrianPulfer commented 9 months ago

While instructions for evaluating classification capabilities are well documented, there is currently no information on how to evaluate the pre-trained DINOv2 backbones on segmentation / depth estimation (i.e. keeping the backbone frozen while segmentation / depth estimation heads are trained, and measuring the resulting metrics).

The closest I got to a working evaluation was running mmsegmentation/tools/train.py (from version 0.27 of mmseg) on a single node with the configuration for ViTs, but this still fails with a TypeError.

I would be very grateful if someone could point me in the right direction. I'd also be happy to include this information in the README.

BrianPulfer commented 9 months ago

Another confusing point is the following: the DINOv2 paper says that, for segmentation, images of size 512x512 are processed in patches of size 16x16 to obtain a 32x32 feature map, which is then bilinearly interpolated to the target output size.

However, all the pre-trained models come with a 14x14 patch embedding layer. Is the paper incorrect in this regard, or should one use different pre-trained backbones? If so, where can they be found?

sohamghosh121 commented 8 months ago

I believe they get around the 14x14 patch size issue by center-padding the input with zeros so that the image dimensions become a multiple of the patch size (rough sketch below).
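
For concreteness, a minimal sketch of that padding idea (my own code, not from the repo), assuming an NCHW tensor and patch size 14:

import torch
import torch.nn.functional as F

def center_pad_to_multiple(x: torch.Tensor, multiple: int = 14) -> torch.Tensor:
    # Zero-pad the last two dims so H and W become divisible by `multiple`,
    # splitting the padding as evenly as possible between the two sides.
    h, w = x.shape[-2:]
    pad_h = (multiple - h % multiple) % multiple
    pad_w = (multiple - w % multiple) % multiple
    # F.pad pads the last dims first, in the order (left, right, top, bottom)
    return F.pad(x, (pad_w // 2, pad_w - pad_w // 2, pad_h // 2, pad_h - pad_h // 2))

x = torch.randn(1, 3, 512, 512)
print(center_pad_to_multiple(x).shape)  # torch.Size([1, 3, 518, 518]) -> a 37x37 patch grid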

+1 for needing some pointers to be able to run segmentation and depth evals out of the box.

adipill04 commented 6 months ago

Hi Brian. I'm in the same boat as you. Did you happen to find a way to evaluate the pretrained DINOv2 models?

BrianPulfer commented 6 months ago

Yes, I managed to train linear classifiers on top of the pre-trained backbone for segmentation. It was a bit of an Odyssey:

First, you want to install version 0.27 of mmsegmentation locally: clone the mmsegmentation repo, check out the v0.27.0 tag, and run pip install -e . from the repo root.

Then, you want to modify the library with the files shared here. Substitute the empty classes with their implementations (e.g. fill dinov2/eval/segmentation/models/backbones/vision_transformer.py with the implementation from dinov2/models/vision_transformer.py, keeping the @BACKBONES.register_module() decorator). Also modify mmseg/models/backbones/__init__.py and mmseg/models/decode_heads/__init__.py to include the custom backbone and heads.
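
Roughly, the substituted backbone file ends up with this shape (a sketch, not the verbatim file; the import path for the real implementation is the one in this repo, the wiring may differ in detail):

from mmseg.models.builder import BACKBONES

from dinov2.models.vision_transformer import DinoVisionTransformer as _DinoVisionTransformer

@BACKBONES.register_module()
class DinoVisionTransformer(_DinoVisionTransformer):
    # Replaces the empty stub shipped in
    # dinov2/eval/segmentation/models/backbones/vision_transformer.py with the real
    # implementation, keeping the registry decorator so mmseg can build the model
    # from type="DinoVisionTransformer" in the config.
    pass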

Finally, you can run the training script mmsegmentation/tools/train.py using these configurations (you can get configs for other sizes by substituting the s in "vits" with b, l or g).
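
A quick way to sanity-check that the config builds before launching a full run (a sketch using the standard mmcv 1.x / mmseg 0.27 API; the config filename is a placeholder for wherever you saved yours):

from mmcv import Config
from mmseg.models import build_segmentor  # importing mmseg.models runs the registrations added to the __init__.py files above

cfg = Config.fromfile("path/to/vits_config.py")  # placeholder
model = build_segmentor(cfg.model, train_cfg=cfg.get("train_cfg"), test_cfg=cfg.get("test_cfg"))
print(type(model.backbone).__name__)  # should print DinoVisionTransformer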

The configs do not seem to work out of the box, so I had to modify a few things. In particular, I changed the image size from 512 to 518 everywhere in the file, since 512 is not divisible by the 14x14 patch size. Note that this results in 1369 patches of size 14x14 instead of 1024 patches of size 16x16; if you want 1024 patches, the image size has to be reduced to 448x448.
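
A quick check of the patch-grid arithmetic (plain Python):

patch = 14
for size in (512, 518, 448):
    print(size, size % patch == 0, (size // patch) ** 2)
# 512 False 1296  -> not divisible by 14, hence the change to 518
# 518 True  1369  -> 37x37 patches
# 448 True  1024  -> 32x32 patches

With that settled, here's an example of how I had to change the model and head keys in the config for the small model: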

model = dict(
    type="EncoderDecoder",
    pretrained="path/to/pretrained.pth",  # local checkpoint (see the note below)
    backbone=dict(
        type="DinoVisionTransformer",  # the custom backbone registered above
        out_indices=[8, 9, 10, 11],
        img_size=518,  # 518 = 37 * 14, divisible by the patch size
        block_chunks=0,
        init_values=1,
        patch_size=14,
        embed_dim=384,  # ViT-S/14: width 384, depth 12, 6 heads
        depth=12,
        num_heads=6,
        mlp_ratio=4,
    ),
    decode_head=dict(
        type="BNHead",  # the linear segmentation head
        in_channels=[384],
        in_index=[3],
        input_transform="resize_concat",
        channels=384,
        dropout_ratio=0,
        num_classes=150,  # ADE20K
        norm_cfg=dict(type="SyncBN", requires_grad=True),
        align_corners=False,
        loss_decode=dict(type="CrossEntropyLoss", use_sigmoid=False, loss_weight=1.0),
    ),
    test_cfg=dict(mode="slide", crop_size=(518, 518), stride=(341, 341)),
)
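
One thing the config glosses over: pretrained= points at a checkpoint on disk. If you don't have one yet, a hedged way to produce it from the released weights (dinov2_vits14 is a documented torch.hub entry point of this repo; whether the state-dict keys match the mmseg wrapper without remapping is an assumption):

import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")  # vitb14 / vitl14 / vitg14 for other sizes
torch.save({"state_dict": backbone.state_dict()}, "path/to/pretrained.pth")
# If mmcv's checkpoint loading reports missing/unexpected keys, the names need
# remapping to whatever the registered DinoVisionTransformer expects.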

I gave up on doing the same for depth estimation, as the NYUd dataset is missing from that version of MMSegmentation, and the files needed to make it work are not shared, as far as I can tell.

Let me know if I forgot something 😅