facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0
8.73k stars · 748 forks

Convert DINOv2 ViT Weights to ViT-Adapter #241

Open MatCorr opened 11 months ago

MatCorr commented 11 months ago

In the recently released segmentation notebook, a trained Mask2Former segmenter is loaded. In its structure, you can see that a ViT-Adapter is used as the backbone, not a standard ViT, which is what DINOv2 produces.

So my question is, how was that model trained? I'm assuming the weights produced by DINOv2 were loaded into a ViT-Adapter (via some sort of conversion) and then the Mask2Former structure was trained using mmsegmentation, but it's not clear how that was done.

Am I missing something? How was that conversion done?

dillonalaird commented 11 months ago

ViT-Adapter wraps around the DINOv2 model with injector and extractor modules (see the paper here), so all you need to do is build the ViTAdapter model from here and pass in the DINOv2 backbone as the pretrained weights. In the segmentation section of the DINOv2 paper you can see they train the adapter weights and the head but keep the backbone frozen.
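The frozen-backbone setup described above can be sketched in plain PyTorch. `TinyBackbone` and `TinyAdapter` below are hypothetical stand-ins for the DINOv2 ViT and the ViTAdapter wrapper, not the real classes:

```python
import torch
import torch.nn as nn

# Minimal sketch of the training setup: the pretrained backbone is frozen
# while the adapter/head parameters stay trainable. TinyBackbone and
# TinyAdapter are toy stand-ins, not the real DINOv2 / ViT-Adapter classes.
class TinyBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(16, 16)

    def forward(self, x):
        return self.proj(x)

class TinyAdapter(nn.Module):
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone          # pretrained, frozen
        self.head = nn.Linear(16, 4)      # trained from scratch

    def forward(self, x):
        return self.head(self.backbone(x))

backbone = TinyBackbone()
# freeze the pretrained backbone, as in the DINOv2 segmentation setup
for p in backbone.parameters():
    p.requires_grad = False

model = TinyAdapter(backbone)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
# only the adapter/head parameters remain trainable
```

In the real setup, the adapter's injector/extractor modules and the Mask2Former head play the role of `head` here.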

MatCorr commented 11 months ago

Ok, thanks!

One thing is still not clear to me, though. Do we have the script for training the Mask2Former model?

dillonalaird commented 11 months ago

It's run using MMLab, specifically MMSegmentation. You can follow the notebook here to load the mmsegmentation config file used to run the model. You may have to modify some of the configuration; I was able to train a smaller DINO backbone with a ViT adapter and Mask2Former head, but it took some time to get everything working.
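For reference, the kind of config overrides involved look roughly like this. This is a hypothetical mmsegmentation-style excerpt; the exact keys and values come from the config file shipped with the notebook, so treat everything below as illustrative placeholders:

```python
# hypothetical mmsegmentation config excerpt (MMLab python config syntax);
# key names follow MMLab conventions, values are placeholders only
model = dict(
    backbone=dict(
        type='ViTAdapter',
        # point this at your converted DINOv2 checkpoint (placeholder path)
        init_cfg=dict(type='Pretrained', checkpoint='dinov2_converted.pth'),
    ),
    decode_head=dict(type='Mask2FormerHead'),
)
```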

MatCorr commented 11 months ago

Thanks a bunch for the thoughtful response.

I had tried training through MMSegmentation but bumped into some odd errors, so I thought that maybe the training had been done in another way. Since you made it work, I'll go back to trying.

AlessioQuercia commented 10 months ago

> It's run using MMLab, specifically MMSegmentation. You can follow the notebook here to load the mmsegmentation config file used to run the model. […]

There are weight mismatches when loading the DINOv2 backbone state_dict into ViTAdapter. See the attached screenshot.

lilong-epfl commented 5 months ago

> It's run using MMLab, specifically MMSegmentation. You can follow the notebook here to load the mmsegmentation config file used to run the model. […]

Hi, I am trying to train a DINO backbone with ViT-Adapter, but I get "NotImplementedError: You must implement either the backward or vjp method for your custom autograd.Function to use it with backward mode AD." It looks like some part of the code is missing. Did you run into the same issue? Thanks!

MatCorr commented 5 months ago

> It's run using MMLab, specifically MMSegmentation. You can follow the notebook here to load the mmsegmentation config file used to run the model. […]

> There are weight mismatches when loading the DINOv2 backbone state_dict into ViTAdapter.

Yeah, the DINOv2 weights are named slightly differently from what MMSegmentation / ViTAdapter expects. You are going to need to convert their keys.
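The conversion itself is just a rename pass over the state_dict keys. A minimal sketch follows; the rename rules shown are illustrative placeholders, and the real mapping should be derived from the missing/unexpected keys that `load_state_dict(strict=False)` reports for your checkpoint:

```python
# Sketch of a state_dict key-conversion pass. The rename rules are
# hypothetical; derive the real ones from the mismatch report that
# load_state_dict(strict=False) prints for your model and checkpoint.
def convert_keys(state_dict, rules):
    """Apply (old_prefix, new_prefix) rename rules to a state_dict."""
    converted = {}
    for key, value in state_dict.items():
        for old, new in rules:
            if key.startswith(old):
                key = new + key[len(old):]
                break  # apply at most one rule per key
        converted[key] = value
    return converted

# hypothetical example: the adapter expects keys under a 'backbone.' prefix
rules = [
    ("blocks.", "backbone.blocks."),
    ("patch_embed.", "backbone.patch_embed."),
]
```

Usage would be `model.load_state_dict(convert_keys(ckpt, rules), strict=False)`, checking that the remaining mismatches are only the freshly initialized adapter/head parameters.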

> It's run using MMLab, specifically MMSegmentation. You can follow the notebook here to load the mmsegmentation config file used to run the model. […]

> Hi, I am trying to train a DINO backbone with ViT-Adapter, but I get "NotImplementedError: You must implement either the backward or vjp method for your custom autograd.Function to use it with backward mode AD." […]

I never had that error, sorry. =/

hubhub086 commented 4 months ago

> It's run using MMLab, specifically MMSegmentation. You can follow the notebook here to load the mmsegmentation config file used to run the model. […]

> There are weight mismatches when loading the DINOv2 backbone state_dict into ViTAdapter.

It seems like your DINOv2 checkpoint uses swiglufused as its ffn_layer, but ViTAdapter uses a plain Mlp. You may need to replace the Mlp layer in ViTAdapter with a SwiGLUFFNFused layer.
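For context, a SwiGLU feed-forward has the following rough shape. This is a simplified sketch in the spirit of DINOv2's fused variant, not the repository's exact `SwiGLUFFNFused` class; it shows why such a checkpoint cannot be loaded into a plain two-layer `fc1`/`fc2` Mlp:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Simplified SwiGLU feed-forward (sketch, not DINOv2's exact class).

    The fused w12 projection produces 2 * hidden features (gate + value),
    so its weight shapes differ from a plain fc1/fc2 Mlp and the
    state_dicts are not interchangeable.
    """
    def __init__(self, dim, hidden):
        super().__init__()
        self.w12 = nn.Linear(dim, 2 * hidden)  # fused gate + value projection
        self.w3 = nn.Linear(hidden, dim)

    def forward(self, x):
        x1, x2 = self.w12(x).chunk(2, dim=-1)
        return self.w3(F.silu(x1) * x2)        # SiLU-gated linear unit

ffn = SwiGLUFFN(dim=16, hidden=32)
out = ffn(torch.randn(2, 16))
```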

Vishwesh4 commented 4 months ago

> It's run using MMLab, specifically MMSegmentation. You can follow the notebook here to load the mmsegmentation config file used to run the model. […]

> Hi, I am trying to train a DINO backbone with ViT-Adapter, but I get "NotImplementedError: You must implement either the backward or vjp method for your custom autograd.Function to use it with backward mode AD." […]

The class "MSDeformAttnFunction" in this repository seems to be missing the backward function. If you want to train the adapter, you can refer to the code in this link, which includes the backward function as well.
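The error just means the custom `torch.autograd.Function` defines `forward` but not `backward`. A toy example of the pattern the deformable-attention op needs (a square function, not the actual MSDeformAttn CUDA kernel):

```python
import torch

# Toy illustration of the autograd.Function pattern the error is about:
# without a backward() (or vjp), backward-mode AD raises NotImplementedError.
# MSDeformAttnFunction needs the same structure, backed by its real kernels.
class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)   # stash inputs needed for the gradient
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x  # d(x^2)/dx = 2x

x = torch.tensor(3.0, requires_grad=True)
Square.apply(x).backward()
# x.grad is now 6.0
```

In the missing-backward case, only `forward` exists, so calling `.backward()` through the op fails with exactly the NotImplementedError quoted above.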