facebookresearch / ConvNeXt-V2


Pre-trained weights incompatible with backbone #65

Open LucFrachon opened 10 months ago

LucFrachon commented 10 months ago

The weights for ConvNeXt-V2-Base, pretrained on ImageNet-1k, provided here, have several incompatibilities with the encoder architecture, in particular:

- parameter names that differ between the checkpoint and the model definition;
- tensor shapes that do not match the corresponding model parameters.

These can all be fixed with some code, but it would make life easier for everyone if you could upload the correct weights. Thanks a lot!

blackpearl1022 commented 4 months ago

@LucFrachon I have the same problem. Do you have any updates on your side? Thanks!

LucFrachon commented 4 months ago

I wrote code to update the state dict and it worked. I haven't checked if the provided checkpoints have been updated. I've moved on to other things now...
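
For anyone landing here later, here is a minimal sketch of that kind of state-dict remapping. The exact key renames and shapes depend on the checkpoint you downloaded, so the `.gamma`/`.beta` to `.grn.gamma`/`.grn.beta` rename below is only an assumption to adjust against the mismatches you actually see:

```python
import torch

def remap_checkpoint(ckpt_path, model):
    """Rename/reshape checkpoint tensors to match `model`'s state dict.

    Sketch only: the rename rule below is a placeholder; print the two
    key sets and adapt it to the mismatches you actually observe.
    """
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("model", state)  # weights may be nested under "model"

    target = model.state_dict()
    remapped = {}
    for k, v in state.items():
        # Hypothetical rename: GRN params stored as `.gamma`/`.beta` in the
        # checkpoint but expected as `.grn.gamma`/`.grn.beta` by the model.
        if k not in target:
            k = k.replace(".gamma", ".grn.gamma").replace(".beta", ".grn.beta")
        # Reshape when only the layout differs, e.g. (C,) vs (1, 1, 1, C).
        if k in target and v.shape != target[k].shape and v.numel() == target[k].numel():
            v = v.reshape(target[k].shape)
        remapped[k] = v

    missing, unexpected = model.load_state_dict(remapped, strict=False)
    print("still missing:", missing)
    print("still unexpected:", unexpected)
    return model
```

Loading with `strict=False` and printing the leftover keys makes it easy to iterate on the rename rules until everything matches.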

MarkoHaralovic commented 2 months ago

This issue is closely related to three other issues: https://github.com/facebookresearch/ConvNeXt-V2/issues/26, https://github.com/facebookresearch/ConvNeXt-V2/issues/33, and https://github.com/facebookresearch/ConvNeXt-V2/issues/47.

You correctly observed the mismatch in layer naming between the model and the checkpoint, as well as the shape mismatch. I also tried reshaping and renaming the weights, and it worked, but I would argue that the model still won't reconstruct masked images as intended, if used for that purpose.

If you inspect the weights more closely, you will see another issue: some parameters present in the model are missing from the checkpoint state dictionary. This is because the pretrained .pt files for all the models linked on the main page of this repository contain only the encoder; the decoder was stripped out.
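
This is easy to verify by inspecting the checkpoint keys directly. A sketch: the file name is a placeholder for whichever release checkpoint you downloaded, and the `decoder`/`pred`/`mask_token` prefixes are assumptions based on this repo's FCMAE naming, so check them against `models/fcmae.py`:

```python
import torch

# Placeholder path: use the release checkpoint you actually downloaded.
ckpt = torch.load("convnextv2_base_1k_224_fcmae.pt", map_location="cpu")
state = ckpt.get("model", ckpt)  # weights may be nested under "model"

# Assumed prefixes for the pretraining-only parts (check against fcmae.py).
decoder_keys = [k for k in state if k.startswith(("decoder", "pred", "mask_token"))]
print(f"{len(state)} keys total, {len(decoder_keys)} decoder-side keys")
# For the released .pt files this prints 0 decoder-side keys, which is why
# strict loading into the full pretraining model fails.
```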

For that reason, renaming and reshaping alone isn't a proper fix if you want to use the model for a reconstruction task, which I'd argue is the main use case for these checkpoints. Fine-tuning the weights still works, because fine-tuning for any downstream task (classification, object detection, segmentation, etc.) attaches its own freshly initialized head or decoder, so the encoder-only pretrained weights remain useful there.

The absence of the decoder weights in the .pt files is the main cause of the visualization problems people run into when trying to reconstruct a masked image, i.e., when using the pretrained weights for a reconstruction task. These are the visualization issues I found that I'd say are covered by this comment: https://github.com/facebookresearch/ConvNeXt-V2/issues/48, https://github.com/facebookresearch/ConvNeXt-V2/issues/42.

The visualization code itself is probably not at fault in those cases. I tried visualizing reconstructions with the MAE demo code (https://github.com/facebookresearch/mae/blob/main/demo/mae_visualize.ipynb), and all the images contained nothing but white noise in the masked areas. I also verified that the pretrained weights produce the same noise as a randomly initialized, non-pretrained model. That makes sense given that the decoder weights are absent from the .pt files: the reconstruction is effectively performed with a randomly initialized decoder. So reconstruction won't work even after the weights are renamed and reshaped; the solution is to train a decoder head.
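
The quickest way to see this in your own setup is a `strict=False` load into the full pretraining model. A sketch: the `FCMAE` import path and its constructor arguments are assumptions based on this repo's `models/fcmae.py`, so check the actual names and signature there:

```python
import torch
from models.fcmae import FCMAE  # assumed import path from this repo

# Build the full pretraining model (encoder + decoder); constructor
# arguments are placeholders, check fcmae.py for the real signature.
model = FCMAE()

state = torch.load("convnextv2_base_1k_224_fcmae.pt", map_location="cpu")
state = state.get("model", state)

# strict=False loads whatever matches and silently leaves the rest at its
# random initialization, which is exactly what happens to the decoder here.
missing, _ = model.load_state_dict(state, strict=False)
print(f"{len(missing)} parameters left randomly initialized:")
for k in missing:
    print(" ", k)
```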

I pretrained a ConvNeXt-V2 Atto model myself and saved all the weights, for both the encoder and the decoder. When I then went back to visualization, the reconstruction code worked and the weight mismatches were gone, since all the keys matched. The visualization code, among other things, can be found in my fork of this repository (https://github.com/MarkoHaralovic/ConvNeXt-V2).

watertianyi commented 1 month ago

@MarkoHaralovic https://github.com/huggingface/pytorch-image-models/issues/1922#issuecomment-2407036555