czczup / ViT-Adapter

[ICLR 2023 Spotlight] Vision Transformer Adapter for Dense Predictions
https://arxiv.org/abs/2205.08534
Apache License 2.0

mask2former_beit_adapter_large_896_80k_cocostuff164k.pth.tar is not a checkpoint file #36

Open Tsardoz opened 2 years ago

Tsardoz commented 2 years ago

I want to use these weights as a pretrained model for training on a smaller subset of the COCO-Stuff data. When I change

pretrained = 'pretrained/beit_large_patch16_224_pt22k_ft22k.pth'

to

pretrained = 'pretrained/mask2former_beit_adapter_large_896_80k_cocostuff164k.pth.tar'

I get this error message. How can I use your pretrained weights in a new model of the same structure with different data?

czczup commented 2 years ago


Hi, the file mask2former_beit_adapter_large_896_80k_cocostuff164k.pth.tar is a checkpoint of the entire model, including both the backbone and the head. If you want to use this checkpoint as initialization and fine-tune on another dataset, you should use load_from rather than pretrained (which only loads a pretrained backbone), like this:

load_from = 'pretrained/mask2former_beit_adapter_large_896_80k_cocostuff164k.pth.tar'
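
Roughly, the two options look like this in the config (a sketch; load_from is applied by the runner after the model is built, so it overrides any backbone-only initialization):

# Option A: backbone-only initialization -- `pretrained` points at the
# ImageNet-22K BEiT weights; the Mask2Former head starts from random init.
pretrained = 'pretrained/beit_large_patch16_224_pt22k_ft22k.pth'

# Option B: full-model initialization -- keep the model dict as it is and
# let `load_from` restore backbone + decode head from the released checkpoint.
load_from = 'pretrained/mask2former_beit_adapter_large_896_80k_cocostuff164k.pth.tar'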
Tsardoz commented 2 years ago

Thanks. Do I still need to define the model if it is loaded via load_from? If I delete the model definition (shown below) I get this error:

KeyError: 'XCiT is not in the models registry'

model = dict(
    type='EncoderDecoderMask2Former',
    pretrained=pretrained,
    backbone=dict(
        type='BEiTAdapter',
        img_size=896,
        patch_size=16,
        embed_dim=1024,
        depth=24,
        num_heads=16,
        mlp_ratio=4,
        qkv_bias=True,
        use_abs_pos_emb=False,
        use_rel_pos_bias=True,
        init_values=1e-6,
        drop_path_rate=0.3,
        conv_inplane=64,
        n_points=4,
        deform_num_heads=16,
        cffn_ratio=0.25,
        deform_ratio=0.5,
        with_cp=True,  # set with_cp=True to save memory
        interaction_indexes=[[0, 5], [6, 11], [12, 17], [18, 23]],
    ),
    decode_head=dict(
        in_channels=[1024, 1024, 1024, 1024],
        feat_channels=1024,
        out_channels=1024,
        num_queries=200,
        pixel_decoder=dict(
            type='MSDeformAttnPixelDecoder',
            num_outs=3,
            norm_cfg=dict(type='GN', num_groups=32),
            act_cfg=dict(type='ReLU'),
            encoder=dict(
                type='DetrTransformerEncoder',
                num_layers=6,
                transformerlayers=dict(
                    type='BaseTransformerLayer',
                    attn_cfgs=dict(
                        type='MultiScaleDeformableAttention',
                        embed_dims=1024,
                        num_heads=32,
                        num_levels=3,
                        num_points=4,
                        im2col_step=64,
                        dropout=0.0,
                        batch_first=False,
                        norm_cfg=None,
                        init_cfg=None),
                    ffn_cfgs=dict(
                        type='FFN',
                        embed_dims=1024,
                        feedforward_channels=4096,
                        num_fcs=2,
                        ffn_drop=0.0,
                        with_cp=True,  # set with_cp=True to save memory
                        act_cfg=dict(type='ReLU', inplace=True)),
                    operation_order=('self_attn', 'norm', 'ffn', 'norm')),
                init_cfg=None),
            positional_encoding=dict(
                type='SinePositionalEncoding', num_feats=512, normalize=True),
            init_cfg=None),
        positional_encoding=dict(
            type='SinePositionalEncoding', num_feats=512, normalize=True),
        transformer_decoder=dict(
            type='DetrTransformerDecoder',
            return_intermediate=True,
            num_layers=9,
            transformerlayers=dict(
                type='DetrTransformerDecoderLayer',
                attn_cfgs=dict(
                    type='MultiheadAttention',
                    embed_dims=1024,
                    num_heads=32,
                    attn_drop=0.0,
                    proj_drop=0.0,
                    dropout_layer=None,
                    batch_first=False),
                ffn_cfgs=dict(
                    embed_dims=1024,
                    feedforward_channels=4096,
                    num_fcs=2,
                    act_cfg=dict(type='ReLU', inplace=True),
                    ffn_drop=0.0,
                    dropout_layer=None,
                    with_cp=True,  # set with_cp=True to save memory
                    add_identity=True),
                feedforward_channels=4096,
                operation_order=('cross_attn', 'norm', 'self_attn', 'norm',
                                 'ffn', 'norm')),
            init_cfg=None)),
    test_cfg=dict(mode='slide', crop_size=crop_size, stride=(512, 512)))
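
As a quick sanity check (a sketch, assuming the file follows the usual MMSegmentation checkpoint layout with a 'state_dict' entry), you can inspect the released checkpoint to see that it holds both backbone and decode-head weights, which is why it belongs in load_from rather than the backbone-only pretrained slot:

import torch

# Despite the .pth.tar suffix, the file is a regular torch-serialized
# checkpoint, typically a dict with 'meta' and 'state_dict' entries.
ckpt = torch.load(
    'pretrained/mask2former_beit_adapter_large_896_80k_cocostuff164k.pth.tar',
    map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt)

# Keys are prefixed by sub-module, e.g. 'backbone.*' for the BEiT-Adapter
# and 'decode_head.*' for the Mask2Former head.
print(sorted({key.split('.', 1)[0] for key in state_dict}))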