farewellthree / STAN

Official PyTorch implementation of the paper "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring"
Apache License 2.0

Possible bug in initialising pretrained model? #16

Closed FransHk closed 6 months ago

FransHk commented 7 months ago

Hi, I came across a possible bug in the code that loads a pretrained model and would like clarification. The constructor of the VITCLIPPretrained class (the superclass of VITCLIPPretrained_STAN) has a default value for 'pretrained_model', namely 'openai/clip-vit-base-patch32'.

from torch import nn
from transformers import CLIPConfig, CLIPModel

@MODELS.register_module()  # MODELS registry is imported elsewhere in the repo
class VITCLIPPretrained(nn.Module):
    def __init__(self, pretrained_model="openai/clip-vit-base-patch32", clip_weight=None, return_mean=True, 
                 patch_3d=False, return_all=False, **kwargs):
        super().__init__()
        print("==== CLIP Vision ====")
        if clip_weight:
            # a local checkpoint path takes precedence when given
            configuration = CLIPConfig.from_pretrained(clip_weight)
            clip_model = CLIPModel.from_pretrained(clip_weight, config=configuration)
        else:
            # otherwise fall back to 'pretrained_model', which defaults to patch32
            configuration = CLIPConfig.from_pretrained(pretrained_model)
            clip_model = CLIPModel.from_pretrained(pretrained_model, config=configuration)
However, in the child class the superclass is initialised with only super().__init__(clip_weight=clip_weight, **kwargs). Since 'pretrained_model' is not forwarded, the superclass always falls back to its default, 'openai/clip-vit-base-patch32'.

@MODELS.register_module()
class VITCLIPPretrained_STAN(VITCLIPPretrained):
    def __init__(self, depth=4, cls_residue=False, time_module="selfattn",
        pretrained_model="openai/clip-vit-base-patch32", clip_weight=None, all_patch=False,
        gradient_checkpointing=False, **kwargs): 
        print("VITCLIPPretrained_STAN called with pretrained model: {}".format(pretrained_model))

        # pass CLIP weights to VITCLIPPretrained; note that 'pretrained_model'
        # is NOT forwarded, so the superclass always uses its patch32 default
        super().__init__(clip_weight=clip_weight, **kwargs)

        if clip_weight:
            configuration = CLIPConfig.from_pretrained(clip_weight)
        else:
            # the subclass reads 'pretrained_model' here, unlike its superclass
            configuration = CLIPConfig.from_pretrained(pretrained_model)

Is this intended behaviour? Setting pretrained_model='openai/clip-vit-base-patch16' causes VITCLIPPretrained_STAN to load its config from the patch16 model, whereas its superclass loads its weights and config from the patch32 model.
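For reference, a minimal sketch of the fix I would expect (hypothetical, untested): forward 'pretrained_model' to the superclass so that both classes resolve the same checkpoint.

# Hypothetical fix sketch (untested): forward 'pretrained_model' so the
# superclass and the STAN subclass load the same CLIP checkpoint.
@MODELS.register_module()
class VITCLIPPretrained_STAN(VITCLIPPretrained):
    def __init__(self, depth=4, cls_residue=False, time_module="selfattn",
        pretrained_model="openai/clip-vit-base-patch32", clip_weight=None, all_patch=False,
        gradient_checkpointing=False, **kwargs):
        # forwarding pretrained_model keeps parent and child consistent
        super().__init__(pretrained_model=pretrained_model, clip_weight=clip_weight, **kwargs)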

farewellthree commented 6 months ago

Thanks. It looks like a bug. We used to load the CLIP weights from a local path, so we overlooked this problem.
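Until it is fixed, a workaround consistent with that local-loading setup is to pass 'clip_weight' explicitly, since it takes precedence over 'pretrained_model' in both classes. A hypothetical instantiation (the path is a placeholder):

# Hypothetical workaround (path is a placeholder): when clip_weight is set,
# both VITCLIPPretrained and VITCLIPPretrained_STAN load from it, so the
# 'pretrained_model' default never comes into play.
backbone = VITCLIPPretrained_STAN(
    depth=4,
    clip_weight="/path/to/local/clip-vit-base-patch16",
)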

FransHk commented 6 months ago

Yeah, that's what I thought. Thanks for clarifying.