Following the README (and BriVL-BUA-applications), some extra models are required: chinese-roberta-wwm-ext, used as a sub-model of the text encoder, and tf_efficientnet_b5_ns-6f26d0cf.pth, used as a sub-model of the image encoder.
However, in the ImgLearnableEncoder.init_param and TextLearnableEncoder.init_param functions, I noticed conditions that control whether some parameters of these backbones (the EfficientNet and chinese-roberta-wwm-ext mentioned above) have requires_grad set, i.e. whether those parameters are trainable (see the sketch below).
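To make sure we are talking about the same thing, here is a minimal sketch of the kind of requires_grad gating I mean. The function name, keyword list, and toy backbone are my own illustration, not copied from the repository:

```python
import torch.nn as nn


def init_param_sketch(backbone: nn.Module, trainable_keywords=("blocks.6", "conv_head")):
    # Only parameters whose names contain one of the keywords stay trainable;
    # everything else in the pretrained backbone is frozen.
    for name, param in backbone.named_parameters():
        param.requires_grad = any(kw in name for kw in trainable_keywords)


# Toy stand-in for the real backbone (in the actual code this would be the
# EfficientNet-B5 or chinese-roberta-wwm-ext loaded from the downloaded weights).
toy_backbone = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
init_param_sketch(toy_backbone, trainable_keywords=("1.",))
for name, p in toy_backbone.named_parameters():
    print(name, p.requires_grad)  # "1.*" stays trainable, "0.*" is frozen
```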
These two classes are then used for evaluation by the VL_model class.
This evaluation is what confuses me: if VL_model is TRAINABLE, then the downloaded official sub-models (EfficientNet and chinese-roberta-wwm-ext) are not sufficient on their own; their fine-tuned weights would be required. Is there something wrong, or does the released checkpoint already contain those fine-tuned backbone weights?
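For what it's worth, this is how I would check whether the released checkpoint already carries the fine-tuned backbone weights; the file name and key patterns below are guesses on my part, not taken from the repository:

```python
import torch

# Hypothetical sanity check: if the released checkpoint stores the fine-tuned
# backbone weights, its state_dict should contain keys for them.
ckpt = torch.load("brivl_checkpoint.pth", map_location="cpu")
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt.state_dict()

backbone_keys = [k for k in state if "backbone" in k or "roberta" in k]
print(f"{len(backbone_keys)} backbone-related keys found")
for k in backbone_keys[:10]:
    print(k, tuple(state[k].shape))
```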
I don't know whether I missed some details or misunderstood something.
Looking forward to your reply:)