e-bug / volta

[TACL 2021] Code and data for the framework in "Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs"
https://aclanthology.org/2021.tacl-1.58/
MIT License

ViLBERT pretrained weights not the same as official weights #15

Closed ivana-13 closed 2 years ago

ivana-13 commented 2 years ago

Hello, I wanted to ask if you pretrained the ViLBERT weights yourself using train_concap.py. If the answer is yes, then I guess there is some mistake. I compared your pretrained weights with the ones from the Facebook GitHub repo (link is here: https://github.com/facebookresearch/vilbert-multi-task, the weights are listed below "Visiolinguistic Pre-training and Multi Task Training") and the weights are not the same.

What did I do? I ran the following command without being interested in the output as such: python train_task.py --config_file config/vilbert_base.json --from_pretrained vilbert.bin --tasks_config_file config_tasks/vilbert_tasks.yml --task 16 --adam_epsilon 1e-6 --adam_betas 0.9 0.999 --adam_correct_bias --weight_decay 0.0001 --warmup_proportion 0.1 --clip_grad_norm 1.0 --output_dir checkpoints/foil/vilbert --logdir log/foil. I added the code below (the model = ... line is already there, just so you know where I added the printing):

model = BertForVLTasks.from_pretrained(args.from_pretrained, config=config, task_cfg=task_cfg, task_ids=[task])
zero_shot = True
# Print the weight and bias of one visual self-attention query projection
for name, param in model.named_parameters():
    if "bert.encoder.layer.12.attention_self.v_query.weight" in name:
        print(name)
        print(param)
    if "bert.encoder.layer.12.attention_self.v_query.bias" in name:
        print(name)
        print(param)

I did the same for the pretrained weights from the Facebook GitHub repo.

What was the output?

I understand weights can differ after pretraining for many reasons, but here I observed that the bias weights are always non-zero in your checkpoint and zero in the Facebook checkpoint. The difference appears in all the attention_self query, key and value layer biases: yours are non-zero while Facebook's are zero. I also tried different layer numbers (0, 14, ...) and I tried it for v_query and query, as well as key and value, and the result was always the same. Since both your train_concap and Facebook's start from bert_base_uncased, I guess there is some small detail wrong in your model. I did not find the reason for this, but it looks like something should be setting the biases to zero, plus some other small change in the model (which makes sense to me, since in the attention mechanism, when we create K, V and Q, we would only use weights and no bias). I even tried evaluating some tasks with these two pretrained checkpoints and they lead to different results, so something is probably different. Can you look at this? Thank you.
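For reference, this is roughly how I compare the two checkpoints directly, without instantiating the model. This is only a minimal sketch (not code from either repository), and the file names volta_vilbert.bin and facebook_vilbert.bin are placeholders for the two downloads:

import torch

# Placeholder paths for the two downloaded checkpoints.
# Note: some checkpoints wrap the weights under a key like "model" or
# "state_dict"; unwrap first if needed.
volta_sd = torch.load("volta_vilbert.bin", map_location="cpu")
fb_sd = torch.load("facebook_vilbert.bin", map_location="cpu")

def report_attention_biases(state_dict, label):
    # Report whether each attention query/key/value bias is exactly zero.
    for name, tensor in state_dict.items():
        is_qkv_bias = name.endswith(".bias") and any(
            k in name for k in ("query", "key", "value")
        )
        if "attention" in name and is_qkv_bias:
            print(f"{label} {name}: all-zero={bool(torch.all(tensor == 0))}")

report_attention_biases(volta_sd, "[volta]")
report_attention_biases(fb_sd, "[12-in-1]")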

e-bug commented 2 years ago

Hi, thanks for checking out the repo!

Yes, our weights are different from the ones in the 12-in-1 repository. We have indeed pretrained ViLBERT on Conceptual Captions through train_concap.py. Also, the layers have been re-defined to match the framework that we introduce in our TACL paper, so the 12-in-1 weights won't match our architecture. But you should be able to write a conversion script (see conversions/ for some examples).

As for the attention definition, ours (e.g. here) matches 12-in-1's (e.g. here) and transformers', from which models are initialised.
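In particular, in all of these the query/key/value projections are plain linear layers, which carry a bias term by default. A minimal sketch of that transformers-style definition (BERT-base sizes assumed):

import torch.nn as nn

hidden_size = 768  # BERT-base hidden size; all_head_size equals hidden_size here

# BERT-style self-attention projections are plain linear layers;
# nn.Linear carries a learnable bias unless bias=False is passed.
query = nn.Linear(hidden_size, hidden_size)
key = nn.Linear(hidden_size, hidden_size)
value = nn.Linear(hidden_size, hidden_size)

print(query.bias is not None)  # True: a bias term is present and trained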

ivana-13 commented 2 years ago

Hi, thanks for your answer.

As I said, I understand there will be small differences since the architecture in your paper is slightly different. So far I have found a difference in encoders.py: you use self.apply(self.init_weights) slightly differently from the 12-in-1 repository. When I changed that, the results using the pretrained weights from Facebook on your framework are much closer to the ones I get from Facebook's ViLBERT.

About the biases: I know the code is the same, yet their biases are zeros when I print them. I have not found an explanation for that. Maybe something is preventing the biases from being trained, but I don't see how (they didn't change requires_grad for them, so I don't know what else it could be).

ivana-13 commented 2 years ago

I can give more information about what I am doing and why I think something is not right. The Facebook repository also contains a vilbert_tasks.yml with information and data roots for the given tasks. One of them is task 16, a classification task on the FOIL dataset. It contains image-description pairs and a boolean feature called foil which states whether a word in the caption has been changed (foil = true, so the caption is not quite right) or not (false). I ran this task using the pretrained weights and model from Facebook in a zero-shot way, i.e. without any further training on the dataset, using the already pretrained classifier in the transformer. I got the following results: average accuracy 55.5, accuracy on the positive (matching) samples 98.8, accuracy on the negative (non-matching) samples 12.2. Doing the same with your model and your weights I get something like this: average accuracy 50.0, positive accuracy 0.17 and negative accuracy 99.84. Even if there were a mismatch in the positive/negative labels (which I don't think there is, since in both cases, for Conceptual Captions and FOIL, a matching pair is labeled 0 and a mismatching pair is labeled 1), there is a gap in classification performance. Again, I am not training anything, and yet I am getting very different results for the same task. It also does not sit right with me that the positive accuracy would be that low: FOIL captions have only one word changed in the image description, so it is usually hard for models to see that the pair is not a match. Here it looks like the model mostly predicts that the pair is not a match.
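For clarity, this is roughly the metric I am reporting. A minimal sketch only, assuming plain tensors of predictions and binary labels (0 = matching pair, 1 = foil); it is not the actual evaluation code from either repository:

import torch

def foil_accuracy_breakdown(preds, labels):
    """Average, positive-pair and negative-pair accuracy for binary FOIL labels.

    preds, labels: 1-D tensors with 0 = matching caption, 1 = foiled caption.
    """
    correct = (preds == labels).float()
    pos_mask = labels == 0  # matching image-caption pairs
    neg_mask = labels == 1  # foiled (mismatching) pairs
    return {
        "average": correct.mean().item(),
        "positive": correct[pos_mask].mean().item(),
        "negative": correct[neg_mask].mean().item(),
    }

# Toy example: mostly-correct positives, mostly-wrong negatives
preds = torch.tensor([0, 0, 0, 0, 1, 0, 0, 0])
labels = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
print(foil_accuracy_breakdown(preds, labels))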

e-bug commented 2 years ago

Could it be because their model was pretrained on 12 tasks and datasets rather than just on Conceptual Captions?

ivana-13 commented 2 years ago

No, that is not the problem, as the pretraining is not done on the 12 tasks. It took me a long time to figure out, but I am 99% sure that the difference is caused by a different dataset. More specifically, you used a different object detector than the one suggested in the ViLBERT repository, which is why the data I downloaded there performs badly with your weights. To be fair, you do many things differently in your code compared to the ViLBERT code, starting with a different tokenizer and continuing with a different initialization of weights. The zero biases in ViLBERT were actually not a problem: I realized only afterwards that it was caused by the different layer names in the two codebases.
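A simple way to see that only the naming differs is to diff the parameter names of the two checkpoints, roughly like this (the checkpoint file names are placeholders):

import torch

volta_keys = set(torch.load("volta_vilbert.bin", map_location="cpu").keys())
fb_keys = set(torch.load("facebook_vilbert.bin", map_location="cpu").keys())

# Parameters that appear in only one checkpoint point to renamed layers,
# not to missing or untrained weights.
print("only in VOLTA:", sorted(volta_keys - fb_keys)[:10])
print("only in 12-in-1:", sorted(fb_keys - volta_keys)[:10])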

e-bug commented 2 years ago

Thanks a lot! Yes, they indeed used a ResNeXt-152 backbone in 12-in-1, while we followed the original ViLBERT paper with ResNet-101.

Note that the two models are equivalent (including the tokenizer). It's just different naming for the layers so that they would reflect the framework in our paper. You should be able to convert their weights into VOLTA 1:1.
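Such a conversion essentially renames the state-dict keys one-to-one. A hedged sketch is below; the renaming rules shown are only illustrative guesses, and the actual mapping should be taken from the scripts under conversions/:

import torch

# Illustrative renaming rules only; consult conversions/ in the repo for the
# actual 12-in-1 -> VOLTA mapping.
RENAME_RULES = [
    ("attention.self.", "attention_self."),
    ("attention.output.", "attention_output."),
]

def convert_12in1_to_volta(src_path, dst_path):
    src = torch.load(src_path, map_location="cpu")
    dst = {}
    for name, tensor in src.items():
        new_name = name
        for old, new in RENAME_RULES:
            new_name = new_name.replace(old, new)
        dst[new_name] = tensor
    torch.save(dst, dst_path)

# convert_12in1_to_volta("facebook_vilbert.bin", "vilbert_converted.bin")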