e-bug / volta

[TACL 2021] Code and data for the framework in "Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs"
https://aclanthology.org/2021.tacl-1.58/
MIT License

Is it possible to provide a pretrained LXMERT model and fine-tune it with your code on RefCOCO? #6

Closed yonatanbitton closed 3 years ago

yonatanbitton commented 3 years ago

Hello. I have pretrained two LXMERT models using the official LXMERT GitHub repository, and I want to evaluate them on RefCOCO. I was wondering if it is possible to use your implementation to fine-tune them on the RefCOCO task?

I do not want to change the pre-training, just to compare these two models on RefCOCO. My models are stored as model_LXRT.pth files (same as the LXMERT implementation).

Thanks!

e-bug commented 3 years ago

Hi! Yes, you should be able to load LXMERT here and then fine-tune it on RefCOCO. To do so, you'd need to map the state dict of model_LXRT.pth onto the one used by this repo. You can have a look at our LXMERT checkpoint to see the corresponding layer names. Just note that I treat Transformer sub-layers as "layers" in this repo (see here).
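
For a quick first look, something like this (the checkpoint paths are just placeholders, and it assumes both files are plain state dicts) dumps the parameter names of both checkpoints so the two naming schemes can be compared:

import torch

# Placeholder paths: the official LXMERT checkpoint and a VOLTA checkpoint.
lxmert_sd = torch.load("model_LXRT.pth", map_location="cpu")
volta_sd = torch.load("ctrl_lxmert/pytorch_model.bin", map_location="cpu")

print(len(lxmert_sd), "parameters in the LXMERT checkpoint")
print(len(volta_sd), "parameters in the VOLTA checkpoint")

# Dump the sorted names so the two naming schemes can be eyeballed side by side.
for name in sorted(lxmert_sd):
    print("LXMERT:", name)
for name in sorted(volta_sd):
    print("VOLTA: ", name)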

If you write a script that maps the checkpoint from the official LXMERT repository onto VOLTA, do send a PR! :)

yonatanbitton commented 3 years ago

Hey. I am on it. Your repository is simple to use - great resource, thanks.

First, I tried passing my model directly as the pretrained checkpoint, and the training actually converged:

python train_task.py --bert_model bert-base-uncased --config_file config/ctrl_lxmert.json --from_pretrained /data/users/yonatab/lxmert/snap/pretrained/model_LXRT.pth --tasks_config_file config_tasks/ctrl_trainval_tasks.yml --task 10 --adam_epsilon 1e-6 --adam_betas 0.9 0.999 --adam_correct_bias --weight_decay 0.0001 --warmup_proportion 0.1 --clip_grad_norm 1.0 --output_dir checkpoints/refcoco+_unc/ctrl_lxmert_original_lxmert --logdir logs/refcoco+_unc 

In evaluation, I reach 71.27 with the CTRL LXMERT and 66.86 with the model loaded this way.

Now, I want to reduce the gap between the implementations.

I'm not sure how to perform the mapping.

I am looking at the LXMERT CTRL state dict (515 layers), your LXMERT checkpoint (517 layers), and the original LXMERT state dict (473 layers).

I understand that these changes are needed:

  1. remove the module. prefix
  2. replace attention.self with attention_self
  3. replace attention.output with attention_output

These changes reduce the number of differing layers to 408 and 450, but that doesn't seem to be enough, and I am not sure how to proceed.
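
This is roughly the check I'm running to count the remaining mismatches (the checkpoint paths are placeholders):

import torch

# Placeholder paths for the two checkpoints being compared.
lxmert_sd = torch.load("model_LXRT.pth", map_location="cpu")
volta_sd = torch.load("ctrl_lxmert/pytorch_model.bin", map_location="cpu")

def rename(key):
    # 1. drop the "module." prefix; 2./3. rename the attention sub-layers
    if key.startswith("module."):
        key = key[len("module."):]
    return key.replace("attention.self", "attention_self").replace("attention.output", "attention_output")

lxmert_keys = {rename(k) for k in lxmert_sd}
volta_keys = set(volta_sd)

print("only in LXMERT:", len(lxmert_keys - volta_keys))
print("only in VOLTA: ", len(volta_keys - lxmert_keys))
for k in sorted(lxmert_keys - volta_keys):
    print("LXMERT-only:", k)
for k in sorted(volta_keys - lxmert_keys):
    print("VOLTA-only: ", k)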

Do you have any pointers on this?

Thanks!

e-bug commented 3 years ago

Thanks!

Yes:

  1. is a by-product of parallel training (PyTorch's parallel wrappers add the module. prefix)
  2. this is also correct

One major difference I can recall from the original LXMERT is that we pretrained LXMERT CTRL using only MRC-KL as the visual loss, while our LXMERT checkpoint also has the weights for the XENT and regression losses (https://github.com/e-bug/volta/blob/main/config/lxmert.json#L19). The original authors, however, also had a VQA pretraining task, which we didn't include due to the pretraining data we used.

I'd recommend the (perhaps painful) process of matching them layer by layer to see where they diverge. Simply printing them via Python's zip(original_dict, volta_dict) might be a first step.
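
For example, something along these lines (the checkpoint paths are just placeholders) prints the keys side by side:

import torch

# Placeholder paths; load both state dicts on CPU.
original_dict = torch.load("model_LXRT.pth", map_location="cpu")
volta_dict = torch.load("ctrl_lxmert/pytorch_model.bin", map_location="cpu")

# zip pairs the keys by position, so this only gives a rough first view of
# where the two orderings start to drift apart.
for lxmert_key, volta_key in zip(original_dict, volta_dict):
    marker = "" if lxmert_key == volta_key else "   <-- differs"
    print(f"{lxmert_key:70s} {volta_key}{marker}")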

Keep me posted on how it goes!

yonatanbitton commented 3 years ago

Hey, I have a partial mapping.

First, without any mapping, just loading the model_LXRT.pth from LXMERT works (nice 👍). It reaches 66.862 on RefCOCO. With the proposed code below, it reaches 69.836. Your CTRL LXMERT reaches 71.268. Close enough for me.

In the utils.py file, adding the code below right after the line state_dict = torch.load(resolved_archive_file, map_location="cpu") partially maps the layers from the original LXMERT model.

It basically just removes the "module." prefix and maps attention.self to attention_self and attention.output to attention_output.

state_dict = torch.load(resolved_archive_file, map_location="cpu")
### CHANGING STATE DICT - LXMERT
if '.pth' in resolved_archive_file:
    print("*** CHANGING STATE DICT ***")
    import collections

    def remove_starting_module(x):
        # drop the "module." prefix added by parallel training
        return x[len("module."):] if x.startswith("module.") else x

    new_state_dict = collections.OrderedDict([(remove_starting_module(k), v) for k, v in state_dict.items()])
    # rename the attention sub-layers to match this repo's naming scheme
    new_state_dict = collections.OrderedDict([(k.replace('attention.self', 'attention_self'), v) for k, v in new_state_dict.items()])
    new_state_dict = collections.OrderedDict([(k.replace('attention.output', 'attention_output'), v) for k, v in new_state_dict.items()])
    state_dict = new_state_dict

Thanks for the help

e-bug commented 3 years ago

Nice, thanks!

Did you figure out which layers didn't match?

yonatanbitton commented 3 years ago

The names of the layers do not match (the intersection of layer names is 0). Some layers seem to be similar, for example the three I've changed. It is possible that further matching can be made.