dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

Question about ITM pretraining #52

Open EagleW opened 2 years ago

EagleW commented 2 years ago

Hi, @dandelin

I have some questions about ITM pretraining. How did you use the ITM loss and the WPA loss during pretraining? It seems that you handle them separately: https://github.com/dandelin/ViLT/blob/762fd3975c180db6fc88f577cf39549983fa373a/vilt/modules/vilt_utils.py#L127-L139

Why not simply add the two losses together and backpropagate the sum? https://github.com/dandelin/ViLT/blob/762fd3975c180db6fc88f577cf39549983fa373a/vilt/modules/objectives.py#L252-L272
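
To make the question concrete, here is a minimal sketch of the alternative I have in mind. The `itm_loss` / `wpa_loss` tensors below are just placeholders, not the actual variables in `objectives.py`; in the real code they would come from the ITM classification head and the word-patch alignment computation linked above.

```python
import torch

# Placeholder scalars standing in for the ITM and WPA loss terms.
itm_loss = torch.tensor(0.7, requires_grad=True)
wpa_loss = torch.tensor(0.3, requires_grad=True)

# What I am asking about: sum the two terms and call backward once,
# so the gradients of both losses are accumulated in a single pass.
total_loss = itm_loss + wpa_loss
total_loss.backward()

# The gradient that reaches each term is the same as if that loss had been
# backpropagated on its own, since d(total_loss)/d(itm_loss) = 1.
print(itm_loss.grad, wpa_loss.grad)  # tensor(1.) tensor(1.)
```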

I also have the same question as #48

Thank you!