e-bug / volta

[TACL 2021] Code and data for the framework in "Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs"
https://aclanthology.org/2021.tacl-1.58/
MIT License

visual location encoding in UNITER #18

Closed. PaulLerner closed this issue 2 years ago.

PaulLerner commented 2 years ago

Hi,

I noticed that you simply project the object location here https://github.com/e-bug/volta/blob/main/volta/embeddings.py#L495 and set the object location dimension to 5 here https://github.com/e-bug/volta/blob/main/config/ctrl_uniter_base.json#L16

How exactly do you represent the location of the object? Chen et al. say they use a 7-dimensional vector: [x_1, y_1, x_2, y_2, w, h, w * h] (normalized top/left/bottom/right coordinates, width, height, and area). They hard-code it: https://github.com/ChenRocks/UNITER/blob/master/model/model.py#L254
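
For concreteness, here is how I understand that 7-dimensional feature; this is just a sketch, and normalising by the image width/height is my assumption:

```python
import numpy as np

def location_feature_7d(x1, y1, x2, y2, img_w, img_h):
    """Sketch of the 7-d location vector from Chen et al.:
    [x1, y1, x2, y2, w, h, w*h], with coordinates normalised
    by the image width/height (my assumption)."""
    x1n, y1n = x1 / img_w, y1 / img_h
    x2n, y2n = x2 / img_w, y2 / img_h
    w, h = x2n - x1n, y2n - y1n  # normalised width and height
    return np.array([x1n, y1n, x2n, y2n, w, h, w * h], dtype=np.float32)
```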

Best,

Paul

e-bug commented 2 years ago

Hi!

In our controlled setup, we used 5-dimensional location features for all the models, to allow apples-to-apples comparisons.

The five features are the normalised top/left/bottom/right coordinates plus the area. You can see how they are computed here.
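
Roughly, the computation looks like this (a minimal sketch; the function name and normalisation details are illustrative, not the exact repo code):

```python
import numpy as np

def location_feature_5d(x1, y1, x2, y2, img_w, img_h):
    """Sketch of the 5-d location feature: normalised top/left/
    bottom/right coordinates plus area (box area / image area)."""
    x1n, y1n = x1 / img_w, y1 / img_h
    x2n, y2n = x2 / img_w, y2 / img_h
    area = (x2n - x1n) * (y2n - y1n)  # equals box area / image area
    return np.array([x1n, y1n, x2n, y2n, area], dtype=np.float32)
```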

PaulLerner commented 2 years ago

Very clear, thanks!