e-bug / volta

[TACL 2021] Code and data for the framework in "Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs"
https://aclanthology.org/2021.tacl-1.58/
MIT License

On the visual token added to linguistic tokens in VLBertEmbeddings class #10


iki-taichi commented 3 years ago

Hello.

I have a question about the VLBertEmbeddings class.

In its forward function, a global image feature is added to the linguistic tokens. The last token in the vision sequence is used as the global image feature, as below:

https://github.com/e-bug/volta/blob/9e5202141920600d58a9c5c17519ca453795d65d/volta/embeddings.py#L271

Using the last token seems reasonable for the original VL-BERT (vl-bert_base.json), whose add_global_imgfeat is "last", but I think it should be the first token for the controlled VL-BERT (ctrl_vl-bert_base.json), whose add_global_imgfeat is "first".
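For reference, here is a minimal, self-contained sketch of the two layouts (tensor names are illustrative, not from volta) showing why indexing the last position only works for the "last" setting:

```python
import torch

# With add_global_imgfeat == "last" the vision sequence is [r_1, ..., r_N, g],
# so position -1 holds the global feature g. With "first" it is
# [g, r_1, ..., r_N], so position -1 instead holds the region feature r_N.
batch, num_regions, hidden = 2, 4, 8
regions = torch.randn(batch, num_regions, hidden)
global_feat = torch.randn(batch, 1, hidden)

seq_last = torch.cat([regions, global_feat], dim=1)   # "last" layout
seq_first = torch.cat([global_feat, regions], dim=1)  # "first" layout

assert torch.equal(seq_last[:, -1], global_feat[:, 0])   # "last": index -1 is correct
assert torch.equal(seq_first[:, 0], global_feat[:, 0])   # "first": index 0 is needed
```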

Is there any reason that the last token is always used in this class?

I'm sorry if I have misunderstood how the embedding classes work.

Thanks.

e-bug commented 3 years ago

Hi Iki-san,

Thanks for pointing this out, you are right.

I don't think this impacts performance much, but I'll try to fix it. However, I cannot simply fix the code, as that would break the controlled VL-BERT model that we released.

So, I'll need to find some time and resources to pre-train the controlled VL-BERT again.

I'll keep this issue open until I do so.

If you are pre-training VL-BERT, go ahead and fix the indexing problem :)
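For anyone doing that, the change would look roughly like the following (a sketch with hypothetical names, not the exact volta code):

```python
import torch

# Hypothetical helper (a sketch, not the released code): select the global
# image feature according to the config instead of hard-coding the last slot.
def get_global_imgfeat(v_embeddings: torch.Tensor, add_global_imgfeat: str = "last") -> torch.Tensor:
    # v_embeddings: [batch, num_boxes, hidden]; add_global_imgfeat matches the
    # "first"/"last" setting in the model's JSON config.
    if add_global_imgfeat == "first":
        return v_embeddings[:, :1]   # global feature was prepended
    return v_embeddings[:, -1:]      # global feature was appended
```

Keeping the result as shape [batch, 1, hidden] lets it broadcast when added to the [batch, seq_len, hidden] linguistic embeddings, which is how the issue describes the global feature being mixed into the text tokens.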

iki-taichi commented 3 years ago

Thank you for your kind answer; I agree with you. Although I'm curious about the impact, given the cost of pre-training, I don't think the fix is urgent.

As for me, I'm not able to do the pre-training myself due to a lack of resources :_(