Closed: harukaza closed this issue 3 years ago
I'm confused about the embedding in your paper. Section 3.2.3 of the paper says that LXMERT separately encodes the image and the caption text in two streams. 1. Is the processed caption a sequence of words or of word embeddings?

A1: A sequence of word ids. The caption is processed by the tokenizer of LXMERT (which is identical to the BERT tokenizer). This tokenizer converts the caption into a sequence of word ids, which is (part of) the input to LXMERT. You can refer to https://huggingface.co/transformers/model_doc/lxmert.html?highlight=lxmert for more implementation details of LXMERT.

A2: Yes.
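For reference, here is a minimal sketch of how a caption can be turned into word ids with the Hugging Face `transformers` tokenizer for LXMERT. The `unc-nlp/lxmert-base-uncased` checkpoint name and the example caption are assumptions for illustration; substitute whatever checkpoint and data pipeline this repo actually uses.

```python
# Minimal sketch: converting a caption into word ids with the LXMERT (BERT-style) tokenizer.
# Assumes the Hugging Face `transformers` library and the public
# `unc-nlp/lxmert-base-uncased` checkpoint; adapt to this repo's actual setup.
from transformers import LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")

caption = "a man riding a horse on the beach"  # hypothetical example caption
encoding = tokenizer(caption, return_tensors="pt")

# `input_ids` is the sequence of (sub)word ids that forms the text-stream input to LXMERT.
print(encoding["input_ids"])
```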