Hello, I would like to check whether the shapes of my inputs and of the output I get are correct (I am using entry.py). For a single example whose tokenized sentence has length 30, input_ids, input_mask and segment_ids each have shape (1,30), and the image features (features and boxes) have shapes (1,36,2048) and (1,36,4), respectively. As the cross-modal output, I get a tensor of shape (1,768). Are these shapes correct? Also, how can I get the un-pooled outputs?
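To make the question concrete, here is a minimal sketch of the shapes described above and how they relate to each other. It only checks that the shapes are mutually consistent; it does not call entry.py or the model. The helper `check_shapes` and the dictionary layout are hypothetical names used for illustration, not part of the repository.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def check_shapes(shapes: Dict[str, Shape]) -> Tuple[int, int, int]:
    """Check that the input shapes agree with each other.

    Hypothetical helper for illustration only.
    Returns (batch_size, seq_len, num_boxes) if everything is consistent.
    """
    # The three text inputs must share one 2-D shape: (batch, seq_len).
    text_keys = ("input_ids", "input_mask", "segment_ids")
    text_shapes = {shapes[k] for k in text_keys}
    if len(text_shapes) != 1:
        raise ValueError(f"text inputs disagree: {sorted(text_shapes)}")
    (text_shape,) = text_shapes
    if len(text_shape) != 2:
        raise ValueError(f"expected (batch, seq_len), got {text_shape}")
    batch, seq_len = text_shape

    # Visual inputs: features is (batch, num_boxes, feat_dim),
    # boxes is (batch, num_boxes, 4).
    feats, boxes = shapes["features"], shapes["boxes"]
    if len(feats) != 3 or len(boxes) != 3:
        raise ValueError("features and boxes must both be 3-D")
    if feats[0] != batch or boxes[0] != batch:
        raise ValueError("batch size differs between text and image inputs")
    if feats[1] != boxes[1]:
        raise ValueError("features and boxes disagree on the number of boxes")
    if boxes[2] != 4:
        raise ValueError("each box needs 4 coordinates")
    return batch, seq_len, feats[1]


# Shapes from the question: one example, 30 tokens, 36 boxes.
example = {
    "input_ids": (1, 30),
    "input_mask": (1, 30),
    "segment_ids": (1, 30),
    "features": (1, 36, 2048),
    "boxes": (1, 36, 4),
}

print(check_shapes(example))  # (1, 30, 36)
```

With these inputs the batch size, sequence length and number of boxes are 1, 30 and 36, so the (1,768) output has one vector per example rather than one per token or per box.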
Best regards