LuoweiZhou / VLP

Vision-Language Pre-training for Image Captioning and Question Answering
Apache License 2.0

Dimension of Position Embedding #34

Open · HollowFire opened this issue 3 years ago

HollowFire commented 3 years ago

The original paper describes the position embedding as a 5-dimensional vector containing the coordinates of the top-left and bottom-right corners plus the relative area. But in the data preprocessing script (seq2seq_loader.py), the variable `vis_pe`, which I figure stands for the position embedding, has a dimension of 6. What causes the difference, and what is the extra value used for?

LuoweiZhou commented 3 years ago

@HollowFire The last dimension in `vis_pe` is the confidence score of the proposal: https://github.com/LuoweiZhou/detectron-vlp/blob/master/tools/extract_features.py#L266
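
For readers landing here, a minimal sketch of how such a 6-dim feature could be assembled, combining the paper's 5 geometric values with the detector score mentioned above. The function name `build_vis_pe` and the exact ordering/normalization are assumptions for illustration, not the repo's actual code (see the linked `extract_features.py` for that):

```python
import numpy as np

def build_vis_pe(box, score, img_w, img_h):
    """Assemble a 6-dim region position feature (hypothetical helper).

    box: (x1, y1, x2, y2) in pixels; score: detector confidence.
    Assumed layout: [x1, y1, x2, y2, relative_area, score],
    with coordinates normalized to [0, 1] by image size.
    """
    x1, y1, x2, y2 = box
    x1, x2 = x1 / img_w, x2 / img_w
    y1, y2 = y1 / img_h, y2 / img_h
    # Relative area = box area / image area (already normalized coords)
    rel_area = (x2 - x1) * (y2 - y1)
    return np.array([x1, y1, x2, y2, rel_area, score], dtype=np.float32)

# Example: a 100x50 box at (20, 30) in a 640x480 image, confidence 0.9
print(build_vis_pe((20, 30, 120, 80), 0.9, 640, 480))
# -> [0.03125 0.0625 0.1875 0.16667 0.01628 0.9]
```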