ChenRocks / UNITER

Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"
https://arxiv.org/abs/1909.11740

RefCOCO training / evaluation details #51

Closed j-min closed 3 years ago

j-min commented 3 years ago

Hello, I have some questions regarding RefCOCO/+/g training and evaluation details.

  1. Are you going to upload the RefCOCO/+/g training/evaluation code?
  2. Which boxes did you finetune UNITER on?
  3. Which boxes did you use for evaluation on the val, test, val^d, and test^d splits, respectively? Did you use the Mask R-CNN boxes from MAttNet?

[Table from the UNITER paper]

It seems the ViLBERT-MT authors finetuned their model on 100 BUTD boxes + Mask R-CNN boxes from MAttNet -> code. They then used the 100 BUTD boxes during evaluation -> code

I calculated oracle scores on the RefCOCOg val split, where a sample counts as correct if there exists a candidate box with IoU(candidate, target) > 0.5:

  - Mask R-CNN boxes from MAttNet -> 86.10%
  - MS COCO GT boxes -> 99.6%
  - ViLBERT-MT's 100 BUTD boxes on RefCOCOg -> 96.53%
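For reference, the oracle score above can be computed with a few lines of plain Python. This is a minimal sketch, assuming boxes in (x1, y1, x2, y2) format; the function names and data layout are illustrative, not taken from the UNITER or MAttNet codebases.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def oracle_score(samples, thresh=0.5):
    """Fraction of samples for which at least one candidate box
    overlaps the ground-truth target box with IoU > thresh.

    `samples` is a list of (candidate_boxes, target_box) pairs.
    """
    hits = sum(
        any(iou(cand, target) > thresh for cand in candidates)
        for candidates, target in samples
    )
    return hits / len(samples)
```

So a proposal set with better coverage of the ground-truth boxes yields a higher oracle ceiling, independent of the grounding model itself.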

Since the BUTD boxes have better coverage than the Mask R-CNN boxes from MAttNet, I don't think this is a fair comparison to MAttNet. It is also inconsistent with the ViLBERT-MT paper.

[Paragraph from the ViLBERT-MT paper]

The ViLBERT-MT authors compared ViLBERT-MT and UNITER on test^d. I wonder which boxes you used for UNITER finetuning and evaluation.

[Table from the ViLBERT-MT paper]

lichengunc commented 3 years ago

We finetuned on the ground-truth (COCO's) annotated boxes, whose features were extracted using BUTD, and ran inference on 1) the ground-truth boxes and 2) MAttNet's detected boxes.

j-min commented 3 years ago

Thank you for the clarification!