JialianW / GRiT

GRiT: A Generative Region-to-text Transformer for Object Understanding (https://arxiv.org/abs/2212.00280)

Poor result in Densecap Evaluation #7

Closed: Wykay closed this issue 1 year ago

Wykay commented 1 year ago

Hello,

I am trying to use the results produced by the provided densecap checkpoint to evaluate on VG. After replacing logprobs (confidence/score), boxes, and captions in addResult(), as well as idx_to_token and vocab_size in the model, in densecap/eval_utils.lua, I got an mAP of 0.000609. I found that the number of 'ok=1' cases is very small, meaning few ground-truth boxes are matched to predictions. It seems I have done something wrong.

I combined GRiT's box, description, and score predictions for each image and fed them into addResult() per image in densecap, but I got a relatively low mAP, and I found that the IoU between ground-truth and predicted boxes was very small. Could you please tell me what I am doing wrong? Thank you!

Here is the replacement code:

```lua
while true do  ------- single image -------
  counter = counter + 1

  -- Grab a batch of data and convert it to the right dtype (batch_size = 1)
  local loader_kwargs = {split=split, iterate=true}
  local img, gt_boxes, gt_labels, info, _ = loader:getBatch(loader_kwargs)
  info = info[1]

  -- Find the index of the corresponding predictions; image_id, box, score,
  -- and descriptions share the same index for a given image
  local index_
  for index, v in ipairs(my_results.image_id) do
    if tostring(v) == string.gsub(info.filename, '.jpg', '') then
      index_ = index
      break
    end
  end

  assert(string.gsub(info.filename, '.jpg', '') == tostring(my_results.image_id[index_]))

  -- Replace these with the predictions of the corresponding image from GRiT
  local boxes = torch.Tensor(my_results.box[index_])
  local logprobs = torch.Tensor(my_results.score[index_])
  local captions = my_results.descriptions[index_]
  -- gt_labels[1]: tensor of shape N x T; decode token ids to strings (bs = 1)
  local gt_captions = model.nets.language_model:decodeSequence(gt_labels[1])

  evaluator:addResult(logprobs, boxes, captions, gt_boxes[1], gt_captions)
```
JialianW commented 1 year ago

Densecap evaluates results in its own box coordinate system. For example, we modify the output boxes to adapt to its coordinates, as shown here: https://github.com/JialianW/GRiT/blob/39b33dbc0900e4be0458af14597fcb1a82d933bb/grit/evaluation/eval.py#L100

Did you save results that went through the above code? If not, please follow it to rescale the boxes.
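For reference, here is a minimal sketch of the kind of rescaling that step performs, assuming predictions come out in the model's resized-input coordinate system and the original image size is known. The function and variable names (`rescale_boxes`, `pred_boxes`, `resized_w`, etc.) are illustrative, not the repo's actual ones; see the linked eval.py line for the real code.

```python
# Hypothetical sketch: map boxes from the model's resized-input coordinates
# back to original image coordinates. Names are illustrative, not GRiT's.
def rescale_boxes(pred_boxes, resized_w, resized_h, orig_w, orig_h):
    scale_x = orig_w / resized_w
    scale_y = orig_h / resized_h
    out = []
    for x, y, w, h in pred_boxes:  # (lefttop_x, lefttop_y, w, h)
        out.append((x * scale_x, y * scale_y, w * scale_x, h * scale_y))
    return out
```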

Wykay commented 1 year ago


Yes, I ran the

```
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_DenseCap.yaml --output-dir-name ./output/grit_b_densecap --eval-only MODEL.WEIGHTS models/grit_b_densecap.pth
```

command to get the JSON predictions on VG, which have gone through process() to rescale the boxes.

So the boxes do go through that rescaling step.

Wykay commented 1 year ago

Perhaps the performance drop is caused by this?

JialianW commented 1 year ago

The densecap repo should not have evaluation issues. We successfully obtained our results by only adding a read function.

Perhaps you didn't get it right because of the box coordinate format. Our saved JSON is in (lefttop_x, lefttop_y, w, h), while densecap uses (center_x, center_y, w, h). Please make sure to convert to the densecap format and try again.

Apart from this, I think there are no other places that could make a difference in the results.
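For illustration, a minimal sketch of that conversion, assuming each box is a list of four numbers in top-left format; the helper name `topleft_to_center` is hypothetical and not part of either repo:

```python
# Hypothetical helper: convert GRiT's (lefttop_x, lefttop_y, w, h) boxes
# to densecap's (center_x, center_y, w, h) format before evaluation.
def topleft_to_center(boxes):
    return [(x + w / 2.0, y + h / 2.0, w, h) for x, y, w, h in boxes]
```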

Wykay commented 1 year ago

> The densecap repo should not have evaluation issues. We successfully obtained our results by only adding a read function.
>
> Perhaps you didn't get it right because of the box coordinate format. Our saved JSON is in (lefttop_x, lefttop_y, w, h), while densecap uses (center_x, center_y, w, h). Please make sure to convert to the densecap format and try again.
>
> Apart from this, I think there are no other places that could make a difference in the results.

Thank you so much!

MarziEd commented 1 year ago

@Wykay @JialianW Hi I have a similar issue as @Wykay had before! I appreciate your help In your evaluation code, are you using denscap dataloader to get the ground truth bounding boxes and captions? What is the format of the ground truth bounding boxes in GRIT test.json and train.json files? Is it (x_topleft, y_topleft,w,h) or (xc,yc,w,h) or (x1y1,x2,y2)?