Open nizhf opened 2 years ago
Hi @nizhf, thanks for your interest in our work! I think our `vidor_eval.ipynb` does support multi-label evaluation. We loop through all the predicted HOI triplets, and when there's a match, we append the specific `triplet_class` to `gt_bbox_pair_matched`. Note that the predicted HOI triplets may contain more than one triplet with the same subject and object.
I think what you append to `gt_bbox_pair_matched` is the index of the `gt_pair`. In `gt_bbox_pair_matched.add(max_gt_id)`, `max_gt_id` is set as `max_gt_id = k`, and `k` comes from the line `for k, gt_bbox_pair_id in enumerate(gt_bbox_pair_ids)`, so it is the index of the `gt_bbox_pair_id`, not a triplet.
Let me clarify: the idea of evaluation is:
for each predicted HOI triplet
    for each ground truth HOI triplet (k)
        if there's a match
            set is_match to True
            record k or update with the maximum overlapping object-pair boxes
    if there's a match
        add the matched, predicted HOI triplet into the true positive set
    else
        add into the false positive set
Since the ground-truth HOI triplets are multi-label, the predictions can match each of those labels as well.
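The loop described above could be sketched roughly as follows. This is a minimal illustration and not the notebook's code: the dict layout (`sub`, `obj`, `cls`), the `iou` helper, the 0.5 threshold, and the use of the smaller of the two box IoUs as the pair overlap are all assumptions made for the example. It also omits any bookkeeping of already-matched ground truths, which is the part discussed in this thread.

```python
def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_predictions(preds, gts, thresh=0.5):
    """Split predictions into true/false positives (hypothetical data layout)."""
    tp, fp, matches = [], [], []
    for i, pred in enumerate(preds):
        is_match, max_ov, max_gt_id = False, 0.0, None
        for k, gt in enumerate(gts):
            if gt["cls"] != pred["cls"]:
                continue
            # Assumed pair overlap: the smaller of subject IoU and object IoU.
            ov = min(iou(pred["sub"], gt["sub"]), iou(pred["obj"], gt["obj"]))
            if ov >= thresh and ov > max_ov:
                is_match, max_ov, max_gt_id = True, ov, k
        if is_match:
            tp.append(pred)
            matches.append((i, max_gt_id))
        else:
            fp.append(pred)
    return tp, fp, matches
```

With one ground-truth entry per (pair, interaction) combination, two predictions on the same boxes with different interaction classes each find their own ground truth in this sketch.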
Thank you for the detailed clarification.
What confuses me is `for each ground truth HOI triplet (k)`. In the evaluation script, this corresponds to `for k, gt_bbox_pair_id in enumerate(gt_bbox_pair_ids)`, with `gt_bbox_pair_ids = result['gt_bbox_pair_ids']`.
I checked the result JSON file; `gt_bbox_pair_ids` is, for example, `'gt_bbox_pair_ids': [[0, 1], [1, 0]]`. If I understand correctly, these are indices into `gt_boxes`. So maybe the loop is really only `for each ground-truth pair (k)`? The ground-truth HOI triplet label is obtained by `gt_rel_cls = result['gt_action_labels'][k][j]`. If there is a match, the ground-truth pair `k` is added to `gt_bbox_pair_matched`. This pair then cannot be matched to any other predicted triplet.
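To make the distinction between a ground-truth pair `k` and a ground-truth triplet `(k, j)` concrete, here is a small sketch that expands a result entry into its triplets. The keys `gt_bbox_pair_ids` and `gt_action_labels` are the ones quoted above; the helper name `list_gt_triplets` and the label values in the example are made up for illustration.

```python
def list_gt_triplets(result):
    """Expand each ground-truth pair k into one entry per positive action j."""
    triplets = []
    for k, (sub_id, obj_id) in enumerate(result["gt_bbox_pair_ids"]):
        for j, label in enumerate(result["gt_action_labels"][k]):
            if label == 1.0:
                # (pair index, subject box index, object box index, action index)
                triplets.append((k, sub_id, obj_id, j))
    return triplets


# Made-up example: pair 0 carries two interactions, pair 1 carries one.
example_result = {
    "gt_bbox_pair_ids": [[0, 1], [1, 0]],
    "gt_action_labels": [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}
print(list_gt_triplets(example_result))
```

Here two pairs yield three triplets, so an index `k` alone does not identify a single triplet.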
Just a detailed example:
Assume we have two predicted HOIs: `<human1, watch, obj2>` and `<human1, next_to, obj2>`. The `gt_bbox_pair_ids` is `[[0, 1]]`. The `gt_action_labels` has `1.0` for `watch` and `next_to`.
We first process prediction `<human1, watch, obj2>`. We have `k=0` and `j=index_of_watch`. Then we have `result['gt_action_labels'][k][j]=1.0`. This is a match, so we add `<human1, watch, obj2>` to `tp` and `k=0` to `gt_bbox_pair_matched`.
Then we process prediction `<human1, next_to, obj2>`. We have `k=0` and `j=index_of_next_to`. We also have `result['gt_action_labels'][k][j]=1.0`. There should be a match, but the check finds that `k=0` is already in `gt_bbox_pair_matched`, so `<human1, next_to, obj2>` is falsely added to `fp`.
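The scenario above can be reproduced with a toy sketch that compares two bookkeeping choices: remembering only the pair index `k`, or remembering the `(k, j)` combination. This is not the notebook's code. The function `evaluate`, the flag `key_by_class`, and the assumption that every prediction overlaps ground-truth pair `k=0` are all hypothetical simplifications.

```python
def evaluate(pred_classes, gt_action_labels, key_by_class):
    """Toy matcher: every prediction is assumed to overlap gt pair k=0."""
    matched, tp, fp = set(), [], []
    for j in pred_classes:
        k = 0
        # Bookkeeping key: the pair index alone, or the (pair, action) combination.
        key = (k, j) if key_by_class else k
        if gt_action_labels[k][j] == 1.0 and key not in matched:
            matched.add(key)
            tp.append(j)
        else:
            fp.append(j)
    return tp, fp


WATCH, NEXT_TO = 0, 1
gt_action_labels = [[1.0, 1.0]]  # one gt pair with both interactions
preds = [WATCH, NEXT_TO]

print(evaluate(preds, gt_action_labels, key_by_class=False))  # pair-level bookkeeping
print(evaluate(preds, gt_action_labels, key_by_class=True))   # (pair, action) bookkeeping
```

Under pair-level bookkeeping the second prediction ends up in `fp`, while keying on `(k, j)` lets both predictions count as true positives. Which of the two the notebook actually does is the question raised in this thread.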
I hope I described my understanding of the `vidor_eval.ipynb` script clearly.
Hi. I have a question about using the `vidor_eval.ipynb` script to generate mAP. The script seems to support only the single-label case? If a ground-truth human-object pair has multiple interactions, for example gt `<human1, (watch, next to), obj2>`, only `<human1, watch, obj2>` can be matched to a prediction. This gt pair `<human1, obj2>` is then added to `gt_bbox_pair_matched` and cannot be matched to other predictions. Thank you.