coldmanck / VidHOI

Official implementation of "ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos" (ACM ICMRW 2021)
https://dl.acm.org/doi/10.1145/3463944.3469097
Apache License 2.0

Evaluation only considering single-label? #9

Open nizhf opened 2 years ago

nizhf commented 2 years ago

Hi. I have a question about using the vidor_eval.ipynb script to generate mAP. The script seems to support only the single-label case. If a ground-truth human-object pair has multiple interactions, for example gt <human1, (watch, next to), obj2>, only one of them, say <human1, watch, obj2>, can be matched to a prediction. The gt pair <human1, obj2> is then added to gt_bbox_pair_matched and cannot be matched to any other prediction. Thank you.

coldmanck commented 2 years ago

Hi @nizhf, thanks for your interest in our work! I think our vidor_eval.ipynb does support multi-label evaluation. We loop through all the predicted HOI triplets, and when there's a match, we append the specific triplet_class to gt_bbox_pair_matched. Note that the predicted HOI triplets may contain more than one triplet with the same subject and object.

nizhf commented 2 years ago

I think what you append to gt_bbox_pair_matched is the index of the gt pair. In gt_bbox_pair_matched.add(max_gt_id), max_gt_id is set by max_gt_id = k, and k comes from the line for k, gt_bbox_pair_id in enumerate(gt_bbox_pair_ids). So k is the index of a gt_bbox_pair_id, not a triplet.

coldmanck commented 2 years ago

Let me clarify: the idea of evaluation is:

for each predicted HOI triplet
   for each ground truth HOI triplet (k)
      if there's a match
         set is_match to True
         record k or update with the maximum overlapping object-pair boxes
   if there's a match
      add the matched, predicted HOI triplet into the true positive set
   else
      add into the false positive set

Since the ground-truth HOI triplets are multi-label, several predictions for the same human-object pair can each match one of them.
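The loop above can be sketched in Python roughly as follows. This is a minimal illustration of the pseudocode only, not the code in vidor_eval.ipynb: the function names, the dictionary layout of a triplet, the (x1, y1, x2, y2) box format, and the 0.5 IoU threshold are all assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def split_predictions(preds, gts, iou_thresh=0.5):
    """Split predicted triplets into true and false positives.

    Each triplet is a dict with hypothetical keys "subject", "object"
    (boxes) and "action" (label). Ground truth is a flat list of
    triplets, so a multi-label pair appears once per action.
    """
    tp, fp = [], []
    for pred in preds:
        is_match = False
        max_ov, max_gt_id = 0.0, None
        for k, gt in enumerate(gts):
            if pred["action"] != gt["action"]:
                continue
            # Overlap of the pair is limited by the worse of its two boxes.
            ov = min(iou(pred["subject"], gt["subject"]),
                     iou(pred["object"], gt["object"]))
            if ov >= iou_thresh and ov > max_ov:
                is_match = True
                max_ov, max_gt_id = ov, k  # keep the best-overlapping gt
        (tp if is_match else fp).append((pred, max_gt_id))
    return tp, fp
```

In this sketch, two predictions on the same human-object pair with different actions each find their own ground-truth triplet, because the ground truth is enumerated per triplet rather than per pair.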

nizhf commented 2 years ago

Thank you for the detailed clarification.

What confuses me is the step "for each ground truth HOI triplet (k)". In the evaluation script, it corresponds to for k, gt_bbox_pair_id in enumerate(gt_bbox_pair_ids), where gt_bbox_pair_ids = result['gt_bbox_pair_ids'].

I checked the result JSON file; gt_bbox_pair_ids looks like, for example, 'gt_bbox_pair_ids': [[0, 1], [1, 0]]. If I understand correctly, these are indices into gt_boxes. So maybe the loop is really only "for each ground-truth pair (k)"? The ground-truth HOI label is obtained by gt_rel_cls = result['gt_action_labels'][k][j]. If there is a match, the ground-truth pair k is added to gt_bbox_pair_matched. This pair then cannot be matched to any other predicted triplet.

A detailed example: assume we have two predicted HOIs, <human1, watch, obj2> and <human1, next_to, obj2>. The gt_bbox_pair_ids is [[0, 1]]. The gt_action_labels has 1.0 for both watch and next_to. We first process the prediction <human1, watch, obj2>. We have k=0 and j=index_of_watch, so result['gt_action_labels'][k][j]=1.0. This is a match, so we add <human1, watch, obj2> to tp and k=0 to gt_bbox_pair_matched. Then we process the prediction <human1, next_to, obj2>. We have k=0 and j=index_of_next_to, and again result['gt_action_labels'][k][j]=1.0. This should be a match, but the check finds that k=0 is already in gt_bbox_pair_matched, so <human1, next_to, obj2> is falsely added to fp.

I hope I described my understanding of the vidor_eval.ipynb script clearly.
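The example above can be reproduced with a small toy matcher. This is a sketch of the behaviour as I understand it, not the notebook's actual code: toy_evaluate and key_by_action are hypothetical names, the action indices are made up, and box IoU is skipped (each prediction is assumed to localize its pair perfectly). Only gt_bbox_pair_ids, gt_action_labels, gt_bbox_pair_matched and max_gt_id mirror names from the script.

```python
WATCH, NEXT_TO = 0, 1  # hypothetical action indices


def toy_evaluate(predictions, gt_bbox_pair_ids, gt_action_labels,
                 key_by_action):
    """Toy matcher. predictions are (pair_index, action_index) tuples.

    key_by_action=False records only the pair index k as matched
    (the behaviour described in this issue); key_by_action=True
    records (k, action_index), so each label of a pair can be
    matched once.
    """
    gt_bbox_pair_matched = set()
    tp, fp = [], []
    for pair_idx, j in predictions:
        max_gt_id = None
        for k, _gt_bbox_pair_id in enumerate(gt_bbox_pair_ids):
            if k == pair_idx and gt_action_labels[k][j] == 1.0:
                max_gt_id = k
        key = (max_gt_id, j) if key_by_action else max_gt_id
        if max_gt_id is not None and key not in gt_bbox_pair_matched:
            gt_bbox_pair_matched.add(key)
            tp.append((pair_idx, j))
        else:
            fp.append((pair_idx, j))
    return tp, fp


gt_bbox_pair_ids = [[0, 1]]
gt_action_labels = [[1.0, 1.0]]  # the pair is labelled watch AND next_to
predictions = [(0, WATCH), (0, NEXT_TO)]

per_pair = toy_evaluate(predictions, gt_bbox_pair_ids,
                        gt_action_labels, key_by_action=False)
per_triplet = toy_evaluate(predictions, gt_bbox_pair_ids,
                           gt_action_labels, key_by_action=True)
```

With per-pair bookkeeping, per_pair puts the next_to prediction in fp even though the ground truth contains it; with per-triplet bookkeeping, per_triplet counts both predictions as true positives. Keying the matched set by (pair, action) is one possible fix, offered here only as a suggestion.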