Relationship evaluation and reported paper results

kochsebastian commented 1 year ago

I have a question regarding your evaluation code for the relationship metric, specifically how you handle edges/relationships whose ground truth (GT) is None.

I am not sure if this snippet in your code is entirely correct: https://github.com/ShunChengWu/3DSSG/blob/master/utils/util_eva.py#L159-L174

if len(gt_r) == 0:
    # Ground truth is None
    indices = torch.where(sorted_conf_matrix < threshold)[0]
    if len(indices) == 0:
        index = maxk+1
    else:
        index = sorted(indices)[0].item()+1
    temp_topk.append(index)
for predicate in gt_r: # for the multi rel case
    gt_conf = conf_matrix[gt_s, gt_t, predicate]
    indices = torch.where(sorted_conf_matrix == gt_conf)[0]
    if len(indices) == 0:
        index = maxk+1
    else:
        index = sorted(indices)[0].item()+1
    temp_topk.append(index)

So if the GT edge/predicate is None, i.e. gt_r is empty, you run a separate evaluation that only checks whether your top predictions are below the threshold. However, this means you are not evaluating whether the object nodes are correct; I guess you only evaluate that no predicate is predicted. That is not really in the spirit of the relationship metric, right? It might produce better results than the method can actually deliver. I think that to evaluate the triplet correctly, you should still check whether the object nodes are predicted correctly.
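To make the concern concrete, here is a minimal pure-Python sketch of what the `len(gt_r) == 0` branch appears to compute. The function name `rank_for_none_edge` and the toy scores are hypothetical, and the sketch assumes that `sorted_conf_matrix` holds the combined triplet scores sorted in descending order. Note that the function takes no ground-truth subject or object class at all:

```python
def rank_for_none_edge(sorted_scores, threshold, maxk):
    """Return the 1-based rank given to an edge whose ground truth is None.

    sorted_scores: combined triplet scores in descending order (assumption).
    The rank is the position of the first score below the threshold,
    or maxk + 1 if no score falls below it.
    """
    for position, score in enumerate(sorted_scores, start=1):
        if score < threshold:
            return position
    return maxk + 1


# Top score already below the threshold: counted as a top-1 hit.
print(rank_for_none_edge([0.4, 0.3, 0.1], threshold=0.5, maxk=3))  # 1
# Only the third score is below the threshold.
print(rank_for_none_edge([0.9, 0.6, 0.2], threshold=0.5, maxk=3))  # 3
# No score below the threshold: counted as a miss.
print(rank_for_none_edge([0.9, 0.8, 0.7], threshold=0.5, maxk=3))  # 4
```

The result depends only on the scores and the threshold, so a wrong subject or object class cannot turn a hit into a miss.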

Am I missing something in your evaluation that justifies this procedure, or is it indeed slightly incorrect?

ShunChengWu commented 1 year ago

There are three parts to the evaluation: the relationship triplet, the object, and the predicate alone. For the triplet, we consider both nodes as well as the predicate.

kochsebastian commented 1 year ago

Yes, this is clear to me.

However, this snippet is from the relationship triplet evaluation. When the GT predicate is not None, you evaluate the triplet (subject, object, predicate), which is correct in my opinion.

But when there is no GT predicate, you only compare the scores against the threshold. I believe this simplifies the evaluation quite a lot, because for this edge in the graph you only check whether the combined scores are below a threshold.
To me, this does not evaluate the triplet (subject, object, None) but rather (any subject, any object, None).
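The difference between the two readings can be sketched as follows. This is only an illustration under stated assumptions: the combined score is taken to be the product of subject, object and predicate probabilities, and `strict_none_rank` is one hypothetical way to also require correct nodes, not something taken from the repository or from either paper. All names and numbers are made up.

```python
from itertools import product


def triplet_scores(subj_probs, obj_probs, pred_probs):
    """Combined score of every (subject, object, predicate) triplet,
    assumed here to be the product of the three probabilities."""
    return [
        subj_probs[s] * obj_probs[o] * pred_probs[p]
        for s, o, p in product(
            range(len(subj_probs)), range(len(obj_probs)), range(len(pred_probs))
        )
    ]


def lenient_none_rank(scores, threshold, maxk):
    """(any subject, any object, None): only the threshold check."""
    ranked = sorted(scores, reverse=True)[:maxk]
    for position, score in enumerate(ranked, start=1):
        if score < threshold:
            return position
    return maxk + 1


def strict_none_rank(scores, subj_probs, obj_probs, gt_s, gt_t, threshold, maxk):
    """(subject, object, None): hypothetical variant that counts a miss
    when the most likely subject or object class is wrong."""
    pred_s = max(range(len(subj_probs)), key=subj_probs.__getitem__)
    pred_t = max(range(len(obj_probs)), key=obj_probs.__getitem__)
    if (pred_s, pred_t) != (gt_s, gt_t):
        return maxk + 1
    return lenient_none_rank(scores, threshold, maxk)


subj_probs = [0.6, 0.4]  # most likely subject class: 0
obj_probs = [0.3, 0.7]   # most likely object class: 1
pred_probs = [0.2, 0.1]  # all predicate scores are low
scores = triplet_scores(subj_probs, obj_probs, pred_probs)

# GT subject class is 1, so the subject prediction is wrong.
print(lenient_none_rank(scores, threshold=0.5, maxk=3))                            # 1
print(strict_none_rank(scores, subj_probs, obj_probs, 1, 1, threshold=0.5, maxk=3))  # 4
```

With a wrong subject class, the threshold-only check still reports a top-1 hit, while the stricter variant reports a miss.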

Is it clearer now what I mean? Do you agree, or am I missing something?

ShunChengWu commented 1 year ago

Yes, you are right. That indeed simplified the evaluation a lot. We followed the same metric as the previous paper (3DSSG). Personally, I think that is not the best metric to measure performance.

kochsebastian commented 1 year ago

Okay, thank you for the clarification.

ShunChengWu / 3DSSG

Relationship evaluation and reported paper results #35