Closed hxssgaa closed 4 years ago
Hello @hxssgaa ,
Thanks for pointing out this issue. We realized this at our end as well and there is a pending PR (https://github.com/facebookresearch/simmc/pull/23) that fixes this issue.
Note that our evaluation is offline and per-round, i.e., the goal is to predict the right action and its attributes given the gold history for the current round. Aggregating by dialog might not truly reflect this because there is a good deal of diversity in dialog length, in how often actions have multiple attributes, etc. Let me know your thoughts.
Thanks for the answer. I think it's fine as long as you take the diversity of dialog lengths into account. I will close this issue.
Hi, I think there is a bug in the SubTask #1 evaluation script, in the `evaluate_action_prediction` function in `action_evaluation.py`; please have a look at the attribute-matching loop there. When evaluating on the furniture dataset, if the action does not match, the code loops over all the gold attribute keys, including the ignored attributes, and appends `False` to `matches["attributes"]`. Since those attribute keys are already skipped when the action matches, I think the loop in the action-mismatch case should also `continue` when `key in IGNORE_ATTRIBUTES` is satisfied.
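For concreteness, here is a minimal sketch of what I mean, assuming the loop roughly looks like this (variable names and the contents of `IGNORE_ATTRIBUTES` are illustrative, not copied from `action_evaluation.py`):

```python
# Illustrative sketch only; names are paraphrased, not the exact code from
# action_evaluation.py. The point is the extra `continue` in the
# action-mismatch branch so ignored keys are excluded from both branches.

IGNORE_ATTRIBUTES = ["minPrice", "maxPrice"]  # placeholder; the real list is defined in action_evaluation.py

def score_round(gt_action, pred_action, gt_attributes, pred_attributes, matches):
    if gt_action == pred_action:
        matches["action"].append(True)
        for key, gt_value in gt_attributes.items():
            if key in IGNORE_ATTRIBUTES:
                continue  # ignored keys are skipped when the action matches...
            matches["attributes"].append(pred_attributes.get(key) == gt_value)
    else:
        matches["action"].append(False)
        for key in gt_attributes:
            if key in IGNORE_ATTRIBUTES:
                continue  # ...so they should also be skipped here (the suggested fix)
            matches["attributes"].append(False)
```

Without that `continue`, the denominator of the attribute accuracy includes ignored keys only for rounds where the action was predicted wrong, which skews the metric.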
Another thing I want to point out: is the attribute accuracy in SubTask #1 really meaningful when measured per attribute rather than per conversation? For example, take two conversations: one has 1 action and 7 attributes, the other has 1 action and 1 attribute. Suppose the model predicts the action and one of the attributes of the first conversation correctly, and the action and the attribute of the second conversation correctly. Under the current evaluation, attribute accuracy is 2 / 8 = 0.25. With per-conversation evaluation it would be ((1/7) + 1) / 2 ≈ 0.57, which makes more sense, because the current method favors conversations with more attributes. May I ask whether you would add a second evaluation metric computed at the per-conversation level? Thanks.
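To make the two aggregation schemes concrete, here is a small sketch using the example numbers above (the function names are mine, not from the SIMMC code):

```python
# Micro (per-attribute) vs. macro (per-conversation) attribute accuracy.

def per_attribute_accuracy(per_dialog_matches):
    """Micro average: pool attribute matches across all dialogs."""
    flat = [m for matches in per_dialog_matches for m in matches]
    return sum(flat) / len(flat)

def per_conversation_accuracy(per_dialog_matches):
    """Macro average: score each dialog first, then average over dialogs."""
    per_dialog = [sum(matches) / len(matches) for matches in per_dialog_matches]
    return sum(per_dialog) / len(per_dialog)

# Dialog 1: 7 attributes, 1 correct; dialog 2: 1 attribute, 1 correct.
matches = [[True] + [False] * 6, [True]]
print(per_attribute_accuracy(matches))     # 2 / 8  = 0.25
print(per_conversation_accuracy(matches))  # ((1/7) + 1) / 2 ≈ 0.57
```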