facebookresearch / simmc

With the aim of building next generation virtual assistants that can handle multimodal inputs and perform multimodal actions, we introduce two new datasets (both in the virtual shopping domain), the annotation schema, the core technical tasks, and the baseline models. The code for the baselines and the datasets will be open-sourced.

Possible bugs in evaluation script in SubTask #1 #24

Closed: hxssgaa closed this issue 4 years ago

hxssgaa commented 4 years ago

Hi, I think there is a bug in the evaluation script for SubTask #1, in the evaluate_action_prediction function in action_evaluation.py. Please have a look at the following code:

def evaluate_action_prediction(gt_actions, model_actions):
    ....
    # Case 1: Action mismatch -- record False for all attributes.
    if not action_match:
        for _ in supervision.keys():
            matches["attributes"].append(False)
    # Case 2: Action matches -- use model predictions for attributes.
    else:
        for key in supervision.keys():
            if key in IGNORE_ATTRIBUTES:
                continue
            gt_key_vals = supervision[key]
            model_key_vals = round_datum["attributes"][key]
    .....

When evaluating on the furniture dataset, if the action does not match, this loop iterates over all gold attribute keys, including the ignored ones, and appends False to matches["attributes"]. Since those keys are skipped when the action does match, I think the mismatch branch should also continue when key in IGNORE_ATTRIBUTES, so that both branches score the same set of attributes. A sketch of the change is below.
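Here is a minimal sketch of what I mean, reusing the names from the snippet above (not a tested patch, just the idea):

    # Case 1: Action mismatch -- record False only for the attributes
    # that are actually scored, i.e., skip the ignored keys here as well.
    if not action_match:
        for key in supervision.keys():
            if key in IGNORE_ATTRIBUTES:
                continue
            matches["attributes"].append(False)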

Another thing I want to point out: is the attribute accuracy in SubTask #1 really meaningful when measured per attribute rather than per conversation? For example, take two conversations: one has 1 action and 7 attributes, the other has 1 action and 1 attribute. Assume the model predicts the action and one of the seven attributes correctly in the first conversation, and both the action and the single attribute correctly in the second. Under the current evaluation, the attribute accuracy is 2 / 8 = 0.25. With per-conversation evaluation it would be ((1/7) + 1) / 2 ≈ 0.57, which makes more sense to me, because the current method weights conversations with more attributes more heavily (see the small illustration below). May I ask whether you would add a second evaluation metric computed at the per-conversation level? Thanks.
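A toy illustration of the two aggregation schemes on the example above (the variable names are mine, not from action_evaluation.py):

    # Per-dialog lists of attribute matches for the two example conversations.
    per_dialog_matches = [
        [True, False, False, False, False, False, False],  # dialog 1: 1 of 7 correct
        [True],                                             # dialog 2: 1 of 1 correct
    ]

    # Current metric: micro-average over all attribute predictions.
    flat = [m for dialog in per_dialog_matches for m in dialog]
    micro_acc = sum(flat) / len(flat)  # 2 / 8 = 0.25

    # Suggested metric: macro-average, one accuracy per dialog.
    macro_acc = sum(
        sum(dialog) / len(dialog) for dialog in per_dialog_matches
    ) / len(per_dialog_matches)  # (1/7 + 1) / 2 ≈ 0.57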

satwikkottur commented 4 years ago

Hello @hxssgaa ,

Thanks for pointing out this issue. We realized this on our end as well, and there is a pending PR (https://github.com/facebookresearch/simmc/pull/23) that fixes it.

Note that our evaluation is offline and per-round, i.e., the model predicts the right action and its attributes given the golden history for the current round. Aggregating by dialog might not truly reflect this, because there is considerable diversity in dialog length, the frequency of actions with multiple attributes, etc. Let me know your thoughts.

hxssgaa commented 4 years ago

Thanks for the answer. I think it's fine as long as the diversity across dialog lengths is taken into account. I will close this issue.