Closed hxssgaa closed 4 years ago
Hello @hxssgaa ,
Thanks for pointing out this issue. We realized this at our end as well and there is a pending PR (https://github.com/facebookresearch/simmc/pull/23) that fixes this issue.
Note that our evaluation is offline and per-round, i.e., the goal is to predict the right action and its attributes given the gold history for the current round. Aggregating by dialog might not truly reflect this because there is a good deal of diversity in dialog length, in how often actions have multiple attributes, etc. Let me know your thoughts.
Thanks for the answer. I think it's fine as long as you take the diversity of dialog lengths into account. I will close this issue.
Hi, I think there is a bug in the SubTask #1 evaluation script, in the `evaluate_action_prediction` function in `action_evaluation.py`; please have a look at the attribute-matching loop there. When evaluating on the furniture dataset, if the action does not match, the code loops over all the gold attribute keys, including the ignored attributes, and appends `False` to `matches["attributes"]`. Since those attribute keys are already skipped when the action matches, I think the loop in the action-mismatch case should also `continue` when `key in IGNORE_ATTRIBUTES` is satisfied.
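For concreteness, here is a minimal sketch of what I mean, assuming the loop roughly looks like this (variable names and the contents of `IGNORE_ATTRIBUTES` are illustrative, not copied from `action_evaluation.py`):

```python
# Illustrative sketch only; names are paraphrased, not the exact code from
# action_evaluation.py. The point is the extra `continue` in the
# action-mismatch branch so ignored keys are excluded from both branches.

IGNORE_ATTRIBUTES = ["minPrice", "maxPrice"]  # placeholder; the real list is defined in action_evaluation.py

def score_round(gt_action, pred_action, gt_attributes, pred_attributes, matches):
    if gt_action == pred_action:
        matches["action"].append(True)
        for key, gt_value in gt_attributes.items():
            if key in IGNORE_ATTRIBUTES:
                continue  # ignored keys are skipped when the action matches...
            matches["attributes"].append(pred_attributes.get(key) == gt_value)
    else:
        matches["action"].append(False)
        for key in gt_attributes:
            if key in IGNORE_ATTRIBUTES:
                continue  # ...so they should also be skipped here (the suggested fix)
            matches["attributes"].append(False)
```

Without that `continue`, the denominator of the attribute accuracy includes ignored keys only for rounds where the action was predicted wrong, which skews the metric.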
Another thing I want to point out: is the attribute accuracy in SubTask #1 really meaningful when measured per attribute rather than per conversation? For example, take two conversations: one has 1 action and 7 attributes, the other has 1 action and 1 attribute. Suppose the model predicts the action and one of the attributes of the first conversation correctly, and the action and the attribute of the second conversation correctly. Under the current evaluation, attribute accuracy is 2 / 8 = 0.25. With per-conversation evaluation it would be ((1/7) + 1) / 2 ≈ 0.57, which makes more sense, because the current method favors conversations with more attributes. May I ask whether you would add a second evaluation metric computed at the per-conversation level? Thanks.
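To make the two aggregation schemes concrete, here is a small sketch using the example numbers above (the function names are mine, not from the SIMMC code):

```python
# Micro (per-attribute) vs. macro (per-conversation) attribute accuracy.

def per_attribute_accuracy(per_dialog_matches):
    """Micro average: pool attribute matches across all dialogs."""
    flat = [m for matches in per_dialog_matches for m in matches]
    return sum(flat) / len(flat)

def per_conversation_accuracy(per_dialog_matches):
    """Macro average: score each dialog first, then average over dialogs."""
    per_dialog = [sum(matches) / len(matches) for matches in per_dialog_matches]
    return sum(per_dialog) / len(per_dialog)

# Dialog 1: 7 attributes, 1 correct; dialog 2: 1 attribute, 1 correct.
matches = [[True] + [False] * 6, [True]]
print(per_attribute_accuracy(matches))     # 2 / 8  = 0.25
print(per_conversation_accuracy(matches))  # ((1/7) + 1) / 2 ≈ 0.57
```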