Closed sjlee2016 closed 4 years ago
Hello @sjlee2016 ,
The private file provided for devtest in the teststd format (`fashion_devtest_dials_api_calls_teststd_format_private.json`) contains various rounds sampled for evaluation, and not simply the last round. Could you please run the action evaluation script with the flag `--action_json_path` pointing to this file?
Feel free to get back if this does not address your concern.
Hi. I've noticed that attribute accuracy for action prediction is very low for the fashion baseline model. I know that a new parameter, `single_round_eval`, was added to the updated evaluation script for task 1 and 2 (mm_action_prediction):
if single_round_eval and round_id != num_gt_rounds - 1: continue
When `single_round_eval` is True, the script only evaluates the last round of each dialog. But the API in the last round of most dialogs is "None" or "AddToCart", which does not have any attributes, so `supervision = gt_datum["action_supervision"]` is None most of the time. I counted the number of times the evaluation skips a round because supervision is None, and it was 973 times for the fashion domain on devtest. Hence, for the fashion domain, only 982 - 973 = 9 rounds are actually being evaluated. I believe this is the reason why attribute_accuracy is very low with the updated evaluation script. I want to check whether this is how it is supposed to be or whether it needs to be fixed.
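For illustration, the counting described above can be sketched roughly as follows. This is not the actual evaluation script: the function name `count_evaluated_rounds`, the list-of-rounds data layout, and the toy dialogs (including the `attributes` key) are assumptions made up for this example. Only `single_round_eval`, `round_id`, `num_gt_rounds`, and `gt_datum["action_supervision"]` come from the code quoted in this post.

```python
from typing import Any, Dict, List


def count_evaluated_rounds(
    dialogs: List[List[Dict[str, Any]]], single_round_eval: bool
) -> Dict[str, int]:
    """Count rounds that reach attribute scoring vs. rounds that are skipped.

    Hypothetical helper: each dialog is assumed to be a list of ground-truth
    rounds, each round a dict with an "action_supervision" entry.
    """
    counts = {"considered": 0, "skipped_no_supervision": 0, "scored": 0}
    for rounds in dialogs:
        num_gt_rounds = len(rounds)
        for round_id, gt_datum in enumerate(rounds):
            # Same filter as quoted above: keep only the last round.
            if single_round_eval and round_id != num_gt_rounds - 1:
                continue
            counts["considered"] += 1
            supervision = gt_datum["action_supervision"]
            if supervision is None:
                # No attributes to compare, so the round adds nothing
                # to attribute accuracy.
                counts["skipped_no_supervision"] += 1
                continue
            counts["scored"] += 1
    return counts


# Toy data (made up): two of the three dialogs end in a round
# that carries no attribute supervision.
toy_dialogs = [
    [
        {"action": "SpecifyInfo", "action_supervision": {"attributes": ["price"]}},
        {"action": "AddToCart", "action_supervision": None},
    ],
    [
        {"action": "SpecifyInfo", "action_supervision": {"attributes": ["color"]}},
        {"action": "None", "action_supervision": None},
    ],
    [
        {"action": "SpecifyInfo", "action_supervision": {"attributes": ["size"]}},
    ],
]

last_only = count_evaluated_rounds(toy_dialogs, single_round_eval=True)
all_rounds = count_evaluated_rounds(toy_dialogs, single_round_eval=False)
print(last_only)   # {'considered': 3, 'skipped_no_supervision': 2, 'scored': 1}
print(all_rounds)  # {'considered': 5, 'skipped_no_supervision': 2, 'scored': 3}
```

On the toy data, last-round-only evaluation scores a single round, while evaluating all rounds scores three. This is the same effect as the 982 - 973 = 9 count reported above, just at a much smaller scale.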