facebookresearch / simmc

With the aim of building next-generation virtual assistants that can handle multimodal inputs and perform multimodal actions, we introduce two new datasets (both in the virtual shopping domain), the annotation schema, the core technical tasks, and the baseline models. The code for the baselines and the datasets will be open-sourced.

Question about the new evaluation method for Task 1&2 #43

Closed sjlee2016 closed 4 years ago

sjlee2016 commented 4 years ago

Hi. I've noticed that attribute accuracy for action prediction is very low for the fashion baseline model. I know that a new parameter, `single_round_eval`, was added to the updated evaluation script for Tasks 1 and 2 (mm_action_prediction):

`if single_round_eval and round_id != num_gt_rounds - 1: continue`

When `single_round_eval` is True, the script evaluates only the last round of each dialog. However, the API in the last round of most dialogs is "None" or "AddToCart", neither of which has any attributes, so `supervision` ends up being None most of the time:

```
supervision = gt_datum["action_supervision"]
if supervision is not None and "args" in supervision:
    supervision = supervision["args"]
if supervision is None:
    skipped += 1
    continue
```

I counted how many times the evaluation skips a round because supervision is None: 973 times for the fashion domain on devtest. Hence, for the fashion domain, only 982 - 973 = 9 rounds are actually evaluated. I believe this is why attribute_accuracy is very low with the updated evaluation script. Is this the intended behavior, or does it need to be fixed?
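To make the effect concrete, here is a minimal sketch that reproduces the skipping logic quoted above on toy data. It is not the repository's evaluation script: the function name `count_evaluated_rounds`, the list-of-rounds layout, and the contents of `"args"` are assumptions made purely for illustration. Only the two conditions (`single_round_eval` with the last-round check, and the `supervision is None` skip) are taken from the snippets in this issue.

```python
from typing import Any, Dict, List


def count_evaluated_rounds(
    dialogs: List[List[Dict[str, Any]]], single_round_eval: bool
) -> Dict[str, int]:
    """Count rounds that reach attribute scoring vs. rounds that are skipped.

    Hypothetical helper: each dialog is assumed to be a list of ground-truth
    rounds, each round a dict with an "action_supervision" entry.
    """
    evaluated = 0
    skipped = 0
    for rounds in dialogs:
        num_gt_rounds = len(rounds)
        for round_id, gt_datum in enumerate(rounds):
            # Same condition as in the issue: keep only the last round.
            if single_round_eval and round_id != num_gt_rounds - 1:
                continue
            supervision = gt_datum["action_supervision"]
            if supervision is not None and "args" in supervision:
                supervision = supervision["args"]
            if supervision is None:
                # No attributes to compare against, so the round is skipped.
                skipped += 1
                continue
            evaluated += 1
    return {"evaluated": evaluated, "skipped": skipped}


# Toy dialogs (made-up data): both end with an attribute-free action.
toy_dialogs = [
    [
        {"action": "SpecifyInfo", "action_supervision": {"args": ["color"]}},
        {"action": "AddToCart", "action_supervision": None},
    ],
    [
        {"action": "SpecifyInfo", "action_supervision": {"args": ["price"]}},
        {"action": "SpecifyInfo", "action_supervision": {"args": ["brand"]}},
        {"action": "None", "action_supervision": None},
    ],
]

last_round_only = count_evaluated_rounds(toy_dialogs, single_round_eval=True)
all_rounds = count_evaluated_rounds(toy_dialogs, single_round_eval=False)
print(last_round_only, all_rounds)
```

With these toy dialogs, last-round-only evaluation leaves no rounds with attribute supervision, while evaluating all rounds scores the three rounds that carry attributes.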

satwikkottur commented 4 years ago

Hello @sjlee2016,

The private file provided for devtest in the teststd format (`fashion_devtest_dials_api_calls_teststd_format_private.json`) contains various rounds sampled for evaluation, not simply the last round. Could you please run the action evaluation script with the flag `--action_json_path` pointing to this file?

Feel free to get back if this does not address your concern.