TonyNemo / UBAR-MultiWOZ

AAAI 2021: "UBAR: Towards Fully End-to-End Task-Oriented Dialog System with GPT-2"

The context used in evaluation #7

311dada commented 3 years ago

Should the dialogue context be oracle or generated?

TonyNemo commented 3 years ago

In a realistic setting, the context should be generated.
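
To make the distinction concrete, here is a minimal sketch (not UBAR's actual code; `model_generate`, `dialog`, and the turn fields are hypothetical stand-ins) of how the two settings assemble the session-level context:

```python
# Minimal sketch (not UBAR's actual code) of session-level evaluation.
# `model_generate`, `dialog`, and the turn fields are hypothetical stand-ins.

def rollout_session(dialog, model_generate, use_generated_context=True):
    """Evaluate one dialog, feeding either generated or oracle history."""
    context = []   # running token/segment sequence the model conditions on
    outputs = []
    for turn in dialog:  # each turn: user utterance plus gold bspn/aspn/resp
        context.append(turn["user"])
        # The model decodes belief state, system act, and delexicalized
        # response conditioned on the whole session so far.
        bspn, aspn, resp = model_generate(context)
        outputs.append((bspn, aspn, resp))
        if use_generated_context:
            # Realistic (end-to-end) setting: later turns see what the
            # model itself produced, so early errors can propagate.
            context.extend([bspn, aspn, resp])
        else:
            # Oracle setting (the use_true_prev_* flags): later turns see
            # the ground-truth annotations instead.
            context.extend([turn["bspn"], turn["aspn"], turn["resp"]])
    return outputs
```

Since an oracle history prevents early errors from propagating, the oracle scores are usually an upper bound on the generated-context scores.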

311dada commented 3 years ago

But something strange happens in evaluation with the checkpoint you provide. Specifically, in the Response Generation setup, I get the same result as in issue #4. The weird thing is that I get a somewhat lower result after giving the model the gold context (gold responses). In particular, I use the following command:

```
python train.py -mode test -cfg eval_load_path=$path \
    use_true_prev_bspn=True use_true_prev_aspn=True use_true_db_pointer=True \
    use_true_prev_resp=True use_true_curr_bspn=True use_true_curr_aspn=True \
    use_all_previous_context=True cuda_device=0
```

And I get match: 96.10, success: 90.80, BLEU: 22.06, combined score: 115.51 on the test split. Am I doing something wrong? Could you explain this to me? Thanks for your help!
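
For comparison, the fully generated-context (end-to-end) run would presumably toggle the oracle flags off, along these lines (a sketch assuming the `use_true_*` flags accept False the same way they accept True above):

```
python train.py -mode test -cfg eval_load_path=$path \
    use_true_prev_bspn=False use_true_prev_aspn=False use_true_db_pointer=False \
    use_true_prev_resp=False use_true_curr_bspn=False use_true_curr_aspn=False \
    use_all_previous_context=True cuda_device=0
```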

SkyAndCloud commented 3 years ago

I have the same question. @TonyNemo