TonyNemo / UBAR-MultiWOZ

AAAI 2021: "UBAR: Towards Fully End-to-End Task-Oriented Dialog System with GPT-2"

Question about the end-to-end evaluation #3

jimmy-red opened this issue 3 years ago (Open)

jimmy-red commented 3 years ago

Hi, thanks for the wonderful work.

I am trying to understand the evaluation script, but some of the logic in the code confuses me.

In https://github.com/TonyNemo/UBAR-MultiWOZ/blob/3f317e95e4e1e82ddf14f039bad1b8df6373fc2c/eval.py#L571

This condition will always be False (since cfg.use_true_bspn_for_ctr_eval=True), so bspn = turn['bspn'], which means you use the ground truth to query the DBs instead of the generated belief state.
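Roughly, I read the branch as something like the sketch below; the generated-state key and the DB query helper are placeholders I'm using for illustration, not the exact identifiers in eval.py:

```python
# Sketch of the branch being discussed (only cfg.use_true_bspn_for_ctr_eval and
# turn['bspn'] appear above; the other names are illustrative placeholders).
if not cfg.use_true_bspn_for_ctr_eval:   # always False when the flag is set to True
    bspn = turn['bspn_gen']              # model-generated belief state (placeholder key)
else:
    bspn = turn['bspn']                  # ground-truth belief state ends up being used
db_result = query_db(bspn)               # placeholder for the repo's DB query helper
```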

It is strange that you query the DBs with the ground-truth dialog state and then compare that result with the one queried from the goal of the dialogue.

Best. :)

TonyNemo commented 3 years ago

It is very thorough of you.

Yes, whether to use the ground-truth or the generated belief state to query the DB results should be considered for the end-to-end setting. If we use the generated belief state to query the DBs, which would be more realistic, the performance of UBAR in the end-to-end setting drops a little, to a combined score of ~103.
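Concretely, that means running the evaluation with the flag above switched off; how the flag is actually set (config file vs. command-line argument) isn't shown here, so this is just a sketch:

```python
# Evaluate the end-to-end setting with the generated belief state for DB queries
# by turning off the ground-truth flag before evaluation (illustrative only).
cfg.use_true_bspn_for_ctr_eval = False
```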

Currently, I am not planning to update the scores in the paper, though it is probably the right thing to do.

TonyNemo commented 3 years ago

Fixed the code and will update the results.

TonyNemo commented 3 years ago

We use the generated belief state to query DB results. Here are the results in the end-to-end setting on MultiWOZ 2.0: inform 91.5, success 77.4, BLEU 17.0, combined score 101.5.
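For reference, the combined score here follows the usual MultiWOZ convention of BLEU + 0.5 * (inform + success):

```python
# Standard MultiWOZ combined score: BLEU + 0.5 * (inform + success)
inform, success, bleu = 91.5, 77.4, 17.0
combined = bleu + 0.5 * (inform + success)
print(round(combined, 2))  # 101.45, reported above as 101.5
```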

newcolour1994 commented 3 years ago

> We use the generated belief state to query DB results. Here are the results in the end-to-end setting on MultiWOZ 2.0: inform 91.5, success 77.4, BLEU 17.0, combined score 101.5.

So which model did you use?

comprehensiveMap commented 3 years ago

> We use the generated belief state to query DB results. Here are the results in the end-to-end setting on MultiWOZ 2.0: inform 91.5, success 77.4, BLEU 17.0, combined score 101.5.
>
> So which model did you use?

I have the same question. I trained my own model (60th epoch, loss 0.54), but the scores I got are inform 88.7, success 75.2, BLEU 14.61, combined score 96.56. Could the author provide the checkpoint for reproducing the results mentioned above?