jimmy-red opened this issue 3 years ago
It is very thorough of you.
Yes, it should be considered whether to use the ground truth or the generated belief state to query the DB results in the end-to-end setting. If we use the generated BS to query the DBs, which would be the more realistic choice, the performance of UBAR in the end-to-end setting drops a little, to a combined score of ~103 (as sketched below).
Currently, I am not planning to update the scores in the paper, though it is probably the right thing to do.
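For illustration, here is a minimal sketch of the two query settings mentioned above, assuming a turn dict with 'bspn' / 'bspn_gen' keys and a db object with a query method (these names are placeholders, not the repository's actual interface):

```python
# Minimal sketch (placeholder names, not the repository's actual code):
# in the end-to-end setting the DB should be queried with the *generated*
# belief state; the ground-truth state only makes sense for the
# policy-optimization setting.

def query_db_for_eval(turn, db, end_to_end=True):
    """Pick which belief state drives the DB query during evaluation."""
    if end_to_end:
        bspn = turn['bspn_gen']   # generated belief state (realistic setting)
    else:
        bspn = turn['bspn']       # ground-truth belief state
    return db.query(bspn)
```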
Fixed the code and will update the results.
We used the generated BS to query the DB results; here are the results in the end-to-end setting on MultiWOZ 2.0: inform 91.5, success 77.4, BLEU 17.0, combined score 101.5.
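For reference, this combined score follows the standard MultiWOZ convention of 0.5 × (inform + success) + BLEU: 0.5 × (91.5 + 77.4) + 17.0 ≈ 101.5.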
so which model do you use?
I have the same question. I used my own trained model (60th epoch, loss 0.54), but the scores I got are inform 88.7, success 75.2, BLEU 14.61, combined score 96.56. Could the author provide the checkpoint for reproducing the results mentioned above?
Hi, thanks for the wonderful work.
I am trying to understand the evaluation script, but some of the logic in the code confuses me.
In https://github.com/TonyNemo/UBAR-MultiWOZ/blob/3f317e95e4e1e82ddf14f039bad1b8df6373fc2c/eval.py#L571
This condition will always be False (since cfg.use_true_bspn_for_ctr_eval=True), so
bspn = turn['bspn']
is executed, which means you use the ground truth to query the DBs instead of the generated one. It seems strange to query the DBs with the golden dialog state and then compare the result with the one queried from the goal of this dialogue (see the sketch below).
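To make the point concrete, here is a paraphrase of the branch in question (not a verbatim copy of eval.py; cfg and turn are assumed to be the config object and turn dict used there):

```python
# Paraphrase of the logic being discussed, not a verbatim copy of eval.py.
# With cfg.use_true_bspn_for_ctr_eval left at True, the first branch is never
# taken, so the ground-truth belief state drives the DB query.
if not cfg.use_true_bspn_for_ctr_eval:    # always False under the default config
    bspn = turn['bspn_gen']               # generated belief state (end-to-end)
else:
    bspn = turn['bspn']                   # ground-truth belief state
```

For a genuine end-to-end evaluation, one would presumably set cfg.use_true_bspn_for_ctr_eval = False so that the generated belief state is used instead.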
Best. :)