Shikib / fed

Code for SIGdial 2020 paper: Unsupervised Evaluation of Interactive Dialog with DialoGPT (https://arxiv.org/abs/2006.12719)

Interpreting scores from FED metric #1

Closed dhecloud closed 4 years ago

dhecloud commented 4 years ago

Hi! Thank you for your interesting research and for releasing the code. How would you suggest interpreting the scores of the FED metric? Since there aren't any positive follow-up utterances for some metrics (relevant, correct), the scores/loss for those metrics would always be positive. What would be a good gauge for determining a good score versus a bad score?

Shikib commented 4 years ago

Thanks for your question. Ultimately the only thing that really makes sense is to use the scores in a relative manner (e.g., to compare between two systems). This would be equivalent to computing FED scores for a baseline system, defining that as "0", and then evaluating all future systems relative to the baseline system (e.g., +X above the baseline). You could theoretically set an upper bound in a similar way by using the FED scores for human dialogs.
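The baseline-relative scheme described above can be sketched in a few lines. This is only an illustration, not part of this repo's API: `relative_fed` and the numbers in the example are made up, and the raw scores are assumed to come from whatever FED computation you already run for each system.

```python
def relative_fed(system_score: float,
                 baseline_score: float,
                 human_score: float) -> float:
    """Map a raw FED score onto a relative scale where the baseline
    system is 0 and human dialogs (the upper bound) are 1.

    All three arguments are raw FED scores computed the same way;
    this function only rescales them for comparison.
    """
    return (system_score - baseline_score) / (human_score - baseline_score)

# Illustrative (made-up) raw scores: baseline = -2.0, human = -0.5.
# A system scoring -1.25 sits exactly halfway between the two.
print(relative_fed(-1.25, baseline_score=-2.0, human_score=-0.5))  # 0.5
```

A score above 0 then means "better than the baseline" and a score near 1 means "close to human dialogs", which is the relative reading suggested above.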

dhecloud commented 4 years ago

Great. Thanks for your reply!