Closed: dhecloud closed this issue 4 years ago
Thanks for your question. Ultimately the only thing that really makes sense is to use the scores in a relative manner (e.g., to compare between two systems). This would be equivalent to computing FED scores for a baseline system, defining that as "0", and then evaluating all future systems relative to the baseline system (e.g., +X above the baseline). You could theoretically set an upper bound in a similar way by using the FED scores for human dialogs.
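As a rough illustration of the baseline-relative reading described above, here is a minimal sketch. It assumes you already have one FED score per quality for each system, stored as plain dictionaries. The function name `relative_scores`, the dictionary layout, and the numbers are all hypothetical placeholders, not part of the released code or real results. Whether a higher or lower raw score is better depends on how the scores were computed, so treat the sign accordingly.

```python
from typing import Dict, Optional


def relative_scores(
    system: Dict[str, float],
    baseline: Dict[str, float],
    human: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Express per-quality scores relative to a baseline defined as 0.

    Without `human`, returns the raw difference (system - baseline).
    With `human`, rescales so that baseline maps to 0 and human dialogs map to 1.
    """
    result = {}
    for quality, base in baseline.items():
        delta = system[quality] - base
        if human is None:
            result[quality] = delta
            continue
        span = human[quality] - base
        if span == 0:
            # Human and baseline scores coincide, so no scale can be defined.
            raise ValueError(f"human and baseline scores are equal for {quality!r}")
        result[quality] = delta / span
    return result


# Placeholder numbers for illustration only (not real FED outputs).
baseline = {"relevant": 2.0, "correct": 4.0}
new_system = {"relevant": 3.0, "correct": 5.0}
human_dialogs = {"relevant": 4.0, "correct": 8.0}

print(relative_scores(new_system, baseline))
print(relative_scores(new_system, baseline, human_dialogs))
```

With the optional human reference, a value near 0 means "about as good as the baseline" and a value near 1 means "about as good as human dialogs" on that quality. Without it, only the sign and size of the difference between systems are meaningful.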
Great. Thanks for your reply!
Hi! Thank you for your interesting research and for releasing the code. How would you suggest interpreting the scores of the FED metric? Since there aren't any positive follow-up utterances for some metrics (relevant, correct), the scores/loss for those metrics would always be positive. What would be a good gauge for telling a good score from a bad one?