Closed — zhongpeixiang closed this issue 4 years ago
Hi! Those are human ratings of empathy, relevance, and understandability for the conversations themselves in the dataset. We don't show any results about that in our paper.
Thank you for the quick and helpful response!
Can you please explain which ratings refer to empathy, relevance, and understandability? Why do we have 2 different ratings?
I believe they're in that order (empathy, relevance, and understandability), but I'm not positive because these ratings were collected before I came onto this part of the project. Similarly, my understanding is that these are ratings from two different people per line, separated by an underscore.
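If that interpretation is right, the field can be parsed with a short sketch. Note this is an assumption based on the discussion above (the dimension order and the rater/underscore convention are not confirmed by the authors), and `parse_selfeval` is a hypothetical helper, not part of the dataset's tooling:

```python
# Hypothetical parser for a `selfeval` value such as "4|3|4_3|5|5",
# assuming the format "<rater1>_<rater2>" where each rater's scores are
# "<empathy>|<relevance>|<understandability>" (order per this thread).

def parse_selfeval(value: str) -> list[dict[str, int]]:
    """Split a selfeval string into one score dict per rater."""
    dimensions = ("empathy", "relevance", "understandability")
    raters = []
    for rater in value.split("_"):          # one chunk per annotator
        scores = [int(s) for s in rater.split("|")]
        raters.append(dict(zip(dimensions, scores)))
    return raters

print(parse_selfeval("4|3|4_3|5|5"))
# → [{'empathy': 4, 'relevance': 3, 'understandability': 4},
#    {'empathy': 3, 'relevance': 5, 'understandability': 5}]
```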
So, if I understand correctly, they collected two different ratings per line from MTurk workers. But then how did they collect Table 8, "Human evaluation metrics from rating task for additional experiments," in the paper? It seems only 3 ratings are described there.
Hi @anar-rzayev , Table 8 is where we did human evaluations of the models themselves, not human evaluations of the dataset.
What's the interpretation of the `selfeval` field in the dataset? For example, what's the meaning of `4|3|4_3|5|5`? Thanks