Shikib / fed

Code for SIGdial 2020 paper: Unsupervised Evaluation of Interactive Dialog with DialoGPT (https://arxiv.org/abs/2006.12719)

Cannot interpret fed results #4

Open Youmna-H opened 1 year ago

Youmna-H commented 1 year ago

Hello, I've run the example provided in fed_demo.py and I'm having difficulty interpreting the results. The scores I got are:

{'interesting': -0.28983132044474313,
 'engaging': -0.40840943654378226,
 'specific': -0.22960980733235647,
 'relevant': 7.028880596160889,
 'correct': 7.123448371887207,
 'semantically appropriate': 0.2597320079803467,
 'understandable': 0.21886169910430908,
 'fluent': 0.23042782147725394,
 'coherent': 7.030221144358317,
 'error recovery': 6.849422454833984,
 'consistent': 7.3398823738098145,
 'diverse': 7.251625696818034,
 'depth': 7.140579700469971,
 'likeable': -0.23120896021525006,
 'understand': 7.056127548217773,
 'flexible': -0.09475564956665039,
 'informative': -0.16989962259928415,
 'inquisitive': -0.34922027587890625}

In the paper, all the scores are in the ranges 1-3, 0-1, or 1-5. For the scores above, I don't know what the upper and lower bounds are. Also, in fed.py the score is calculated as scores[metric] = (low_score - high_score). Does this mean that a negative score is a good thing? Please advise on how to interpret these scores.
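For context, my understanding of what fed.py is doing is roughly the following sketch. It assumes the Hugging Face transformers API and the microsoft/DialoGPT-large checkpoint; the follow-up utterances and the fed_score/utterance_loss names here are illustrative placeholders, not the full sets or exact code from fed.py:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative follow-up utterances for one metric; fed.py defines
# full positive/negative sets for each of the 18 metrics.
FOLLOW_UPS = {
    "interesting": {
        "positive": ["Wow that is really interesting."],
        "negative": ["That's not very interesting."],
    },
}

def utterance_loss(model, tokenizer, text):
    """Average per-token negative log-likelihood of `text` under DialoGPT."""
    input_ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss
    return loss.item()

def fed_score(model, tokenizer, conversation, metric):
    """score = mean loss of negative follow-ups minus mean loss of positive ones.

    Both terms are unbounded language-model losses, so the difference is
    not confined to a fixed range like 1-5.
    """
    utts = FOLLOW_UPS[metric]

    def mean_loss(follow_ups):
        losses = [
            utterance_loss(model, tokenizer, conversation + " <|endoftext|> " + u)
            for u in follow_ups
        ]
        return sum(losses) / len(losses)

    return mean_loss(utts["negative"]) - mean_loss(utts["positive"])

tokenizer = GPT2Tokenizer.from_pretrained("microsoft/DialoGPT-large")
model = GPT2LMHeadModel.from_pretrained("microsoft/DialoGPT-large")
conversation = "<|endoftext|> Hi! <|endoftext|> Hello, how are you today?"
print(fed_score(model, tokenizer, conversation, "interesting"))
```

If this reading is right, higher values are better: a positive score means the negative follow-ups were less likely than the positive ones under DialoGPT, and a negative score means the opposite. The raw values are differences of average losses rather than Likert-scale ratings, so I would expect they only carry meaning relative to other dialogs scored the same way (the paper evaluates them via correlation with human judgments, not via absolute ranges). Someone please correct me if this is wrong.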

ShreyaPrabhu commented 1 year ago

Hello,

I am working on a dialog system that requires a reference-free metric, and I am facing a similar issue: I am unable to interpret the metrics being generated. Was this issue resolved? Any help would be appreciated.

demon-ling commented 1 year ago

I have the same question. Have you solved it yet?