budzianowski / multiwoz

Source code for end-to-end dialogue model from the MultiWOZ paper (Budzianowski et al. 2018, EMNLP)
MIT License
852 stars · 199 forks

The upper bound of Inform and Success rate? #20

Open yunhaoli1995 opened 4 years ago

yunhaoli1995 commented 4 years ago

I ran evaluate.py and got Matches (Inform): 90.40, Success: 82.3. Are these the upper bounds of the Inform and Success metrics? In some papers, Inform and Success exceed 90.40 and 82.3. For example, DAMD reports Inform 95.4 and Success 87.2 under its data-augmentation setting, which confuses me a lot.

budzianowski commented 4 years ago

Hi, can you explain in more detail which models you evaluated?

yunhaoli1995 commented 4 years ago

Sorry, I didn't explain clearly. I evaluated data/test_dials (the ground truth) with the script evaluate.py and got Matches (Inform): 90.40, Success: 82.3. I want to know whether these are the upper bounds of the Inform and Success metrics. DAMD reports Inform 95.4 and Success 87.2 under its data-augmentation setting, which confuses me a lot.

skiingpacman commented 4 years ago

I think there is likely an upper bound on Inform Rate of about 91.6% on the MultiWOZ 2.0 test set, due to a combination of how Inform Rate is implemented and errors in the belief state in the test set. This follows from the metric internally using the test set to provide the "oracle" belief state when sampling the venues the policy presents.

In practice, evaluating the test-set dialogues themselves (as per @leeyunhao) I got min 90.3%, max 90.9%, mean 90.54% +/- 0.46% (+/- 2 * STD) over 5 runs.

For more details see comment: https://github.com/budzianowski/multiwoz/issues/2#issuecomment-689071719

comprehensiveMap commented 3 years ago

> Sorry, I didn't explain clearly. I evaluated data/test_dials (the ground truth) with the script evaluate.py and got Matches (Inform): 90.40, Success: 82.3. I want to know whether these are the upper bounds of the Inform and Success metrics. DAMD reports Inform 95.4 and Success 87.2 under its data-augmentation setting, which confuses me a lot.

Hello, I am confused by this too. Have you solved it? In my opinion, the DAMD evaluation script differs from this one: DAMD counts a 'match' as 1 if the set of returned venues has any overlap with the set of true venues, whereas this script, as you can see, requires the single randomly selected venue to be included in the set of true venues.
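To make the difference concrete, here is a toy sketch of the two match criteria as described in this thread. The venue names and variable names are illustrative only; this is not the actual code from evaluate.py or DAMD.

```python
import random

# Hypothetical venue sets for one dialogue (names are made up for illustration).
offered = {"curry garden", "pizza hut"}        # venues the system offered
goal = {"curry garden", "charlie chan"}        # venues satisfying the user's goal

# DAMD-style match: 1 if the offered set overlaps the goal set at all.
damd_match = int(len(offered & goal) > 0)

# This repo's style of match (as described above): sample one offered
# venue at random and require that it be among the goal venues.
sampled = random.choice(sorted(offered))
repo_match = int(sampled in goal)

print(damd_match)   # 1 here, since the sets overlap
print(repo_match)   # 0 or 1, depending on which venue was sampled
```

Because the set-overlap criterion can only be satisfied more often than the sampled-venue criterion, scores computed the DAMD way can exceed those from this repo's script on the same outputs.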

yunhaoli1995 commented 3 years ago

> > Sorry, I didn't explain clearly. I evaluated data/test_dials (the ground truth) with the script evaluate.py and got Matches (Inform): 90.40, Success: 82.3. I want to know whether these are the upper bounds of the Inform and Success metrics. DAMD reports Inform 95.4 and Success 87.2 under its data-augmentation setting, which confuses me a lot.
>
> Hello, I am confused by this too. Have you solved it? In my opinion, the DAMD evaluation script differs from this one: DAMD counts a 'match' as 1 if the set of returned venues has any overlap with the set of true venues, whereas this script, as you can see, requires the single randomly selected venue to be included in the set of true venues.

It's still unsolved. But at least I think models should be compared with the same evaluation script; otherwise the comparison is meaningless.