yunhaoli1995 opened this issue 4 years ago
Hi, can you explain in more detail what models you evaluated?
Sorry, I didn't explain clearly. I evaluated the ground-truth dialogues in data/test_dials with the script evaluate.py and got Matches (Inform): 90.40 and Success: 82.3. I want to know whether these are the upper bounds of the Inform and Success metrics. In DAMD, under the data augmentation setting, it can get Inform 95.4 and Success 87.2, which confuses me a lot.
I think there is a likely upper bound on Inform Rate of 91.6% on the MultiWOZ 2.0 test set, due to a combination of how Inform Rate is implemented and errors in the belief-state annotations in the test set. The metric internally uses the test set to provide the "oracle" belief state when sampling the venues that the policy presents (a sketch of this check follows below).
In practice, evaluating the test-set dialogues themselves (as per @leeyunhao), I got min 90.3%, max 90.9%, mean 90.54% +/- 0.46% (+/- 2 * STD) over 5 runs.
For more details see comment: https://github.com/budzianowski/multiwoz/issues/2#issuecomment-689071719
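To make the sampling effect concrete, here is a minimal sketch of the match check described above. This is not the actual evaluate.py code; `query_db` and the argument names are hypothetical placeholders.

```python
import random

def sampled_inform_match(oracle_belief_state, true_goal_venues, query_db):
    """Return 1 if the single venue offered by the policy satisfies the user goal.

    The evaluator looks venues up with the test set's own ("oracle") belief
    state; if that annotation is wrong, the sampled venue can fall outside
    the goal's true venue set, so even ground-truth dialogues cannot reach
    100% Inform.
    """
    candidates = query_db(oracle_belief_state)   # venues matching the annotated constraints
    if not candidates:
        return 0
    offered = random.choice(candidates)          # one venue is sampled for the system to offer
    return int(offered in true_goal_venues)
```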
Hello, I am confused by this too. Have you solved this problem? In my opinion, the DAMD evaluation script differs: DAMD counts a 'match' as 1 if the set of returned venues has any overlap with the set of true venues, whereas in this script, as you can see, the randomly selected venue has to be included in the set of true venues.
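For illustration, here is a hedged sketch of the two 'match' criteria being contrasted; the function names are mine and not taken from either codebase.

```python
import random

def overlap_match(returned_venues, true_venues):
    # DAMD-style rule (per the comment above): any overlap between the
    # returned venues and the true goal venues counts as a match.
    return int(bool(set(returned_venues) & set(true_venues)))

def sampled_match(returned_venues, true_venues):
    # This repo's rule (per the comment above): the one randomly chosen
    # venue must itself be a true goal venue.
    if not returned_venues:
        return 0
    return int(random.choice(returned_venues) in true_venues)

# Example: with returned = ["a", "b", "c"] and true = ["b"],
# overlap_match is always 1, while sampled_match is 1 only about one
# time in three, so the sampled criterion is strictly harder and gives
# a lower Inform/match rate on the same dialogues.
```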
It's still unsolved. But at the very least I think models should be compared with the same evaluation script; otherwise the comparison is meaningless.
I ran evaluate.py and got Matches (Inform): 90.40 and Success: 82.3. Are these the upper bounds of the Inform and Success metrics? In some papers, the reported Inform and Success rates exceed 90.40 and 82.3. In DAMD, under the data augmentation setting, it gets Inform 95.4 and Success 87.2, which confuses me a lot.