ConvLab / ConvLab-3


[Feature] Benchmark Evaluation Script #15

Open · zqwerty opened this issue 2 years ago

**Describe the feature** Provide a unified, fast evaluation script for the benchmark, built on the unified datasets (#11) and the standard dataloaders (#14). Evaluation metrics:

**Expected behavior** The script should evaluate different models of a module in the same way, making the results comparable.

**Additional context** The previous evaluation scripts (convlab2/$module/evaluate.py) could serve as a reference, but they do not use batch inference.
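
For concreteness, below is a minimal sketch of the kind of batched evaluation loop this issue asks for. The model interface (`predict_batch`) and the example fields are hypothetical placeholders for illustration, not the actual ConvLab-3 API.

```python
# A minimal sketch of batched NLU evaluation. The hypothetical model object exposes
# `predict_batch(utterances)` and each example looks like
# {"utterance": str, "dialogue_acts": ...}; this is NOT the actual ConvLab-3 API.
from typing import Dict, List


def evaluate_nlu_batched(nlu_model, examples: List[Dict], batch_size: int = 64) -> float:
    """Return utterance-level accuracy using one forward pass per batch."""
    predictions, golds = [], []
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        # Batched inference: one model call per batch instead of one per utterance.
        predictions.extend(nlu_model.predict_batch([ex["utterance"] for ex in batch]))
        golds.extend(ex["dialogue_acts"] for ex in batch)
    return sum(p == g for p, g in zip(predictions, golds)) / len(examples)
```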

zqwerty commented 2 years ago

NLU benchmark for BERTNLU and MILU on multiwoz21, tm1, tm2, tm3

To illustrate that it is easy to use the models on any dataset in our unified format, we report their performance on several such datasets. We follow README.md and the config files in unified_datasets/ to generate predictions.json, then evaluate it using ../evaluate_unified_datasets.py. Note that we use almost the same hyper-parameters for the different datasets, which may not be optimal.

Metrics (a minimal sketch of how they can be computed follows the table):

- Acc: whether all dialogue acts of an utterance are correctly predicted.
- F1: F1 measure of the dialogue act predictions over the corpus.

| Model | MultiWOZ 2.1 (Acc / F1) | Taskmaster-1 (Acc / F1) | Taskmaster-2 (Acc / F1) | Taskmaster-3 (Acc / F1) |
| --- | --- | --- | --- | --- |
| T5-small | 77.8 / 86.5 | 74.0 / 52.5 | 80.0 / 71.4 | 87.2 / 83.1 |
| T5-small (context=3) | 82.0 / 90.3 | 76.2 / 56.2 | 82.4 / 74.3 | 89.0 / 85.1 |
| BERTNLU | 74.5 / 85.9 | 72.8 / 50.6 | 79.2 / 70.6 | 86.1 / 81.9 |
| BERTNLU (context=3) | 80.6 / 90.3 | 74.2 / 52.7 | 80.9 / 73.3 | 87.8 / 83.8 |
| MILU | 72.9 / 85.2 | 72.9 / 49.2 | 79.1 / 68.7 | 85.4 / 80.3 |
| MILU (context=3) | 76.6 / 87.9 | 72.4 / 48.5 | 78.9 / 68.4 | 85.1 / 80.1 |
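
For reference, here is a minimal sketch of the two metrics above, assuming each dialogue act is flattened into a hashable (intent, domain, slot, value) tuple. It illustrates the definitions only and is not the actual code in ../evaluate_unified_datasets.py.

```python
# Sketch of the NLU metrics, assuming dialogue acts are represented as
# (intent, domain, slot, value) tuples. Illustrative only, not ConvLab-3 code.
from typing import List, Set, Tuple

DialogueAct = Tuple[str, str, str, str]  # (intent, domain, slot, value)


def nlu_metrics(preds: List[Set[DialogueAct]], golds: List[Set[DialogueAct]]) -> dict:
    assert len(preds) == len(golds)
    # Acc: an utterance counts as correct only if all its dialogue acts match exactly.
    acc = sum(p == g for p, g in zip(preds, golds)) / len(golds)
    # F1: micro-averaged over individual dialogue acts across the corpus.
    tp = sum(len(p & g) for p, g in zip(preds, golds))
    fp = sum(len(p - g) for p, g in zip(preds, golds))
    fn = sum(len(g - p) for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}
```

Because Acc is utterance-level (the whole predicted set must match the gold set) while F1 is computed over individual dialogue acts across the corpus, the two numbers can diverge considerably, as seen in the table.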