ConvLab / ConvLab-3


[Feature] Benchmark Evaluation Script #15

Open · zqwerty opened this issue 2 years ago

zqwerty commented 2 years ago

Describe the feature

Provide a unified, fast evaluation script for the benchmark, built on the unified datasets (#11) and the standard dataloaders (#14).

Evaluation metrics:

Expected behavior

The scripts should evaluate different models of a module in the same way, so that the results are comparable.

Additional context

The previous evaluation scripts (convlab2/$module/evaluate.py) can serve as a reference, but they do not use batch inference.
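
A minimal sketch of the batched evaluation loop this asks for, assuming a model that exposes a `predict_batch` method and test samples with `utterance` / `dialogue_acts` fields. These names are placeholders for illustration, not the actual ConvLab-3 interfaces:

```python
import json


def evaluate_in_batches(model, samples, batch_size=32):
    """Collect predictions by running the model on fixed-size batches of utterances."""
    predictions = []
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        utterances = [sample["utterance"] for sample in batch]
        # One forward pass for the whole batch instead of one call per utterance.
        batch_preds = model.predict_batch(utterances)
        for sample, pred in zip(batch, batch_preds):
            predictions.append({
                "utterance": sample["utterance"],
                "golden": sample["dialogue_acts"],
                "prediction": pred,
            })
    return predictions


if __name__ == "__main__":
    # `DummyModel` and `test_samples` stand in for a real NLU model and the
    # standard dataloaders (#14); they only show the intended shape of predictions.json.
    class DummyModel:
        def predict_batch(self, utterances):
            return [[] for _ in utterances]  # predicts no dialogue acts

    test_samples = [{"utterance": "I need a cheap hotel.", "dialogue_acts": []}]
    with open("predictions.json", "w") as f:
        json.dump(evaluate_in_batches(DummyModel(), test_samples), f, indent=2)
```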

zqwerty commented 2 years ago

NLU benchmark for BERTNLU and MILU on multiwoz21, tm1, tm2, tm3

To illustrate how easy it is to use a model on any dataset in our unified format, we report performance on several such datasets. We follow README.md and the config files in unified_datasets/ to generate predictions.json, then evaluate it with ../evaluate_unified_datasets.py. Note that we use almost the same hyper-parameters for the different datasets, which may not be optimal.

- Acc: whether all dialogue acts of an utterance are correctly predicted.
- F1: F1 measure of the dialogue act predictions over the corpus.
| Model | MultiWOZ 2.1 (Acc / F1) | Taskmaster-1 (Acc / F1) | Taskmaster-2 (Acc / F1) | Taskmaster-3 (Acc / F1) |
|---|---|---|---|---|
| T5-small | 77.8 / 86.5 | 74.0 / 52.5 | 80.0 / 71.4 | 87.2 / 83.1 |
| T5-small (context=3) | 82.0 / 90.3 | 76.2 / 56.2 | 82.4 / 74.3 | 89.0 / 85.1 |
| BERTNLU | 74.5 / 85.9 | 72.8 / 50.6 | 79.2 / 70.6 | 86.1 / 81.9 |
| BERTNLU (context=3) | 80.6 / 90.3 | 74.2 / 52.7 | 80.9 / 73.3 | 87.8 / 83.8 |
| MILU | 72.9 / 85.2 | 72.9 / 49.2 | 79.1 / 68.7 | 85.4 / 80.3 |
| MILU (context=3) | 76.6 / 87.9 | 72.4 / 48.5 | 78.9 / 68.4 | 85.1 / 80.1 |
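
For reference, a minimal sketch of how the Acc and F1 above could be computed from a predictions.json-style list, assuming gold and predicted dialogue acts are stored as lists of (intent, domain, slot, value) tuples under `golden` and `prediction` keys. The field names and act representation are assumptions; ../evaluate_unified_datasets.py remains the canonical implementation:

```python
def dialogue_act_metrics(samples):
    """Exact-match accuracy per utterance and micro-F1 over the whole corpus."""
    acc_hits = 0
    tp = fp = fn = 0
    for sample in samples:
        gold = set(map(tuple, sample["golden"]))
        pred = set(map(tuple, sample["prediction"]))
        # Acc: the utterance counts as correct only if all acts match exactly.
        if gold == pred:
            acc_hits += 1
        # F1 is accumulated over the corpus, not averaged per utterance.
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    acc = acc_hits / len(samples) if samples else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": acc, "f1": f1}
```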