NLU benchmark for BERTNLU and MILU on multiwoz21, tm1, tm2, tm3
To illustrate that it is easy to use the model for any dataset in our unified format, we report performance on several datasets in that format. We follow `README.md` and the config files in `unified_datasets/` to generate `predictions.json`, then evaluate it using `../evaluate_unified_datasets.py`. Note that we use almost the same hyper-parameters for all datasets, which may not be optimal.
| Model | MultiWOZ 2.1 Acc | MultiWOZ 2.1 F1 | Taskmaster-1 Acc | Taskmaster-1 F1 | Taskmaster-2 Acc | Taskmaster-2 F1 | Taskmaster-3 Acc | Taskmaster-3 F1 |
|---|---|---|---|---|---|---|---|---|
| T5-small | 77.8 | 86.5 | 74.0 | 52.5 | 80.0 | 71.4 | 87.2 | 83.1 |
| T5-small (context=3) | 82.0 | 90.3 | 76.2 | 56.2 | 82.4 | 74.3 | 89.0 | 85.1 |
| BERTNLU | 74.5 | 85.9 | 72.8 | 50.6 | 79.2 | 70.6 | 86.1 | 81.9 |
| BERTNLU (context=3) | 80.6 | 90.3 | 74.2 | 52.7 | 80.9 | 73.3 | 87.8 | 83.8 |
| MILU | 72.9 | 85.2 | 72.9 | 49.2 | 79.1 | 68.7 | 85.4 | 80.3 |
| MILU (context=3) | 76.6 | 87.9 | 72.4 | 48.5 | 78.9 | 68.4 | 85.1 | 80.1 |
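For reference, a minimal sketch of how Acc/F1 numbers like these could be computed from `predictions.json`, assuming each entry carries a gold `dialogue_acts` annotation plus a `predictions['dialogue_acts']` field, and that Acc is utterance-level exact match while F1 is micro-F1 over flattened dialogue-act tuples. The key names and flattening shown here are assumptions; the actual `../evaluate_unified_datasets.py` may differ in details.

```python
import json

def flatten_da(da_dict):
    """Flatten a dialogue-act dict into hashable tuples.
    The (da_type -> list of {intent, domain, slot, value}) nesting
    is an assumption about the unified format."""
    tuples = set()
    for da_type, da_list in da_dict.items():
        for da in da_list:
            tuples.add((da_type,
                        da.get('intent', ''),
                        da.get('domain', ''),
                        da.get('slot', ''),
                        da.get('value', '')))
    return tuples

def evaluate(predict_result_path):
    data = json.load(open(predict_result_path))
    acc, tp, fp, fn = [], 0, 0, 0
    for sample in data:
        gold = flatten_da(sample['dialogue_acts'])                  # gold annotation (assumed key)
        pred = flatten_da(sample['predictions']['dialogue_acts'])   # model output (assumed key)
        acc.append(gold == pred)        # exact match of the whole dialogue-act set
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.
    recall = tp / (tp + fn) if tp + fn else 0.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.
    return {'Acc': sum(acc) / len(acc), 'F1': f1}

if __name__ == '__main__':
    print(evaluate('predictions.json'))
```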
**Describe the feature**
Provide a unified, fast evaluation script for the benchmark, using unified datasets #11 and standard dataloaders #14.
Evaluation metrics:
**Expected behavior**
The scripts should evaluate different models of a module in the same way so that the results are comparable.
**Additional context**
The previous evaluation scripts (`convlab2/$module/evaluate.py`) could serve as a reference, but they do not use batch inference.
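A minimal sketch of the batched inference loop such a unified script could use, assuming a HuggingFace-style `tokenizer`/`model` pair and a placeholder `decode_fn` that turns model outputs into unified-format dialogue acts; these names are illustrative placeholders, not existing ConvLab APIs.

```python
import torch
from torch.utils.data import DataLoader

def batched_predict(model, tokenizer, samples, decode_fn, batch_size=64, device='cuda'):
    """Run NLU inference in batches instead of utterance-by-utterance.
    `model`, `tokenizer`, and `decode_fn` are placeholders for whatever the
    module under evaluation provides; only the batching pattern matters here."""
    loader = DataLoader(samples, batch_size=batch_size, collate_fn=lambda x: x)
    model.to(device).eval()
    predictions = []
    with torch.no_grad():
        for batch in loader:
            # Tokenize the whole batch at once (assumes a HuggingFace-style tokenizer).
            enc = tokenizer([s['utterance'] for s in batch],
                            padding=True, truncation=True, return_tensors='pt').to(device)
            outputs = model(**enc)
            # decode_fn is assumed to return one dialogue-act dict per batch element.
            for sample, output in zip(batch, decode_fn(outputs)):
                predictions.append({**sample, 'predictions': {'dialogue_acts': output}})
    return predictions
```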