This adds evaluation for Mistral and Llama (with LoRA adapters, and with fine-tuning on top of the first 30 frozen layers). Unfortunately, the results are worse compared to DeBERTa; see the table below.
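The training code itself lives in the PR; below is only a minimal sketch of the two setups being compared, assuming Hugging Face `transformers` and `peft`. The model name, label count, and LoRA hyperparameters are illustrative placeholders, not the PR's actual configuration.

```python
# Sketch of the two fine-tuning setups (illustrative values only;
# the authoritative code is in this PR).
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=5)

# Setup 1 ("Mistral-30"): freeze the first 30 of Mistral-7B's 32 decoder
# layers and fine-tune only the remaining layers plus the classifier head.
for layer in model.model.layers[:30]:
    for param in layer.parameters():
        param.requires_grad = False

# Setup 2 ("Mistral-LoRA"): keep the base model frozen and train
# low-rank adapter weights instead.
lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=16, lora_alpha=32, lora_dropout=0.05,   # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],
)
lora_model = get_peft_model(
    AutoModelForSequenceClassification.from_pretrained(base, num_labels=5),
    lora_config,
)
lora_model.print_trainable_parameters()
```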
How do the official metrics look for that?
Here are the results based on src/evaluation/eval_official.py:
| model | argument general-F1 | argument focused-F1 | illocution general-F1 | illocution focused-F1 |
|---|---|---|---|---|
| Mistral-30 | 0.481 | 0.242 | 0.855 | 0.705 |
| Mistral-LoRA | 0.544 | 0.314 | 0.839 | 0.687 |
| Llama-30 | 0.445 | 0.211 | 0.844 | 0.682 |
| Llama-LoRA | 0.515 | 0.286 | 0.831 | 0.672 |
| DeBERTa | 0.589 | 0.361 | 0.846 | 0.697 |
Note that it is not unusual for RoBERTa and DeBERTa to outperform Mistral and Llama on sequence classification tasks. See, e.g., this comparison.
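For readers without access to src/evaluation/eval_official.py, here is a minimal sketch of how the two metric flavours could be computed with scikit-learn. The label names and the general/focused distinction (all labels vs. relation labels only, excluding the no-relation class) are assumptions for illustration, not the script's actual definitions.

```python
# Rough illustration of the metric shapes reported above (assumed
# definitions; the authoritative implementation is eval_official.py).
from sklearn.metrics import f1_score

gold = ["RA", "CA", "MA", "NONE", "RA", "NONE"]   # example gold labels
pred = ["RA", "CA", "NONE", "NONE", "RA", "MA"]   # example predictions

# "general" F1: macro-averaged over every label, including NONE.
general_f1 = f1_score(gold, pred, average="macro")

# "focused" F1: macro-averaged over the relation labels only,
# ignoring the NONE class (one plausible reading of "focused").
focused_f1 = f1_score(gold, pred, labels=["RA", "CA", "MA"], average="macro")

print(f"general-F1: {general_f1:.3f}, focused-F1: {focused_f1:.3f}")
```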