This adds evaluation for Mistral and Llama (with LoRA adapters, and with fine-tuning on top of the first 30 frozen layers). Unfortunately, the results are worse compared to DeBERTa; see the table below.
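The training code itself lives in the PR; below is only a minimal sketch of the two setups being compared, assuming Hugging Face `transformers` and `peft`. The model name, label count, and LoRA hyperparameters are illustrative placeholders, not the PR's actual configuration.

```python
# Sketch of the two fine-tuning setups (illustrative values only;
# the authoritative code is in this PR).
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=5)

# Setup 1 ("Mistral-30"): freeze the first 30 of Mistral-7B's 32 decoder
# layers and fine-tune only the remaining layers plus the classifier head.
for layer in model.model.layers[:30]:
    for param in layer.parameters():
        param.requires_grad = False

# Setup 2 ("Mistral-LoRA"): keep the base model frozen and train
# low-rank adapter weights instead.
lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=16, lora_alpha=32, lora_dropout=0.05,   # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],
)
lora_model = get_peft_model(
    AutoModelForSequenceClassification.from_pretrained(base, num_labels=5),
    lora_config,
)
lora_model.print_trainable_parameters()
```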
How do the official metrics look for that?
Here are the results based on src/evaluation/eval_official.py:
| model | argument general-F1 | argument focused-F1 | illocution general-F1 | illocution focused-F1 |
|---|---|---|---|---|
| Mistral-30 | 0.481 | 0.242 | 0.855 | 0.705 |
| Mistral-LoRA | 0.544 | 0.314 | 0.839 | 0.687 |
| Llama-30 | 0.445 | 0.211 | 0.844 | 0.682 |
| Llama-LoRA | 0.515 | 0.286 | 0.831 | 0.672 |
| DeBERTa | 0.589 | 0.361 | 0.846 | 0.697 |
Note that it is not unusual for RoBERTa and DeBERTa to outperform Mistral and Llama on sequence classification tasks. See, e.g., this comparison.
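For readers without access to src/evaluation/eval_official.py, here is a minimal sketch of how the two metric flavours could be computed with scikit-learn. The label names and the general/focused distinction (all labels vs. relation labels only, excluding the no-relation class) are assumptions for illustration, not the script's actual definitions.

```python
# Rough illustration of the metric shapes reported above (assumed
# definitions; the authoritative implementation is eval_official.py).
from sklearn.metrics import f1_score

gold = ["RA", "CA", "MA", "NONE", "RA", "NONE"]   # example gold labels
pred = ["RA", "CA", "NONE", "NONE", "RA", "MA"]   # example predictions

# "general" F1: macro-averaged over every label, including NONE.
general_f1 = f1_score(gold, pred, average="macro")

# "focused" F1: macro-averaged over the relation labels only,
# ignoring the NONE class (one plausible reading of "focused").
focused_f1 = f1_score(gold, pred, labels=["RA", "CA", "MA"], average="macro")

print(f"general-F1: {general_f1:.3f}, focused-F1: {focused_f1:.3f}")
```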