ArneBinder / dialam-2024-shared-task

see http://dialam.arg.tech/

log update (results for mistral and llama) #36

Closed: tanikina closed this 1 month ago

tanikina commented 2 months ago

This adds evaluation results for Mistral and Llama in two setups: with LoRA adapters, and with fine-tuning where the first 30 layers are frozen.
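For context, the two setups correspond roughly to the following minimal sketch with `transformers` and `peft`; the model name, target modules, and LoRA hyperparameters here are illustrative assumptions, not the exact configuration used in this repo:

```python
# Rough sketch of the two tuning setups (illustrative; model name, target modules
# and hyperparameters are assumptions, not the exact config from this repo).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

NUM_LABELS = 3  # placeholder; the actual label set comes from the task


def build_lora_model(model_name: str = "mistralai/Mistral-7B-v0.1"):
    """Setup 1 ("*-LoRA"): LoRA adapters on the attention projections, base weights frozen."""
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=NUM_LABELS
    )
    lora_config = LoraConfig(
        r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="SEQ_CLS"
    )
    return get_peft_model(model, lora_config)


def build_partially_frozen_model(model_name: str = "mistralai/Mistral-7B-v0.1"):
    """Setup 2 ("*-30-frozen"): regular fine-tuning with the first 30 decoder layers frozen."""
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=NUM_LABELS
    )
    for layer in model.model.layers[:30]:
        for param in layer.parameters():
            param.requires_grad = False
    return model
```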

Unfortunately, the results are worse than those of DeBERTa:

| model | macro-f1 | micro-f1 |
|---|---|---|
| Mistral-30-frozen | 0.322 | 0.661 |
| Mistral-LoRA | 0.319 | 0.684 |
| Llama-30-frozen | 0.289 | 0.639 |
| Llama-LoRA | 0.324 | 0.682 |
| DeBERTa | 0.412 | 0.715 |
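For reference, macro-f1 is the unweighted mean of the per-class F1 scores, while micro-f1 pools true/false positives and negatives over all instances. A generic scikit-learn sketch with toy data (not this repo's evaluation code):

```python
# Generic illustration of macro vs. micro F1 aggregation (not this repo's eval code).
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 2, 2, 2]  # toy labels, just to show the two averaging modes
y_pred = [0, 1, 1, 2, 2, 0]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
micro_f1 = f1_score(y_true, y_pred, average="micro")  # F1 over pooled TP/FP/FN counts
print(f"macro-f1={macro_f1:.3f}, micro-f1={micro_f1:.3f}")
```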
ArneBinder commented 2 months ago

How do the official metrics look for that?

tanikina commented 2 months ago

> How do the official metrics look for that?

Here are the results based on src/evaluation/eval_official.py:

| model | argument general-f1 | argument focused-f1 | illocution general-f1 | illocution focused-f1 |
|---|---|---|---|---|
| Mistral-30 | 0.481 | 0.242 | 0.855 | 0.705 |
| Mistral-LoRA | 0.544 | 0.314 | 0.839 | 0.687 |
| Llama-30 | 0.445 | 0.211 | 0.844 | 0.682 |
| Llama-LoRA | 0.515 | 0.286 | 0.831 | 0.672 |
| DeBERTa | 0.589 | 0.361 | 0.846 | 0.697 |

Note that it is not unusual for RoBERTa and DeBERTa to outperform Mistral and Llama on sequence classification tasks; see, e.g., this comparison.