Results for training RoBERTa, DeBERTa, ELECTRA and RemBERT on the DialAM data with the fixed validation split, using the experiment config `dialam2024_merged_relations` and `task_learning_rate=1e-4` (https://github.com/ArneBinder/dialam-2024-shared-task/issues/21).
Summary of the evaluation on the validation set (averaged across 3 runs):
model | max_window | avg macro-f1 | avg micro-f1 |
---|---|---|---|
roberta-large | 512 | 0.395 | 0.712 |
roberta-large | 128 | 0.380 | 0.713 |
deberta-large | 512 | 0.412 | 0.715 |
deberta-large | 128 | 0.386 | 0.719 |
rembert | 512 | 0.387 | 0.699 |
electra | 512 | 0.303 | 0.630 |
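As a side note on the two metric columns: micro-F1 weights every relation decision equally, while macro-F1 averages per-label F1 scores, so rare S- and YA-labels pull it down, which explains the large gap between the two columns. A minimal sketch of the two aggregations with scikit-learn (toy labels; the simple per-run mean is an assumption about how the 3 runs were averaged):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy gold/predicted label sequences with a skewed label distribution
# (placeholders, not the actual DialAM annotations).
gold = ["NONE"] * 90 + ["Default Inference"] * 8 + ["Default Conflict"] * 2
pred = ["NONE"] * 88 + ["Default Inference"] * 10 + ["Default Conflict"] * 2

# micro-F1 counts every decision equally; macro-F1 averages per-class F1,
# so rare classes carry the same weight as frequent ones.
micro = f1_score(gold, pred, average="micro")
macro = f1_score(gold, pred, average="macro")

# Averaging across runs: here simply the mean of the per-run scores
# (an assumption about how the "avg" columns were computed).
runs_macro = [0.39, 0.41, 0.42]
print(micro, macro, np.mean(runs_macro))
```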
After running the conversion script and `src/evaluation/eval_official.py`, we get the following results for the DeBERTa and RoBERTa models:
metric | deberta-train | roberta-train | deberta-val | roberta-val |
---|---|---|---|---|
arguments/general.f1 | 0.8001 | 0.7989 | 0.6729 | 0.6508 |
arguments/focused.f1 | 0.5558 | 0.5542 | 0.4164 | 0.3885 |
illocutions/general.f1 | 0.9569 | 0.9568 | 0.8464 | 0.8482 |
illocutions/focused.f1 | 0.8433 | 0.8430 | 0.6965 | 0.6944 |
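For orientation, the sketch below shows a simplified set-based F1 over relation triples. It is not the logic of `src/evaluation/eval_official.py` (and it ignores the general/focused distinction); the triple format and helper name are assumed for illustration only:

```python
# Simplified illustration of relation-level F1: compare predicted vs. gold
# relations as sets of (source, target, label) triples. This is NOT the
# official eval_official.py logic, only a minimal stand-in.
from typing import Set, Tuple

Relation = Tuple[str, str, str]  # (source_node_id, target_node_id, label) -- assumed format

def relation_f1(pred: Set[Relation], gold: Set[Relation]) -> float:
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = {("P1", "P2", "Default Inference"), ("P2", "P3", "Default Rephrase")}
pred = {("P1", "P2", "Default Inference"), ("P2", "P3", "Default Conflict")}
print(relation_f1(pred, gold))  # 0.5
```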
Below are the evaluation results for the same models after removing the double edges for reversed S-relations and updating the S-node types. The scores for illocutions remained the same, but the scores for arguments changed: they increased on the training set (as expected) but, unfortunately, decreased on the validation set:
metric | deberta-train | roberta-train | deberta-val | roberta-val |
---|---|---|---|---|
arguments/general.f1 | 0.9154 | 0.9123 | 0.5888 | 0.5723 |
arguments/focused.f1 | 0.7561 | 0.7513 | 0.3607 | 0.3382 |
illocutions/general.f1 | 0.9569 | 0.9568 | 0.8464 | 0.8482 |
illocutions/focused.f1 | 0.8433 | 0.8430 | 0.6965 | 0.6944 |
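A rough sketch of the post-processing described above, assuming predicted S-relations come as (source, target, label) tuples and that reversed relations carry a `-rev` suffix (as in the label tables below); the actual conversion code may represent this differently:

```python
# Sketch: collapse reversed S-relations and normalize the S-node types.
# Assumes relations are (source, target, label) tuples and reversed
# predictions are marked with a "-rev" suffix.
def normalize_s_relations(relations):
    normalized = set()
    for source, target, label in relations:
        if label.endswith("-rev"):
            # Map e.g. "Default Inference-rev" back to "Default Inference"
            # and restore the original direction by swapping the arguments.
            source, target = target, source
            label = label[: -len("-rev")]
        normalized.add((source, target, label))
    # Using a set drops the duplicate edge when both the forward and the
    # reversed variant were predicted for the same node pair.
    return normalized

relations = [
    ("P1", "P2", "Default Inference"),
    ("P2", "P1", "Default Inference-rev"),  # duplicate of the edge above
]
print(normalize_s_relations(relations))  # {("P1", "P2", "Default Inference")}
```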
 | s_nodes:Default Conflict | s_nodes:Default Inference | s_nodes:Default Inference-rev | s_nodes:Default Rephrase | s_nodes:NONE | ya_i2l_nodes:Agreeing | ya_i2l_nodes:Arguing | ya_i2l_nodes:Asserting | ya_i2l_nodes:Assertive Questioning | ya_i2l_nodes:Challenging | ya_i2l_nodes:Default Illocuting | ya_i2l_nodes:NONE | ya_i2l_nodes:Pure Questioning | ya_i2l_nodes:Restating | ya_i2l_nodes:Rhetorical Questioning | ya_s2ta_nodes:Agreeing | ya_s2ta_nodes:Arguing | ya_s2ta_nodes:Asserting | ya_s2ta_nodes:Challenging | ya_s2ta_nodes:Default Illocuting | ya_s2ta_nodes:Disagreeing | ya_s2ta_nodes:NONE | ya_s2ta_nodes:Pure Questioning | ya_s2ta_nodes:Restating | ya_s2ta_nodes:Rhetorical Questioning |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
available | 740 | 1949 | 1781 | 3624 | 8070 | 8 | 4 | 15587 | 203 | 23 | 21 | 309 | 940 | 5 | 192 | 16 | 3609 | 14 | 33 | 451 | 694 | 9202 | 7 | 3187 | 1 |
used | 740 | 1949 | 1781 | 3624 | 8070 | 8 | 4 | 15587 | 203 | 23 | 21 | 309 | 940 | 5 | 192 | 16 | 3609 | 14 | 33 | 451 | 694 | 9202 | 7 | 3187 | 1 |
 | s_nodes:Default Conflict | s_nodes:Default Inference | s_nodes:Default Inference-rev | s_nodes:Default Rephrase | s_nodes:NONE | ya_i2l_nodes:Agreeing | ya_i2l_nodes:Asserting | ya_i2l_nodes:Assertive Questioning | ya_i2l_nodes:Challenging | ya_i2l_nodes:Default Illocuting | ya_i2l_nodes:NONE | ya_i2l_nodes:Pure Questioning | ya_i2l_nodes:Rhetorical Questioning | ya_s2ta_nodes:Agreeing | ya_s2ta_nodes:Arguing | ya_s2ta_nodes:Challenging | ya_s2ta_nodes:Default Illocuting | ya_s2ta_nodes:Disagreeing | ya_s2ta_nodes:NONE | ya_s2ta_nodes:Restating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
available | 92 | 246 | 211 | 447 | 862 | 2 | 1795 | 14 | 1 | 6 | 22 | 120 | 13 | 3 | 444 | 2 | 53 | 90 | 1005 | 385 |
used | 92 | 246 | 211 | 447 | 862 | 2 | 1795 | 14 | 1 | 6 | 22 | 120 | 13 | 3 | 444 | 2 | 53 | 90 | 1005 | 385 |
Merged relations
Results for training `bert-base-uncased` and `roberta-large` on the DialAM data with the experiment config `dialam2024_merged_relations` (https://github.com/ArneBinder/dialam-2024-shared-task/issues/21). Summary of the evaluation on the validation set (averaged across 3 runs):
Note that in these experiments we modified only the `task_learning_rate` and used the default `learning_rate` for the underlying model. Here are the results of base-model learning rate optimization for `roberta-large` with task learning rate 1e-4:
with task learning rate 1e-4:Note that we can also modify
taskmodule.max_window
which is set to 512 be default, below are the experimental results withroberta-large
andtask_learning_rate=1e-4
(averaged across 3 runs):W&B project for merged relations
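For context on `taskmodule.max_window`: it presumably caps the token length of each encoded window. A generic sketch of slicing a long token sequence into such windows (purely illustrative; the actual taskmodule may slice around relation candidates or use overlap):

```python
# Generic sketch of windowing a long token sequence; the real taskmodule
# may partition the dialogue differently.
def split_into_windows(token_ids, max_window=512, stride=None):
    stride = stride or max_window  # non-overlapping windows by default
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start : start + max_window])
        if start + max_window >= len(token_ids):
            break
    return windows

tokens = list(range(1000))
print([len(w) for w in split_into_windows(tokens, max_window=512)])  # [512, 488]
```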
Single relations
Results for training `roberta-large` on the DialAM data with the experiment configs for S and YA relations (https://github.com/ArneBinder/dialam-2024-shared-task/issues/20). Summary of the evaluation on the validation set (averaged across 3 runs):
W&B project for single relations