ArneBinder / dialam-2024-shared-task

see http://dialam.arg.tech/

first training results #26

Closed tanikina closed 2 months ago

tanikina commented 2 months ago

Merged relations

Results for training bert-base-uncased and roberta-large on the DialAM data with the experiment config dialam2024_merged_relations (https://github.com/ArneBinder/dialam-2024-shared-task/issues/21).

Summary of the evaluation on the validation set (averaged across 3 runs):

| model | task lr | avg macro-f1 | avg micro-f1 |
|---|---|---|---|
| bert-base-uncased | 1e-3 | 0.282 | 0.656 |
| bert-base-uncased | 1e-4 | 0.276 | 0.649 |
| bert-base-uncased | 2e-5 | 0.271 | 0.649 |
| roberta-large | 1e-3 | 0.374 | 0.727 |
| roberta-large | 1e-4 | 0.375 | 0.720 |
| roberta-large | 2e-5 | 0.360 | 0.715 |
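
For reference, the avg macro-f1 and avg micro-f1 columns are the macro-/micro-averaged F1 over all relation labels, averaged over the three runs. A minimal sketch of that aggregation (the use of scikit-learn and the data layout are assumptions, not necessarily the metric code used in this repo):

```python
# Minimal sketch of the reported aggregation: per-run macro/micro F1,
# then the mean over runs. scikit-learn and the data layout are assumptions.
from statistics import mean
from sklearn.metrics import f1_score

def average_f1(runs):
    """runs: list of (gold_labels, predicted_labels) pairs, one per seed."""
    avg_macro = mean(f1_score(gold, pred, average="macro") for gold, pred in runs)
    avg_micro = mean(f1_score(gold, pred, average="micro") for gold, pred in runs)
    return avg_macro, avg_micro

# Dummy example with three runs over a small label set:
runs = [
    (["A", "B", "C", "A"], ["A", "B", "B", "A"]),
    (["A", "B", "C", "A"], ["A", "C", "C", "A"]),
    (["A", "B", "C", "A"], ["A", "B", "C", "B"]),
]
avg_macro, avg_micro = average_f1(runs)
print(f"avg macro-f1: {avg_macro:.3f}, avg micro-f1: {avg_micro:.3f}")
```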

Note that in these experiments we modified only the task_learning_rate and used the default learning_rate for the underlying model.
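
The two learning rates correspond to the usual split into a pretrained encoder (learning_rate) and a freshly initialized task head (task_learning_rate), realized as separate optimizer parameter groups. A minimal sketch of that setup; the class and attribute names, the AdamW choice and the default values are assumptions for illustration, not the actual model code:

```python
# Sketch: separate learning rates for the pretrained encoder ("learning_rate")
# and the freshly initialized task head ("task_learning_rate").
# Class/attribute names and default values are assumptions for illustration.
import torch
from torch import nn
from transformers import AutoModel

class RelationClassifier(nn.Module):
    def __init__(self, model_name: str, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

def build_optimizer(
    model: RelationClassifier,
    learning_rate: float = 1e-5,       # base-model lr
    task_learning_rate: float = 1e-4,  # task-head lr
) -> torch.optim.Optimizer:
    # Two parameter groups: the encoder keeps the smaller base learning rate,
    # the classification head gets the larger task learning rate.
    return torch.optim.AdamW(
        [
            {"params": model.encoder.parameters(), "lr": learning_rate},
            {"params": model.head.parameters(), "lr": task_learning_rate},
        ]
    )

model = RelationClassifier("roberta-large", num_labels=25)  # e.g. merged label set
optimizer = build_optimizer(model)
```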

Here are the results of base-model learning rate optimization for roberta-large with task learning rate 1e-4:

| model | model lr | avg macro-f1 | avg micro-f1 |
|---|---|---|---|
| roberta-large | 1e-6 | 0.217 | 0.496 |
| roberta-large | 2e-6 | 0.331 | 0.678 |
| roberta-large | 1e-5 | 0.375 | 0.720 |
| roberta-large | 2e-5 | 0.223 | 0.638 |
| roberta-large | 1e-4 | 0.160 | 0.591 |
| roberta-large | 2e-4 | 0.042 | 0.433 |

Note that we can also modify taskmodule.max_window, which is set to 512 by default. Below are the experimental results with roberta-large and task_learning_rate=1e-4 (averaged across 3 runs); a sketch of what max_window controls follows the table.

| model | max_window | avg macro-f1 | avg micro-f1 |
|---|---|---|---|
| roberta-large | 512 | 0.375 | 0.720 |
| roberta-large | 256 | 0.378 | 0.718 |
| roberta-large | 128 | 0.400 | 0.719 |
| roberta-large | 64 | 0.353 | 0.733 |
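
In the sketch below, taskmodule.max_window is treated as a cap on the token length of the encoded input; the real taskmodule may select and center the window around the candidate nodes differently, so this is only an approximation of its effect:

```python
# Sketch: truncating the tokenized input to max_window tokens. This is an
# approximation of what taskmodule.max_window controls; the real taskmodule
# may select and center the window around the candidate nodes differently.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")

def encode_with_window(text: str, max_window: int = 128):
    return tokenizer(
        text,
        truncation=True,        # drop tokens beyond the window
        max_length=max_window,  # plays the role of taskmodule.max_window here
        return_tensors="pt",
    )

encoding = encode_with_window("Some long dialogue context ... " * 100, max_window=128)
print(encoding["input_ids"].shape)  # at most (1, 128)
```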

W&B project for merged relations

Single relations

Results for training roberta-large on the DialAM data with the experiment configs for S and YA relations (https://github.com/ArneBinder/dialam-2024-shared-task/issues/20).

Summary of the evaluation on the validation set (averaged across 3 runs):

| config | model lr | avg macro-f1 | avg micro-f1 |
|---|---|---|---|
| dialam2024_s | 1e-4 | 0.393 | 0.469 |
| dialam2024_ya_s2ta | 1e-4 | 0.266 | 0.484 |
| dialam2024_ya_i2l | 1e-4 | 0.357 | 0.960 |

W&B project for single relations

tanikina commented 2 months ago

Merged relations

Results for training RoBERTa, DeBERTa, ELECTRA and RemBERT on the DialAM data, using the fixed validation split, with the experiment config dialam2024_merged_relations and task_learning_rate=1e-4 (https://github.com/ArneBinder/dialam-2024-shared-task/issues/21).

Summary of the evaluation on the validation set (averaged across 3 runs):

| model | max_window | avg macro-f1 | avg micro-f1 |
|---|---|---|---|
| roberta-large | 512 | 0.395 | 0.712 |
| roberta-large | 128 | 0.380 | 0.713 |
| deberta-large | 512 | 0.412 | 0.715 |
| deberta-large | 128 | 0.386 | 0.719 |
| rembert | 512 | 0.387 | 0.699 |
| electra | 512 | 0.303 | 0.630 |

W&B project

After running the conversion script and src/evaluation/eval_official.py, we get the following results for the DeBERTa and RoBERTa models:

| metric | deberta-train | roberta-train | deberta-val | roberta-val |
|---|---|---|---|---|
| arguments/general.f1 | 0.8001 | 0.7989 | 0.6729 | 0.6508 |
| arguments/focused.f1 | 0.5558 | 0.5542 | 0.4164 | 0.3885 |
| illocutions/general.f1 | 0.9569 | 0.9568 | 0.8464 | 0.8482 |
| illocutions/focused.f1 | 0.8433 | 0.8430 | 0.6965 | 0.6944 |

Evaluation results for the same models after removing the double edges for reversed S-relations and updating the S-node types (a sketch of this post-processing follows the table). The scores for illocutions remained the same. However, the scores for arguments changed: they increased on the training set (as expected) but, unfortunately, decreased on the validation set:

| metric | deberta-train | roberta-train | deberta-val | roberta-val |
|---|---|---|---|---|
| arguments/general.f1 | 0.9154 | 0.9123 | 0.5888 | 0.5723 |
| arguments/focused.f1 | 0.7561 | 0.7513 | 0.3607 | 0.3382 |
| illocutions/general.f1 | 0.9569 | 0.9568 | 0.8464 | 0.8482 |
| illocutions/focused.f1 | 0.8433 | 0.8430 | 0.6965 | 0.6944 |
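
For reference, a minimal sketch of the kind of post-processing described above: dropping the second edge of each reversed S-relation pair and mapping the "-rev" label back to its canonical S-node type. The representation as plain (source, target, label) triples is an assumption for illustration, not the repository's actual graph format:

```python
# Sketch: remove duplicate edges for reversed S-relations and normalize the
# S-node type. Relations are represented here as (source, target, label)
# triples; the real node/edge structures in the repo differ.
from typing import List, Tuple

Relation = Tuple[str, str, str]  # (source_id, target_id, label)

def normalize_s_relations(relations: List[Relation]) -> List[Relation]:
    result = []
    seen = set()
    for source, target, label in relations:
        if label.endswith("-rev"):
            # A reversed relation duplicates an existing forward relation:
            # flip its direction and strip the "-rev" suffix.
            source, target, label = target, source, label[: -len("-rev")]
        key = (source, target, label)
        if key not in seen:  # keep only one edge per (direction, type)
            seen.add(key)
            result.append((source, target, label))
    return result

relations = [
    ("I1", "I2", "Default Inference"),
    ("I2", "I1", "Default Inference-rev"),  # duplicate of the edge above
    ("I3", "I4", "Default Rephrase"),
]
print(normalize_s_relations(relations))
# [('I1', 'I2', 'Default Inference'), ('I3', 'I4', 'Default Rephrase')]
```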
ArneBinder commented 2 months ago

statistics about labels

train

| label | available | used |
|---|---|---|
| s_nodes:Default Conflict | 740 | 740 |
| s_nodes:Default Inference | 1949 | 1949 |
| s_nodes:Default Inference-rev | 1781 | 1781 |
| s_nodes:Default Rephrase | 3624 | 3624 |
| s_nodes:NONE | 8070 | 8070 |
| ya_i2l_nodes:Agreeing | 8 | 8 |
| ya_i2l_nodes:Arguing | 4 | 4 |
| ya_i2l_nodes:Asserting | 15587 | 15587 |
| ya_i2l_nodes:Assertive Questioning | 203 | 203 |
| ya_i2l_nodes:Challenging | 23 | 23 |
| ya_i2l_nodes:Default Illocuting | 21 | 21 |
| ya_i2l_nodes:NONE | 309 | 309 |
| ya_i2l_nodes:Pure Questioning | 940 | 940 |
| ya_i2l_nodes:Restating | 5 | 5 |
| ya_i2l_nodes:Rhetorical Questioning | 192 | 192 |
| ya_s2ta_nodes:Agreeing | 16 | 16 |
| ya_s2ta_nodes:Arguing | 3609 | 3609 |
| ya_s2ta_nodes:Asserting | 14 | 14 |
| ya_s2ta_nodes:Challenging | 33 | 33 |
| ya_s2ta_nodes:Default Illocuting | 451 | 451 |
| ya_s2ta_nodes:Disagreeing | 694 | 694 |
| ya_s2ta_nodes:NONE | 9202 | 9202 |
| ya_s2ta_nodes:Pure Questioning | 7 | 7 |
| ya_s2ta_nodes:Restating | 3187 | 3187 |
| ya_s2ta_nodes:Rhetorical Questioning | 1 | 1 |

validation

| label | available | used |
|---|---|---|
| s_nodes:Default Conflict | 92 | 92 |
| s_nodes:Default Inference | 246 | 246 |
| s_nodes:Default Inference-rev | 211 | 211 |
| s_nodes:Default Rephrase | 447 | 447 |
| s_nodes:NONE | 862 | 862 |
| ya_i2l_nodes:Agreeing | 2 | 2 |
| ya_i2l_nodes:Asserting | 1795 | 1795 |
| ya_i2l_nodes:Assertive Questioning | 14 | 14 |
| ya_i2l_nodes:Challenging | 1 | 1 |
| ya_i2l_nodes:Default Illocuting | 6 | 6 |
| ya_i2l_nodes:NONE | 22 | 22 |
| ya_i2l_nodes:Pure Questioning | 120 | 120 |
| ya_i2l_nodes:Rhetorical Questioning | 13 | 13 |
| ya_s2ta_nodes:Agreeing | 3 | 3 |
| ya_s2ta_nodes:Arguing | 444 | 444 |
| ya_s2ta_nodes:Challenging | 2 | 2 |
| ya_s2ta_nodes:Default Illocuting | 53 | 53 |
| ya_s2ta_nodes:Disagreeing | 90 | 90 |
| ya_s2ta_nodes:NONE | 1005 | 1005 |
| ya_s2ta_nodes:Restating | 385 | 385 |
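
For completeness, a minimal sketch of how such per-label counts can be gathered from the converted documents; the Relation dataclass, its attribute names and the use of collections.Counter are assumptions for illustration, not the repository's actual statistics code:

```python
# Sketch: counting relation labels per layer. The data structures below are
# placeholders; the real counts come from the taskmodule's document conversion.
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class Relation:  # hypothetical, for illustration only
    layer: str   # e.g. "s_nodes", "ya_i2l_nodes", "ya_s2ta_nodes"
    label: str   # e.g. "Default Inference", "NONE"

def count_labels(relations: List[Relation]) -> Counter:
    return Counter(f"{r.layer}:{r.label}" for r in relations)

relations = [
    Relation("s_nodes", "Default Inference"),
    Relation("s_nodes", "NONE"),
    Relation("ya_i2l_nodes", "Asserting"),
    Relation("s_nodes", "NONE"),
]
for label, count in sorted(count_labels(relations).items()):
    print(label, count)
```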