ArneBinder / dialam-2024-shared-task

see http://dialam.arg.tech/

first training results #26

Closed tanikina closed 2 months ago

tanikina commented 2 months ago

Merged relations

Results for training bert-base-uncased and roberta-large on the DialAM data with the experiment config dialam2024_merged_relations (https://github.com/ArneBinder/dialam-2024-shared-task/issues/21).

Summary of the evaluation on the validation set (averaged across 3 runs):

| model | task lr | avg macro-f1 | avg micro-f1 |
|---|---|---|---|
| bert-base-uncased | 1e-3 | 0.282 | 0.656 |
| bert-base-uncased | 1e-4 | 0.276 | 0.649 |
| bert-base-uncased | 2e-5 | 0.271 | 0.649 |
| roberta-large | 1e-3 | 0.374 | 0.727 |
| roberta-large | 1e-4 | 0.375 | 0.720 |
| roberta-large | 2e-5 | 0.360 | 0.715 |
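
For reference, the avg macro-f1 and avg micro-f1 columns are the macro-/micro-averaged F1 over all relation labels, averaged over the three runs. A minimal sketch of that aggregation (the use of scikit-learn and the data layout are assumptions, not necessarily the metric code used in this repo):

```python
# Minimal sketch of the reported aggregation: per-run macro/micro F1,
# then the mean over runs. scikit-learn and the data layout are assumptions.
from statistics import mean
from sklearn.metrics import f1_score

def average_f1(runs):
    """runs: list of (gold_labels, predicted_labels) pairs, one per seed."""
    avg_macro = mean(f1_score(gold, pred, average="macro") for gold, pred in runs)
    avg_micro = mean(f1_score(gold, pred, average="micro") for gold, pred in runs)
    return avg_macro, avg_micro

# Dummy example with three runs over a small label set:
runs = [
    (["A", "B", "C", "A"], ["A", "B", "B", "A"]),
    (["A", "B", "C", "A"], ["A", "C", "C", "A"]),
    (["A", "B", "C", "A"], ["A", "B", "C", "B"]),
]
avg_macro, avg_micro = average_f1(runs)
print(f"avg macro-f1: {avg_macro:.3f}, avg micro-f1: {avg_micro:.3f}")
```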

Note that in these experiments we modified only the task_learning_rate and used the default learning_rate for the underlying model.
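
The two learning rates correspond to the usual split into a pretrained encoder (learning_rate) and a freshly initialized task head (task_learning_rate), realized as separate optimizer parameter groups. A minimal sketch of that setup; the class and attribute names, the AdamW choice and the default values are assumptions for illustration, not the actual model code:

```python
# Sketch: separate learning rates for the pretrained encoder ("learning_rate")
# and the freshly initialized task head ("task_learning_rate").
# Class/attribute names and default values are assumptions for illustration.
import torch
from torch import nn
from transformers import AutoModel

class RelationClassifier(nn.Module):
    def __init__(self, model_name: str, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

def build_optimizer(
    model: RelationClassifier,
    learning_rate: float = 1e-5,       # base-model lr
    task_learning_rate: float = 1e-4,  # task-head lr
) -> torch.optim.Optimizer:
    # Two parameter groups: the encoder keeps the smaller base learning rate,
    # the classification head gets the larger task learning rate.
    return torch.optim.AdamW(
        [
            {"params": model.encoder.parameters(), "lr": learning_rate},
            {"params": model.head.parameters(), "lr": task_learning_rate},
        ]
    )

model = RelationClassifier("roberta-large", num_labels=25)  # e.g. merged label set
optimizer = build_optimizer(model)
```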

Here are the results of base-model learning rate optimization for roberta-large with task learning rate 1e-4:

| model | model lr | avg macro-f1 | avg micro-f1 |
|---|---|---|---|
| roberta-large | 1e-6 | 0.217 | 0.496 |
| roberta-large | 2e-6 | 0.331 | 0.678 |
| roberta-large | 1e-5 | 0.375 | 0.720 |
| roberta-large | 2e-5 | 0.223 | 0.638 |
| roberta-large | 1e-4 | 0.160 | 0.591 |
| roberta-large | 2e-4 | 0.042 | 0.433 |

Note that we can also modify taskmodule.max_window, which is set to 512 by default. Below are the experimental results with roberta-large and task_learning_rate=1e-4 (averaged across 3 runs); a sketch of what max_window controls follows the table.

| model | max_window | avg macro-f1 | avg micro-f1 |
|---|---|---|---|
| roberta-large | 512 | 0.375 | 0.720 |
| roberta-large | 256 | 0.378 | 0.718 |
| roberta-large | 128 | 0.400 | 0.719 |
| roberta-large | 64 | 0.353 | 0.733 |
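
In the sketch below, taskmodule.max_window is treated as a cap on the token length of the encoded input; the real taskmodule may select and center the window around the candidate nodes differently, so this is only an approximation of its effect:

```python
# Sketch: truncating the tokenized input to max_window tokens. This is an
# approximation of what taskmodule.max_window controls; the real taskmodule
# may select and center the window around the candidate nodes differently.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")

def encode_with_window(text: str, max_window: int = 128):
    return tokenizer(
        text,
        truncation=True,        # drop tokens beyond the window
        max_length=max_window,  # plays the role of taskmodule.max_window here
        return_tensors="pt",
    )

encoding = encode_with_window("Some long dialogue context ... " * 100, max_window=128)
print(encoding["input_ids"].shape)  # at most (1, 128)
```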

W&B project for merged relations

Single relations

Results for training roberta-large on the DialAM data with the experiment configs for S and YA relations (https://github.com/ArneBinder/dialam-2024-shared-task/issues/20).

Summary of the evaluation on the validation set (averaged across 3 runs):

| config | model lr | avg macro-f1 | avg micro-f1 |
|---|---|---|---|
| dialam2024_s | 1e-4 | 0.393 | 0.469 |
| dialam2024_ya_s2ta | 1e-4 | 0.266 | 0.484 |
| dialam2024_ya_i2l | 1e-4 | 0.357 | 0.960 |

W&B project for single relations

tanikina commented 2 months ago

Merged relations

Results for training RoBERTa, DeBERTa, ELECTRA and RemBERT on the DialAM data, using the fixed validation split, with the experiment config dialam2024_merged_relations and task_learning_rate=1e-4 (https://github.com/ArneBinder/dialam-2024-shared-task/issues/21).

Summary of the evaluation on the validation set (averaged across 3 runs):

| model | max_window | avg macro-f1 | avg micro-f1 |
|---|---|---|---|
| roberta-large | 512 | 0.395 | 0.712 |
| roberta-large | 128 | 0.380 | 0.713 |
| deberta-large | 512 | 0.412 | 0.715 |
| deberta-large | 128 | 0.386 | 0.719 |
| rembert | 512 | 0.387 | 0.699 |
| electra | 512 | 0.303 | 0.630 |

W&B project

After running the conversion script and src/evaluation/eval_official.py, we get the following results for the DeBERTa and RoBERTa models:

| metric | deberta-train | roberta-train | deberta-val | roberta-val |
|---|---|---|---|---|
| arguments/general.f1 | 0.8001 | 0.7989 | 0.6729 | 0.6508 |
| arguments/focused.f1 | 0.5558 | 0.5542 | 0.4164 | 0.3885 |
| illocutions/general.f1 | 0.9569 | 0.9568 | 0.8464 | 0.8482 |
| illocutions/focused.f1 | 0.8433 | 0.8430 | 0.6965 | 0.6944 |

Evaluation results for the same models after removing the double edges for reversed S-relations and updating the S-node types (a sketch of this post-processing follows the table). The scores for illocutions remained the same. However, the scores for arguments changed: they increased on the training set (as expected) but, unfortunately, decreased on the validation set:

| metric | deberta-train | roberta-train | deberta-val | roberta-val |
|---|---|---|---|---|
| arguments/general.f1 | 0.9154 | 0.9123 | 0.5888 | 0.5723 |
| arguments/focused.f1 | 0.7561 | 0.7513 | 0.3607 | 0.3382 |
| illocutions/general.f1 | 0.9569 | 0.9568 | 0.8464 | 0.8482 |
| illocutions/focused.f1 | 0.8433 | 0.8430 | 0.6965 | 0.6944 |
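
For reference, a minimal sketch of the kind of post-processing described above: dropping the second edge of each reversed S-relation pair and mapping the "-rev" label back to its canonical S-node type. The representation as plain (source, target, label) triples is an assumption for illustration, not the repository's actual graph format:

```python
# Sketch: remove duplicate edges for reversed S-relations and normalize the
# S-node type. Relations are represented here as (source, target, label)
# triples; the real node/edge structures in the repo differ.
from typing import List, Tuple

Relation = Tuple[str, str, str]  # (source_id, target_id, label)

def normalize_s_relations(relations: List[Relation]) -> List[Relation]:
    result = []
    seen = set()
    for source, target, label in relations:
        if label.endswith("-rev"):
            # A reversed relation duplicates an existing forward relation:
            # flip its direction and strip the "-rev" suffix.
            source, target, label = target, source, label[: -len("-rev")]
        key = (source, target, label)
        if key not in seen:  # keep only one edge per (direction, type)
            seen.add(key)
            result.append((source, target, label))
    return result

relations = [
    ("I1", "I2", "Default Inference"),
    ("I2", "I1", "Default Inference-rev"),  # duplicate of the edge above
    ("I3", "I4", "Default Rephrase"),
]
print(normalize_s_relations(relations))
# [('I1', 'I2', 'Default Inference'), ('I3', 'I4', 'Default Rephrase')]
```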
ArneBinder commented 2 months ago

statistics about labels

train

| label | available | used |
|---|---|---|
| s_nodes:Default Conflict | 740 | 740 |
| s_nodes:Default Inference | 1949 | 1949 |
| s_nodes:Default Inference-rev | 1781 | 1781 |
| s_nodes:Default Rephrase | 3624 | 3624 |
| s_nodes:NONE | 8070 | 8070 |
| ya_i2l_nodes:Agreeing | 8 | 8 |
| ya_i2l_nodes:Arguing | 4 | 4 |
| ya_i2l_nodes:Asserting | 15587 | 15587 |
| ya_i2l_nodes:Assertive Questioning | 203 | 203 |
| ya_i2l_nodes:Challenging | 23 | 23 |
| ya_i2l_nodes:Default Illocuting | 21 | 21 |
| ya_i2l_nodes:NONE | 309 | 309 |
| ya_i2l_nodes:Pure Questioning | 940 | 940 |
| ya_i2l_nodes:Restating | 5 | 5 |
| ya_i2l_nodes:Rhetorical Questioning | 192 | 192 |
| ya_s2ta_nodes:Agreeing | 16 | 16 |
| ya_s2ta_nodes:Arguing | 3609 | 3609 |
| ya_s2ta_nodes:Asserting | 14 | 14 |
| ya_s2ta_nodes:Challenging | 33 | 33 |
| ya_s2ta_nodes:Default Illocuting | 451 | 451 |
| ya_s2ta_nodes:Disagreeing | 694 | 694 |
| ya_s2ta_nodes:NONE | 9202 | 9202 |
| ya_s2ta_nodes:Pure Questioning | 7 | 7 |
| ya_s2ta_nodes:Restating | 3187 | 3187 |
| ya_s2ta_nodes:Rhetorical Questioning | 1 | 1 |

validation

| label | available | used |
|---|---|---|
| s_nodes:Default Conflict | 92 | 92 |
| s_nodes:Default Inference | 246 | 246 |
| s_nodes:Default Inference-rev | 211 | 211 |
| s_nodes:Default Rephrase | 447 | 447 |
| s_nodes:NONE | 862 | 862 |
| ya_i2l_nodes:Agreeing | 2 | 2 |
| ya_i2l_nodes:Asserting | 1795 | 1795 |
| ya_i2l_nodes:Assertive Questioning | 14 | 14 |
| ya_i2l_nodes:Challenging | 1 | 1 |
| ya_i2l_nodes:Default Illocuting | 6 | 6 |
| ya_i2l_nodes:NONE | 22 | 22 |
| ya_i2l_nodes:Pure Questioning | 120 | 120 |
| ya_i2l_nodes:Rhetorical Questioning | 13 | 13 |
| ya_s2ta_nodes:Agreeing | 3 | 3 |
| ya_s2ta_nodes:Arguing | 444 | 444 |
| ya_s2ta_nodes:Challenging | 2 | 2 |
| ya_s2ta_nodes:Default Illocuting | 53 | 53 |
| ya_s2ta_nodes:Disagreeing | 90 | 90 |
| ya_s2ta_nodes:NONE | 1005 | 1005 |
| ya_s2ta_nodes:Restating | 385 | 385 |
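
For completeness, a minimal sketch of how such per-label counts can be gathered from the converted documents; the Relation dataclass, its attribute names and the use of collections.Counter are assumptions for illustration, not the repository's actual statistics code:

```python
# Sketch: counting relation labels per layer. The data structures below are
# placeholders; the real counts come from the taskmodule's document conversion.
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class Relation:  # hypothetical, for illustration only
    layer: str   # e.g. "s_nodes", "ya_i2l_nodes", "ya_s2ta_nodes"
    label: str   # e.g. "Default Inference", "NONE"

def count_labels(relations: List[Relation]) -> Counter:
    return Counter(f"{r.layer}:{r.label}" for r in relations)

relations = [
    Relation("s_nodes", "Default Inference"),
    Relation("s_nodes", "NONE"),
    Relation("ya_i2l_nodes", "Asserting"),
    Relation("s_nodes", "NONE"),
]
for label, count in sorted(count_labels(relations).items()):
    print(label, count)
```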