ArneBinder / dialam-2024-shared-task

see http://dialam.arg.tech/

log update (results for extra models, weighted loss and augmented data) #37

Closed: tanikina closed this 1 month ago

tanikina commented 2 months ago

This adds the last batch of experiments with the `dialam2024_merged_relations` config. The current best-performing model is `microsoft/deberta-v3-large`. Below are the evaluation results on the validation set for the different settings (including weighted loss, training with only the 20 most frequent classes, etc.); a sketch of the weighted-loss idea follows the table:

| model | run | macro-f1 | micro-f1 | args-general-f1 | args-focused-f1 | illoc-general-f1 | illoc-focused-f1 |
|:--|:--|--:|--:|--:|--:|--:|--:|
| BART | W&B run | 0.384 | 0.704 | 0.568 | 0.335 | 0.852 | 0.702 |
| XLNet | W&B run | 0.390 | 0.706 | 0.527 | 0.294 | 0.856 | 0.703 |
| DeBERTa | W&B run | 0.428 | 0.712 | 0.589 | 0.361 | 0.846 | 0.697 |
| DeBERTa with weighted loss | W&B run | 0.417 | 0.719 | 0.587 | 0.363 | 0.856 | 0.700 |
| DeBERTa with only frequent classes (20 vs. 26 classes in total) | W&B run | 0.432 | 0.724 | 0.584 | 0.349 | 0.857 | 0.701 |
| DeBERTa with augmented data (paraphrased L-node texts) | W&B run | 0.403 | 0.711 | 0.564 | 0.335 | 0.859 | 0.707 |
| DeBERTa-v3 | W&B run | 0.431 | 0.723 | 0.601 | 0.363 | 0.851 | 0.704 |
| DeBERTa-v3, original + EDA-augmented data (combined train set) | W&B run | 0.399 | 0.716 | 0.585 | 0.364 | 0.850 | 0.698 |
| DeBERTa-v3, additionally fine-tuned on EDA-augmented data | W&B run | 0.398 | 0.715 | 0.571 | 0.332 | 0.856 | 0.700 |
| DeBERTa-v3 with only "officially blacklisted" nodesets | W&B run | 0.449 | 0.715 | 0.557 | 0.323 | 0.852 | 0.707 |
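For context on the weighted-loss rows: a minimal sketch of how inverse-frequency class weights for a cross-entropy loss could look (an illustration of the general technique, not necessarily the exact training code; `train_label_ids` is a hypothetical name). EDA in the rows above refers to Easy Data Augmentation (synonym replacement, random insertion, random swap, random deletion).

```python
from collections import Counter

import torch


def class_weights_from_labels(labels: list[int], num_classes: int) -> torch.Tensor:
    """Inverse-frequency class weights, normalized to an average weight of 1."""
    counts = Counter(labels)
    # clamp counts to 1 so classes absent from the training split do not divide by zero
    freqs = torch.tensor([max(counts.get(c, 0), 1) for c in range(num_classes)], dtype=torch.float)
    weights = 1.0 / freqs
    return weights * num_classes / weights.sum()


# hypothetical usage with the 26 relation classes of the merged-relations setup:
# loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights_from_labels(train_label_ids, num_classes=26))
```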
ArneBinder commented 1 month ago

Per-class results (F1) for the best model (DeBERTa-v3):

<details>
<summary>show code</summary>

```python
from io import StringIO

import pandas as pd

# taken from the W&B run:
# https://wandb.ai/tanikina/dialam2024_merged_relations-re_text_classification_with_indices-training/runs/z46l0wk9/overview
train_data = {
    "metric/ya_i2l_nodes:Challenging/f1/train": 1,
    "epoch": 20,
    "_runtime": 17006.133455753326,
    "loss/train_epoch": 0.000016656018487992696,
    "trainer/global_step": 126820,
    "metric/no_relation/f1/train": 0,
    "metric/ya_i2l_nodes:Asserting/f1/val": 0.9877777695655824,
    "metric/ya_s2ta_nodes:Pure Questioning/f1/val": 0,
    "metric/ya_i2l_nodes:Pure Questioning/f1/train": 1,
    "loss/val": 2.0919082164764404,
    "_timestamp": 1714963625.0814097,
    "metric/ya_s2ta_nodes:Rhetorical Questioning/f1/val": 0,
    "metric/s_nodes:NONE/f1/val": 0.6990595459938049,
    "metric/ya_i2l_nodes:Restating/f1/val": 0,
    "metric/ya_i2l_nodes:Pure Questioning/f1/val": 0.8068669438362122,
    "metric/ya_i2l_nodes:Default Illocuting/f1/train": 1,
    "metric/ya_i2l_nodes:NONE/f1/val": 0.6341463327407837,
    "metric/ya_i2l_nodes:Arguing/f1/train": 1,
    "metric/ya_i2l_nodes:Agreeing/f1/val": 0,
    "metric/ya_s2ta_nodes:Arguing/f1/val": 0.4791666567325592,
    "metric/ya_i2l_nodes:Default Illocuting/f1/val": 0,
    "metric/ya_i2l_nodes:Rhetorical Questioning/f1/train": 1,
    "metric/macro/f1/val": 0.4311785995960235,
    "metric/micro/f1/train": 1,
    "metric/ya_s2ta_nodes:Default Illocuting/f1/val": 0.6185566782951355,
    "metric/ya_s2ta_nodes:Pure Questioning/f1/train": 1,
    "metric/micro/f1/val": 0.7233786582946777,
    "metric/s_nodes:Default Inference-rev/f1/val": 0.3452380895614624,
    "metric/ya_i2l_nodes:Challenging/f1/val": 0.5,
    "metric/ya_s2ta_nodes:Asserting/f1/train": 1,
    "metric/s_nodes:Default Inference-rev/f1/train": 1,
    "metric/ya_i2l_nodes:Assertive Questioning/f1/val": 0.3448275923728943,
    "metric/ya_s2ta_nodes:Default Illocuting/f1/train": 1,
    "metric/s_nodes:Default Rephrase/f1/val": 0.5588972568511963,
    "metric/ya_i2l_nodes:Asserting/f1/train": 1,
    "metric/no_relation/f1/val": 0,
    "metric/ya_s2ta_nodes:Arguing/f1/train": 1,
    "metric/ya_s2ta_nodes:Disagreeing/f1/val": 0.3174603283405304,
    "metric/ya_s2ta_nodes:Challenging/f1/val": 0,
    "metric/ya_s2ta_nodes:NONE/f1/val": 0.722634494304657,
    "metric/ya_i2l_nodes:Agreeing/f1/train": 1,
    "metric/s_nodes:Default Conflict/f1/val": 0.3382352888584137,
    "metric/ya_i2l_nodes:NONE/f1/train": 1,
    "metric/ya_i2l_nodes:Arguing/f1/val": 0,
    "loss/train_step": 0.000003531548372848192,
    "metric/ya_s2ta_nodes:Asserting/f1/val": 0,
    "metric/ya_i2l_nodes:Restating/f1/train": 1,
    "metric/s_nodes:Default Inference/f1/train": 1,
    "metric/ya_i2l_nodes:Rhetorical Questioning/f1/val": 0.32258063554763794,
    "metric/s_nodes:NONE/f1/train": 1,
    "metric/ya_s2ta_nodes:NONE/f1/train": 1,
    "metric/ya_s2ta_nodes:Rhetorical Questioning/f1/train": 1,
    "metric/ya_s2ta_nodes:Agreeing/f1/val": 0,
    "metric/ya_s2ta_nodes:Restating/f1/val": 0.5195530652999878,
    "metric/ya_s2ta_nodes:Agreeing/f1/train": 1,
    "metric/s_nodes:Default Inference/f1/val": 0.4285714328289032,
    "metric/ya_s2ta_nodes:Restating/f1/train": 1,
    "metric/s_nodes:Default Conflict/f1/train": 1,
    "metric/ya_s2ta_nodes:Challenging/f1/train": 1,
    "metric/ya_s2ta_nodes:Disagreeing/f1/train": 1,
    "_step": 2576,
    "_wandb.runtime": 17006,
    "metric/ya_i2l_nodes:Assertive Questioning/f1/train": 1,
    "metric/macro/f1/train": 1,
    "metric/s_nodes:Default Rephrase/f1/train": 1,
}

# taken from here:
# https://github.com/ArneBinder/dialam-2024-shared-task/pull/26#issuecomment-2089081702
val_support_data = """
| | s_nodes:Default Conflict | s_nodes:Default Inference | s_nodes:Default Inference-rev | s_nodes:Default Rephrase | s_nodes:NONE | ya_i2l_nodes:Agreeing | ya_i2l_nodes:Asserting | ya_i2l_nodes:Assertive Questioning | ya_i2l_nodes:Challenging | ya_i2l_nodes:Default Illocuting | ya_i2l_nodes:NONE | ya_i2l_nodes:Pure Questioning | ya_i2l_nodes:Rhetorical Questioning | ya_s2ta_nodes:Agreeing | ya_s2ta_nodes:Arguing | ya_s2ta_nodes:Challenging | ya_s2ta_nodes:Default Illocuting | ya_s2ta_nodes:Disagreeing | ya_s2ta_nodes:NONE | ya_s2ta_nodes:Restating |
|:----------|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| available | 92 | 246 | 211 | 447 | 862 | 2 | 1795 | 14 | 1 | 6 | 22 | 120 | 13 | 3 | 444 | 2 | 53 | 90 | 1005 | 385 |
| used | 92 | 246 | 211 | 447 | 862 | 2 | 1795 | 14 | 1 | 6 | 22 | 120 | 13 | 3 | 444 | 2 | 53 | 90 | 1005 | 385 |
"""


def read_markdown_table(md_table_string: str) -> pd.DataFrame:
    # index_col=1 because the leading "|" yields an empty first column;
    # dropna removes the all-NaN columns from the leading/trailing pipes,
    # and iloc[1:] drops the "---" separator row of the markdown table
    return pd.read_csv(
        StringIO(md_table_string),
        sep=r'\s*\|\s*',
        index_col=1,
        engine='python',
    ).dropna(axis=1, how='all').iloc[1:]


def plot_with_matplotlib(df: pd.DataFrame):
    from matplotlib import pyplot as plt

    df.sort_values("support").plot(backend="matplotlib", kind="bar", secondary_y=["support"])
    plt.show()


def plot_with_plotly(df: pd.DataFrame):
    import plotly.graph_objects as go

    # two bar traces sharing the x-axis, with support on a secondary y-axis
    fig = go.Figure(
        data=[
            go.Bar(name=df["value"].name, x=df.index, y=df["value"], yaxis='y', offsetgroup=1, text=df["value"]),
            go.Bar(name=df["support"].name, x=df.index, y=df["support"], yaxis='y2', offsetgroup=2, text=df["support"]),
        ],
        layout={
            'yaxis': {'title': df["value"].name},
            'yaxis2': {'title': df["support"].name, 'overlaying': 'y', 'side': 'right'},
        },
    )
    fig.show()


if __name__ == "__main__":
    # keep only the per-class metric entries; keys split into
    # ("metric", label, "f1", split)
    metric_data = {tuple(k.split("/")): v for k, v in train_data.items() if k.startswith("metric/")}
    s_metric = pd.Series(metric_data)
    # drop the "metric" prefix and the "f1" level, leaving (label, split)
    s_metric.index = s_metric.index.droplevel(0).droplevel(1)
    df_metric = s_metric.unstack()
    s_metric_val = df_metric['val']
    s_metric_val.name = 'value'
    # get support from markdown table
    df_val_support = read_markdown_table(val_support_data)
    s_support_val = df_val_support.loc['used'].astype(int)
    s_support_val.name = 'support'
    result = pd.concat([s_metric_val, s_support_val], axis=1).fillna(0)
    result_sorted = result.sort_values("support")
    print(result_sorted.round(2).to_markdown())
    with_support = result_sorted[result_sorted["support"] > 0]
    plot_with_plotly(with_support.round(2))
```

</details>
|                                      | value | support |
|:-------------------------------------|------:|--------:|
| macro                                 |  0.43 |       0 |
| ya_s2ta_nodes:Pure Questioning        |  0    |       0 |
| ya_s2ta_nodes:Asserting               |  0    |       0 |
| ya_i2l_nodes:Restating                |  0    |       0 |
| ya_i2l_nodes:Arguing                  |  0    |       0 |
| ya_s2ta_nodes:Rhetorical Questioning  |  0    |       0 |
| micro                                 |  0.72 |       0 |
| no_relation                           |  0    |       0 |
| ya_i2l_nodes:Challenging              |  0.5  |       1 |
| ya_i2l_nodes:Agreeing                 |  0    |       2 |
| ya_s2ta_nodes:Challenging             |  0    |       2 |
| ya_s2ta_nodes:Agreeing                |  0    |       3 |
| ya_i2l_nodes:Default Illocuting       |  0    |       6 |
| ya_i2l_nodes:Rhetorical Questioning   |  0.32 |      13 |
| ya_i2l_nodes:Assertive Questioning    |  0.34 |      14 |
| ya_i2l_nodes:NONE                     |  0.63 |      22 |
| ya_s2ta_nodes:Default Illocuting      |  0.62 |      53 |
| ya_s2ta_nodes:Disagreeing             |  0.32 |      90 |
| s_nodes:Default Conflict              |  0.34 |      92 |
| ya_i2l_nodes:Pure Questioning         |  0.81 |     120 |
| s_nodes:Default Inference-rev         |  0.35 |     211 |
| s_nodes:Default Inference             |  0.43 |     246 |
| ya_s2ta_nodes:Restating               |  0.52 |     385 |
| ya_s2ta_nodes:Arguing                 |  0.48 |     444 |
| s_nodes:Default Rephrase              |  0.56 |     447 |
| s_nodes:NONE                          |  0.7  |     862 |
| ya_s2ta_nodes:NONE                    |  0.72 |    1005 |
| ya_i2l_nodes:Asserting                |  0.99 |    1795 |
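The large gap between micro-F1 (0.72) and macro-F1 (0.43) follows directly from the support skew above: macro-F1 averages per-class scores without weighting by support, so the rare classes with F1 = 0 pull it down, while `ya_i2l_nodes:Asserting` (support 1795) dominates the micro score. A toy illustration with scikit-learn (hypothetical labels, not the task data):

```python
from sklearn.metrics import f1_score

# nine instances of a frequent class predicted correctly, one rare instance missed
y_true = ["frequent"] * 9 + ["rare"]
y_pred = ["frequent"] * 10

print(f1_score(y_true, y_pred, average="micro"))  # 0.9: dominated by the frequent class
print(f1_score(y_true, y_pred, average="macro"))  # ~0.47: the rare class contributes F1 = 0
```

(scikit-learn warns that precision is ill-defined for the never-predicted rare class and scores it as 0.)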

*(Screenshot from 2024-05-17 20-35-33: grouped bar chart of per-class F1 with support on a secondary axis, produced by the script above.)*
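If plotly is not available, the matplotlib helper in the script produces a comparable chart; swapping it in is a one-line change at the end of the `__main__` block:

```python
# instead of: plot_with_plotly(with_support.round(2))
plot_with_matplotlib(with_support.round(2))
```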

ArneBinder commented 1 month ago

Full results on the validation data:

```
python src/evaluation/eval_official.py --gold_dir=data/train --predictions_dir=data/validation_annotated_deberta_v3 --mode=arguments
```

```
general.p: 0.6643397895535604
general.r: 0.5891417270425421
general.f1: 0.6005552985712661
focused.p: 0.49017500285485893
focused.r: 0.31741342856450766
focused.f1: 0.36328749501622953
```

```
python3 src/evaluation/eval_official.py --gold_dir=data/train --predictions_dir=data/validation_annotated_deberta_v3 --mode=illocutions
```

```
general.p: 0.8622892315729807
general.r: 0.847089799405591
general.f1: 0.8511310912354281
focused.p: 0.717489532623461
focused.r: 0.6981894061035216
focused.f1: 0.7034717513399968
```