This repository contains the code for our submission to the DialAM-2024 Shared Task, as described in the paper DFKI-MLST at DialAM-2024 Shared Task: System Description (Binder et al., ArgMining 2024) and the poster presented at the ArgMining 2024 workshop, co-located with ACL 2024 in Bangkok, Thailand. The task was part of the workshop and focused on the identification of argumentative and illocutionary relations in dialogue. See the official website for more information.
We present the dfki-mlst submission for the DialAM shared task on identification of argumentative and illocutionary relations in dialogue. Our model achieves the best results in the global setting: 48.25 F1 at the focused level when looking only at the related arguments/locutions and 67.05 F1 at the general level when evaluating the complete argument maps. We describe our implementation of the data pre-processing pipeline, relation encoding and classification, evaluating 11 different base models and performing experiments with, e.g., node text combination and data augmentation. Our source code is publicly available.
Set up the environment as described in the Environment Setup section.
Train models with the configuration from the paper (this will execute 3 runs with different seeds):
python src/train.py \
experiment=dialam2024_merged_relations \
base_model_name=microsoft/deberta-v3-large \
model.task_learning_rate=1e-4 \
+model.classifier_dropout=0.1 \
datamodule.batch_size=8 \
trainer=gpu \
logger=none \
seed=1,2,3 \
+hydra.callbacks.save_job_return.integrate_multirun_result=true \
--multirun
Notes:
- logger=none is set because the default logger is Weights & Biases, which requires an account and API key. To use it, remove logger=none and provide your API key as WANDB_API_KEY in the .env file. For alternative logging options, see the configs in configs/logger/ which can be enabled by setting logger=LOGGER_CONFIG_NAME, e.g. logger=csv.
- Remove the trainer=gpu parameter to train on the CPU.
- Add +trainer.fast_dev_run=true to run a quick development test with only two steps.

Run the inference on the test set (the model_save_dirs from the training step will be used as the model_name_or_path; see the content of the job_return_value.json in your logs/training folder for the exact paths):
python src/predict.py \
dataset=dialam2024_prepared \
+dataset.input.name=merged_relations \
model_name_or_path=MODEL/SAVE/DIR1,MODEL/SAVE/DIR2,MODEL/SAVE/DIR3 \
+pipeline.device=0 \
+pipeline.batch_size=8 \
--multirun
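If you are unsure which model_save_dirs the training produced, the following sketch (a hypothetical helper, not part of the repository) lists all job_return_value.json files under logs/training and prints their contents so you can copy the exact paths; the JSON layout depends on your runs, so inspect the output yourself:
import json
from pathlib import Path

# print every job_return_value.json below logs/training together with its
# content; the model save directories from the training step appear in there
for path in sorted(Path("logs/training").rglob("job_return_value.json")):
    print(path)
    print(json.dumps(json.loads(path.read_text()), indent=2))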
Alternatively, run the inference with the model checkpoint from the paper (from the Hugging Face Hub):
python src/predict.py \
dataset=dialam2024_prepared \
+dataset.input.name=merged_relations \
model_name_or_path=DFKI-SLT/dfki-mlst-deberta-v3 \
+pipeline.device=0 \
+pipeline.batch_size=8
Evaluate the results against the annotated shared task test data stored in data/evaluation_data.
First, convert the serialized JSON documents into the JSON format required for the DialAM Shared Task, with each nodeset in a separate JSON file (note that INPUT/DATA/DIR is the path to one of the directories where the predicted outputs from the previous step are stored):
python src/utils/convert_documents2nodesets.py \
--input_dir=INPUT/DATA/DIR \
--output_dir=PREDICTION/DATA/DIR
Second, evaluate using the official script for argumentative relations:
python src/evaluation/eval_official.py \
--gold_dir=data/evaluation_data \
--predictions_dir=PREDICTION/DATA/DIR \
--mode=arguments
... and for illocutionary relations:
python src/evaluation/eval_official.py \
--gold_dir=data/evaluation_data \
--predictions_dir=PREDICTION/DATA/DIR \
--mode=illocutions
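To run both evaluation modes in one go, a small wrapper like the following sketch can be used (it merely chains the two commands shown above; PREDICTION/DATA/DIR is a placeholder as before):
import subprocess

# run the official evaluation for both relation types sequentially
for mode in ("arguments", "illocutions"):
    subprocess.run(
        [
            "python", "src/evaluation/eval_official.py",
            "--gold_dir=data/evaluation_data",
            "--predictions_dir=PREDICTION/DATA/DIR",
            "--mode=" + mode,
        ],
        check=True,
    )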
Depending on the hardware used for prediction, the results may vary slightly; on an H100 we achieved the following scores for argumentative relations:
general.p: 0.61956118154322
general.r: 0.5328788951094529
general.f1: 0.5533113738298515
focused.p: 0.43876262626262624
focused.r: 0.2481962481962482
focused.f1: 0.3039512181223411
... and for illocutionary relations:
general.p: 0.8108195336938596
general.r: 0.7925352836170968
general.f1: 0.7878014615826178
focused.p: 0.691260002623639
focused.r: 0.6623577402989167
focused.f1: 0.6610087288770674
Note: This requires setting up a Python environment as described in the Environment Setup section.
import json
# this import is necessary to load the pipeline
from pie_modules.taskmodules import RETextClassificationWithIndicesTaskModule
from pytorch_ie import AutoPipeline
from pytorch_ie.annotations import NaryRelation, LabeledSpan
from dataset_builders.pie.dialam2024.dialam2024 import PREFIX_SEPARATOR, merge_relations, unmerge_relations, \
REVERSE_SUFFIX, NONE_LABEL, convert_to_document
from src.document.types import SimplifiedDialAM2024Document, TextDocumentWithLabeledEntitiesAndNaryRelations
from src.utils.nodeset_utils import Nodeset
from src.utils.prepare_data import prepare_nodeset
# load the model from the Huggingface hub
# to execute on a GPU, pass device="0" (and batch_size=...) to .from_pretrained
pipe = AutoPipeline.from_pretrained("DFKI-SLT/dfki-mlst-deberta-v3")
# disable statistics collection because it can cause errors when no model inputs are created
pipe.taskmodule.collect_statistics = False
# load nodeset in the format of the shared task test data
path = "data/evaluation_data/test_map1.json"
with open(path, "r") as f:
    nodeset: Nodeset = json.load(f)
nodeset_id = "test_map1"
# Clean up the nodeset (remove isolated L-nodes, loops, etc.) and
# add candidate relation nodes with label "NONE" for all three relation types,
# i.e. S-Nodes, YA-S2TA-Nodes, YA-I2L-Nodes.
cleaned_nodeset: Nodeset = prepare_nodeset(
    nodeset=nodeset,
    nodeset_id=nodeset_id,
    s_node_text=NONE_LABEL,
    ya_node_text=NONE_LABEL,
    s_node_type="RA",
    reversed_text_suffix=REVERSE_SUFFIX,
    l2i_similarity_measure="lcsstr",  # use longest common substring
    integrate_gold_data=False,
)
# convert the nodeset to a PyTorch-IE document
doc: SimplifiedDialAM2024Document = convert_to_document(nodeset=cleaned_nodeset, nodeset_id=nodeset_id)
# merge all relation layers into one; the labels are prefixed with the relation type (S, YA-S2TA, or YA-I2L)
doc: TextDocumentWithLabeledEntitiesAndNaryRelations = merge_relations(
    document=doc,
    labeled_span_layer="l_nodes",
    nary_relation_layers=["ya_i2l_nodes", "ya_s2ta_nodes", "s_nodes"],
    sep=PREFIX_SEPARATOR,
)
# inference (works also with multiple documents at once)
doc: TextDocumentWithLabeledEntitiesAndNaryRelations = pipe(doc)
doc: SimplifiedDialAM2024Document = unmerge_relations(document=doc, sep=PREFIX_SEPARATOR)
# helper structure to get the node IDs from the relation arguments, if needed
l_node2id = dict(zip(doc.l_nodes, doc.metadata["l_node_ids"]))
# example of how to get the ID and role of the first argument of the first predicted S-node
s_node_example: NaryRelation = doc.s_nodes.predictions[0]
s_arg0: LabeledSpan = s_node_example.arguments[0]
s_arg0_role: str = s_node_example.roles[0]
# print the original node ID and argument role
print(l_node2id[s_arg0], s_arg0_role)
# 22_163907070207948843 source
# print all predictions
print("Predictions:")
print("S-Nodes:")
for rel in doc.s_nodes.predictions:
    if rel.label != "NONE":
        # Note: If the s-node label ends with REVERSE_SUFFIX ("-rev"), the argument roles should be swapped
        # in a post-processing step to get the correct order of the arguments in the relation,
        # i.e. (target, target, source) instead of (source, source, target). A sketch of such a
        # post-processing step follows below the expected output.
        print(rel.resolve())
print("YA-S2TA-Nodes:")
for rel in doc.ya_i2l_nodes.predictions:
if rel.label != "NONE":
print(rel.resolve())
print("YA-I2L-Nodes:")
for rel in doc.ya_s2ta_nodes.predictions:
if rel.label != "NONE":
print(rel.resolve())
print("done")
# Predictions:
# S-Nodes:
# ('Default Rephrase', (('source', ('L', 'AudienceMember 20210912QT02: did what we were told')), ('target', ('L', 'AudienceMember 20210912QT02: We followed the guidance'))))
# ('Default Inference-rev', (('source', ('L', 'AudienceMember 20210912QT02: It makes me sick')), ('target', ('L', "AudienceMember 20210912QT02: Then you hear they're having Christmas parties while we are suffering"))))
# ...
# YA-I2L-Nodes:
# ('Asserting', (('source', ('L', 'AudienceMember 20210912QT02: My parents both had COVID')),))
# ('Asserting', (('source', ('L', 'AudienceMember 20210912QT02: We followed the guidance')),))
# ...
# YA-S2TA-Nodes:
# ('Restating', (('source', ('L', 'AudienceMember 20210912QT02: We followed the guidance')), ('target', ('L', 'AudienceMember 20210912QT02: did what we were told'))))
# ('Arguing', (('source', ('L', "AudienceMember 20210912QT02: Then you hear they're having Christmas parties while we are suffering")), ('target', ('L', 'AudienceMember 20210912QT02: It makes me sick'))))
# ...
# done
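The "-rev" post-processing mentioned in the comments above could look like the following sketch (normalize_s_node is a hypothetical helper, not part of the repository; it builds plain tuples instead of mutating the immutable annotations):
def normalize_s_node(rel: NaryRelation):
    """Strip REVERSE_SUFFIX from the label and swap source/target roles."""
    label = rel.label
    roles = list(rel.roles)
    if label.endswith(REVERSE_SUFFIX):
        label = label[: -len(REVERSE_SUFFIX)]
        # for reversed relations, sources become targets and vice versa
        roles = ["target" if role == "source" else "source" for role in roles]
    return label, tuple(zip(roles, rel.arguments))

for rel in doc.s_nodes.predictions:
    if rel.label != NONE_LABEL:
        print(normalize_s_node(rel))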
# clone project
git clone https://github.com/ArneBinder/dialam-2024-shared-task.git
cd dialam-2024-shared-task
# [OPTIONAL] create conda environment
conda create -n dialam-2024-shared-task python=3.9
conda activate dialam-2024-shared-task
# install PyTorch according to instructions
# https://pytorch.org/get-started/
# install remaining requirements
pip install -r requirements.txt
# [OPTIONAL] symlink log directories and the default model directory to
# "$HOME/experiments/dialam-2024-shared-task" since they can grow a lot
bash setup_symlinks.sh $HOME/experiments/dialam-2024-shared-task
# [OPTIONAL] set any environment variables by creating an .env file
# 1. copy the provided example file:
cp .env.example .env
# 2. edit the .env file for your needs!
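As an optional sanity check (a minimal sketch, not part of the setup scripts), you can verify the PyTorch installation and GPU visibility from Python:
# show the installed PyTorch version and whether a GPU is visible
import torch
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())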
Have a look into the train.yaml config to see all available options.
Train model with default configuration
# train on CPU
python src/train.py
# train on GPU
python src/train.py trainer=gpu
Execute a fast development run (train for two steps)
python src/train.py +trainer.fast_dev_run=true
Train model with chosen experiment configuration from configs/experiment/
python src/train.py experiment=conll2003
You can override any parameter from the command line like this:
python src/train.py trainer.max_epochs=20 datamodule.batch_size=64
Start multiple runs at once (multirun):
python src/train.py seed=42,43 --multirun
Notes:
- The outputs of the runs are stored in logs/multirun/, see the last logging output for the exact path.

This will evaluate the model on the test set of the chosen dataset using the metrics implemented within the model. See configs/dataset/ for available datasets.
Have a look into the evaluate.yaml config to see all available options.
python src/evaluate.py dataset=conll2003 model_name_or_path=pie/example-ner-spanclf-conll03
Notes:
- Add trainer=gpu to run on the GPU.

This will run inference on the given dataset and split. See configs/dataset/ for available datasets.
The result documents including the predicted annotations will be stored in the predictions/ directory (the exact location will be printed to the console).
Have a look into the predict.yaml config to see all available options.
python src/predict.py dataset=conll2003 model_name_or_path=pie/example-ner-spanclf-conll03
Notes:
- Add +pipeline.device=0 to run the inference on GPU 0.

This will evaluate serialized documents including predicted annotations (see Inference) using a document metric. See configs/metric/ for available metrics.
Have a look into the evaluate_documents.yaml config to see all available options.
python src/evaluate_documents.py metric=f1 metric.layer=entities +dataset.data_dir=PATH/TO/DIR/WITH/SPLITS
Note: By default, this utilizes the dataset provided by the from_serialized_documents configuration. This configuration is designed to facilitate the loading of serialized documents, as generated during the Inference step. It requires setting the parameter data_dir. If you want to use a different dataset, you can override the dataset parameter as usual with any existing dataset config, e.g. dataset=conll2003. But calculating the F1 score on the bare conll2003 dataset does not make much sense, because it does not contain any predictions. However, it could be used with statistical metrics such as count_text_tokens or count_entity_labels.
# run pre-commit: code formatting, code analysis, static type checking, and more (see .pre-commit-config.yaml)
pre-commit run -a
# run tests
pytest -k "not slow" --cov --cov-report term-missing
@inproceedings{binder-etal-2024-dfki,
title = "{DFKI}-{MLST} at {D}ial{AM}-2024 Shared Task: System Description",
author = "Binder, Arne and
Anikina, Tatiana and
Hennig, Leonhard and
Ostermann, Simon",
editor = "Ajjour, Yamen and
Bar-Haim, Roy and
El Baff, Roxanne and
Liu, Zhexiong and
Skitalinskaya, Gabriella",
booktitle = "Proceedings of the 11th Workshop on Argument Mining (ArgMining 2024)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.argmining-1.9",
doi = "10.18653/v1/2024.argmining-1.9",
pages = "93--102",
abstract = "This paper presents the dfki-mlst submission for the DialAM shared task (Ruiz-Dolz et al., 2024) on identification of argumentative and illocutionary relations in dialogue. Our model achieves best results in the global setting: 48.25 F1 at the focused level when looking only at the related arguments/locutions and 67.05 F1 at the general level when evaluating the complete argument maps. We describe our implementation of the data pre-processing, relation encoding and classification, evaluating 11 different base models and performing experiments with, e.g., node text combination and data augmentation. Our source code is publicly available.",
}