This repository contains the code for our submission to the DialAM-2024 Shared Task, as described in the paper DFKI-MLST at DialAM-2024 Shared Task: System Description (Binder et al., ArgMining 2024) and the poster presented at the ArgMining 2024 workshop, co-located with ACL 2024 in Bangkok, Thailand. The task was part of the workshop and focused on the identification of argumentative and illocutionary relations in dialogue. See the official website for more information.
We present the dfki-mlst submission for the DialAM shared task on identification of argumentative and illocutionary relations in dialogue. Our model achieves the best results in the global setting: 48.25 F1 at the focused level when looking only at the related arguments/locutions and 67.05 F1 at the general level when evaluating the complete argument maps. We describe our implementation of the data pre-processing pipeline, relation encoding and classification, evaluating 11 different base models and performing experiments with, e.g., node text combination and data augmentation. Our source code is publicly available.
Set up the environment as described in the Environment Setup section.
Train models with the configuration from the paper (this will execute 3 runs with different seeds):
python src/train.py \
experiment=dialam2024_merged_relations \
base_model_name=microsoft/deberta-v3-large \
model.task_learning_rate=1e-4 \
+model.classifier_dropout=0.1 \
datamodule.batch_size=8 \
trainer=gpu \
logger=none \
seed=1,2,3 \
+hydra.callbacks.save_job_return.integrate_multirun_result=true \
--multirun
Notes:
- logger=none is set because the default logger is Weights & Biases, which requires an account and API key. To use it, remove logger=none and provide your API key as WANDB_API_KEY in the .env file. For alternative logging options, see the configs in configs/logger/ which can be enabled by setting logger=LOGGER_CONFIG_NAME, e.g. logger=csv.
- Remove the trainer=gpu parameter to train on the CPU.
- Add +trainer.fast_dev_run=true to run a quick development test with only two steps.

Run the inference on the test set (the model_save_dirs from the training step will be used as the model_name_or_path; see the content of the job_return_value.json in your logs/training folder for the exact paths):
python src/predict.py \
dataset=dialam2024_prepared \
+dataset.input.name=merged_relations \
model_name_or_path=MODEL/SAVE/DIR1,MODEL/SAVE/DIR2,MODEL/SAVE/DIR3 \
+pipeline.device=0 \
+pipeline.batch_size=8 \
--multirun
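If you are unsure which model_save_dirs the training produced, the following sketch (a hypothetical helper, not part of the repository) lists all job_return_value.json files under logs/training and prints their contents so you can copy the exact paths; the JSON layout depends on your runs, so inspect the output yourself:
import json
from pathlib import Path

# print every job_return_value.json below logs/training together with its
# content; the model save directories from the training step appear in there
for path in sorted(Path("logs/training").rglob("job_return_value.json")):
    print(path)
    print(json.dumps(json.loads(path.read_text()), indent=2))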
Alternatively, run the inference with the model checkpoint from the paper (from the Hugging Face Hub):
python src/predict.py \
dataset=dialam2024_prepared \
+dataset.input.name=merged_relations \
model_name_or_path=DFKI-SLT/dfki-mlst-deberta-v3 \
+pipeline.device=0 \
+pipeline.batch_size=8
Evaluate the results against the annotated shared task test data stored in data/evaluation_data.
First, convert the serialized JSON documents into the JSON format required for the DialAM Shared Task, with each nodeset in a separate JSON file (note that INPUT/DATA/DIR is the path to one of the directories where the predicted outputs from the previous step are stored):
python src/utils/convert_documents2nodesets.py \
--input_dir=INPUT/DATA/DIR \
--output_dir=PREDICTION/DATA/DIR
Second, evaluate using the official script for argumentative relations:
python src/evaluation/eval_official.py \
--gold_dir=data/evaluation_data \
--predictions_dir=PREDICTION/DATA/DIR \
--mode=arguments
... and for illocutionary relations:
python src/evaluation/eval_official.py \
--gold_dir=data/evaluation_data \
--predictions_dir=PREDICTION/DATA/DIR \
--mode=illocutions
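To run both evaluation modes in one go, a small wrapper like the following sketch can be used (it merely chains the two commands shown above; PREDICTION/DATA/DIR is a placeholder as before):
import subprocess

# run the official evaluation for both relation types sequentially
for mode in ("arguments", "illocutions"):
    subprocess.run(
        [
            "python", "src/evaluation/eval_official.py",
            "--gold_dir=data/evaluation_data",
            "--predictions_dir=PREDICTION/DATA/DIR",
            "--mode=" + mode,
        ],
        check=True,
    )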
Depending on the hardware used for prediction, the results may vary slightly; on an H100 we achieved the following scores for argumentative relations:
general.p: 0.61956118154322
general.r: 0.5328788951094529
general.f1: 0.5533113738298515
focused.p: 0.43876262626262624
focused.r: 0.2481962481962482
focused.f1: 0.3039512181223411
... and for illocutionary relations:
general.p: 0.8108195336938596
general.r: 0.7925352836170968
general.f1: 0.7878014615826178
focused.p: 0.691260002623639
focused.r: 0.6623577402989167
focused.f1: 0.6610087288770674
Note: This requires setting up a Python environment as described in the Environment Setup section.
import json
# this import is necessary to load the pipeline
from pie_modules.taskmodules import RETextClassificationWithIndicesTaskModule
from pytorch_ie import AutoPipeline
from pytorch_ie.annotations import NaryRelation, LabeledSpan
from dataset_builders.pie.dialam2024.dialam2024 import PREFIX_SEPARATOR, merge_relations, unmerge_relations, \
REVERSE_SUFFIX, NONE_LABEL, convert_to_document
from src.document.types import SimplifiedDialAM2024Document, TextDocumentWithLabeledEntitiesAndNaryRelations
from src.utils.nodeset_utils import Nodeset
from src.utils.prepare_data import prepare_nodeset
# load the model from the Huggingface hub
# to execute on a GPU, pass device="0" (and batch_size=...) to .from_pretrained
pipe = AutoPipeline.from_pretrained("DFKI-SLT/dfki-mlst-deberta-v3")
# disable statistics collection because it can cause errors when no model inputs are created
pipe.taskmodule.collect_statistics = False
# load nodeset in the format of the shared task test data
path = "data/evaluation_data/test_map1.json"
with open(path, "r") as f:
    nodeset: Nodeset = json.load(f)
nodeset_id = "test_map1"
# Clean up the nodeset (remove isolated L-nodes, loops, etc.) and
# add candidate relation nodes with label "NONE" for all three relation types,
# i.e. S-Nodes, YA-S2TA-Nodes, YA-I2L-Nodes.
cleaned_nodeset: Nodeset = prepare_nodeset(
    nodeset=nodeset,
    nodeset_id=nodeset_id,
    s_node_text=NONE_LABEL,
    ya_node_text=NONE_LABEL,
    s_node_type="RA",
    reversed_text_suffix=REVERSE_SUFFIX,
    l2i_similarity_measure="lcsstr",  # use longest common substring
    integrate_gold_data=False,
)
# convert the nodeset to a PyTorch-IE document
doc: SimplifiedDialAM2024Document = convert_to_document(nodeset=cleaned_nodeset, nodeset_id=nodeset_id)
# merge all relation layers into one; the labels are prefixed with the relation type (S, YA-S2TA, or YA-I2L)
doc: TextDocumentWithLabeledEntitiesAndNaryRelations = merge_relations(
    document=doc,
    labeled_span_layer="l_nodes",
    nary_relation_layers=["ya_i2l_nodes", "ya_s2ta_nodes", "s_nodes"],
    sep=PREFIX_SEPARATOR,
)
# inference (works also with multiple documents at once)
doc: TextDocumentWithLabeledEntitiesAndNaryRelations = pipe(doc)
doc: SimplifiedDialAM2024Document = unmerge_relations(document=doc, sep=PREFIX_SEPARATOR)
# helper structure to get the node IDs from the relation arguments, if needed
l_node2id = dict(zip(doc.l_nodes, doc.metadata["l_node_ids"]))
# example of how to get the ID and role of the first argument of the first predicted S-node
s_node_example: NaryRelation = doc.s_nodes.predictions[0]
s_arg0: LabeledSpan = s_node_example.arguments[0]
s_arg0_role: str = s_node_example.roles[0]
# print the original node ID and argument role
print(l_node2id[s_arg0], s_arg0_role)
# 22_163907070207948843 source
# print all predictions
print("Predictions:")
print("S-Nodes:")
for rel in doc.s_nodes.predictions:
    if rel.label != "NONE":
        # Note: If the s-node label ends with REVERSE_SUFFIX ("-rev"), the argument roles should be swapped
        # in a post-processing step to get the correct order of the arguments in the relation,
        # i.e. (target, target, source) instead of (source, source, target). A sketch of such a
        # post-processing step follows below the expected output.
        print(rel.resolve())
print("YA-S2TA-Nodes:")
for rel in doc.ya_i2l_nodes.predictions:
if rel.label != "NONE":
print(rel.resolve())
print("YA-I2L-Nodes:")
for rel in doc.ya_s2ta_nodes.predictions:
if rel.label != "NONE":
print(rel.resolve())
print("done")
# Predictions:
# S-Nodes:
# ('Default Rephrase', (('source', ('L', 'AudienceMember 20210912QT02: did what we were told')), ('target', ('L', 'AudienceMember 20210912QT02: We followed the guidance'))))
# ('Default Inference-rev', (('source', ('L', 'AudienceMember 20210912QT02: It makes me sick')), ('target', ('L', "AudienceMember 20210912QT02: Then you hear they're having Christmas parties while we are suffering"))))
# ...
# YA-I2L-Nodes:
# ('Asserting', (('source', ('L', 'AudienceMember 20210912QT02: My parents both had COVID')),))
# ('Asserting', (('source', ('L', 'AudienceMember 20210912QT02: We followed the guidance')),))
# ...
# YA-S2TA-Nodes:
# ('Restating', (('source', ('L', 'AudienceMember 20210912QT02: We followed the guidance')), ('target', ('L', 'AudienceMember 20210912QT02: did what we were told'))))
# ('Arguing', (('source', ('L', "AudienceMember 20210912QT02: Then you hear they're having Christmas parties while we are suffering")), ('target', ('L', 'AudienceMember 20210912QT02: It makes me sick'))))
# ...
# done
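The "-rev" post-processing mentioned in the comments above could look like the following sketch (normalize_s_node is a hypothetical helper, not part of the repository; it builds plain tuples instead of mutating the immutable annotations):
def normalize_s_node(rel: NaryRelation):
    """Strip REVERSE_SUFFIX from the label and swap source/target roles."""
    label = rel.label
    roles = list(rel.roles)
    if label.endswith(REVERSE_SUFFIX):
        label = label[: -len(REVERSE_SUFFIX)]
        # for reversed relations, sources become targets and vice versa
        roles = ["target" if role == "source" else "source" for role in roles]
    return label, tuple(zip(roles, rel.arguments))

for rel in doc.s_nodes.predictions:
    if rel.label != NONE_LABEL:
        print(normalize_s_node(rel))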
# clone project
git clone https://github.com/ArneBinder/dialam-2024-shared-task.git
cd dialam-2024-shared-task
# [OPTIONAL] create conda environment
conda create -n dialam-2024-shared-task python=3.9
conda activate dialam-2024-shared-task
# install PyTorch according to instructions
# https://pytorch.org/get-started/
# install remaining requirements
pip install -r requirements.txt
# [OPTIONAL] symlink log directories and the default model directory to
# "$HOME/experiments/dialam-2024-shared-task" since they can grow a lot
bash setup_symlinks.sh $HOME/experiments/dialam-2024-shared-task
# [OPTIONAL] set any environment variables by creating an .env file
# 1. copy the provided example file:
cp .env.example .env
# 2. edit the .env file for your needs!
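As an optional sanity check (a minimal sketch, not part of the setup scripts), you can verify the PyTorch installation and GPU visibility from Python:
# show the installed PyTorch version and whether a GPU is visible
import torch
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())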
Have a look into the train.yaml config to see all available options.
Train model with default configuration
# train on CPU
python src/train.py
# train on GPU
python src/train.py trainer=gpu
Execute a fast development run (train for two steps)
python src/train.py +trainer.fast_dev_run=true
Train model with chosen experiment configuration from configs/experiment/
python src/train.py experiment=conll2003
You can override any parameter from the command line like this:
python src/train.py trainer.max_epochs=20 datamodule.batch_size=64
Start multiple runs at once (multirun):
python src/train.py seed=42,43 --multirun
Notes:
- The outputs of the runs are stored in logs/multirun/, see the last logging output for the exact path.

This will evaluate the model on the test set of the chosen dataset using the metrics implemented within the model. See configs/dataset/ for available datasets.
Have a look into the evaluate.yaml config to see all available options.
python src/evaluate.py dataset=conll2003 model_name_or_path=pie/example-ner-spanclf-conll03
Notes:
- Add trainer=gpu to run on the GPU.

This will run inference on the given dataset and split. See configs/dataset/ for available datasets.
The result documents including the predicted annotations will be stored in the predictions/ directory (the exact location will be printed to the console).
Have a look into the predict.yaml config to see all available options.
python src/predict.py dataset=conll2003 model_name_or_path=pie/example-ner-spanclf-conll03
Notes:
- Add +pipeline.device=0 to run the inference on GPU 0.

This will evaluate serialized documents including predicted annotations (see Inference) using a document metric. See configs/metric/ for available metrics.
Have a look into the evaluate_documents.yaml config to see all available options.
python src/evaluate_documents.py metric=f1 metric.layer=entities +dataset.data_dir=PATH/TO/DIR/WITH/SPLITS
Note: By default, this utilizes the dataset provided by the from_serialized_documents configuration. This configuration is designed to facilitate the loading of serialized documents, as generated during the Inference step. It requires setting the parameter data_dir. If you want to use a different dataset, you can override the dataset parameter as usual with any existing dataset config, e.g. dataset=conll2003. But calculating the F1 score on the bare conll2003 dataset does not make much sense, because it does not contain any predictions. However, it could be used with statistical metrics such as count_text_tokens or count_entity_labels.
# run pre-commit: code formatting, code analysis, static type checking, and more (see .pre-commit-config.yaml)
pre-commit run -a
# run tests
pytest -k "not slow" --cov --cov-report term-missing
@inproceedings{binder-etal-2024-dfki,
title = "{DFKI}-{MLST} at {D}ial{AM}-2024 Shared Task: System Description",
author = "Binder, Arne and
Anikina, Tatiana and
Hennig, Leonhard and
Ostermann, Simon",
editor = "Ajjour, Yamen and
Bar-Haim, Roy and
El Baff, Roxanne and
Liu, Zhexiong and
Skitalinskaya, Gabriella",
booktitle = "Proceedings of the 11th Workshop on Argument Mining (ArgMining 2024)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.argmining-1.9",
doi = "10.18653/v1/2024.argmining-1.9",
pages = "93--102",
abstract = "This paper presents the dfki-mlst submission for the DialAM shared task (Ruiz-Dolz et al., 2024) on identification of argumentative and illocutionary relations in dialogue. Our model achieves best results in the global setting: 48.25 F1 at the focused level when looking only at the related arguments/locutions and 67.05 F1 at the general level when evaluating the complete argument maps. We describe our implementation of the data pre-processing, relation encoding and classification, evaluating 11 different base models and performing experiments with, e.g., node text combination and data augmentation. Our source code is publicly available.",
}