ArneBinder / dialam-2024-shared-task

see http://dialam.arg.tech/

add `dialam2024` dataset loading script #14

Closed ArneBinder closed 2 months ago

ArneBinder commented 2 months ago

This PR adds the Hugging Face (HF) and PyTorch-IE (PIE) dataset loading scripts for the dataset of the DialAM-2024 shared task, named dialam2024.

The HF script dataset_builders/hf/dialam2024/dialam2024.py

The HF script downloads the data from the shared task website and filters it with our compiled blacklist. The script is also available at https://huggingface.co/datasets/ArneBinder/dialam2024 and can be used as follows:

from datasets import load_dataset

ds = load_dataset("dataset_builders/hf/dialam2024", split="train")
# or directly from the hub. This works from anywhere, as long as the "datasets" package is installed!
# ds = load_dataset("ArneBinder/dialam2024", split="train")
example = ds[0]
assert example["id"] == "17918"
assert example["nodes"]["text"][0].startswith("Claire Cooper : Even if some children")
assert example["nodes"]["type"][0] == "L"

Note that example contains the data from "nodeset17918.json" as a nested dictionary. Be aware, however, that because Hugging Face datasets is backed by tabular data storage, the sequential entries of the nodeset (nodes / edges / locution values) are converted to a dictionary of lists! You may want to convert them back to lists of dicts to use our previous code.
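
The conversion back to lists of dicts can be sketched as follows. This is a minimal example and not part of the PR: the helper name dict_of_lists_to_list_of_dicts is made up for illustration, and the sample values are placeholders. Only the "text" and "type" keys are taken from the snippet above.

```python
from typing import Any, Dict, List


def dict_of_lists_to_list_of_dicts(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn {"a": [1, 2], "b": [3, 4]} into [{"a": 1, "b": 3}, {"a": 2, "b": 4}].

    Hypothetical helper for illustration; it is not part of the loading script.
    """
    keys = list(columns)
    # All columns must have the same number of rows, otherwise zip() would
    # silently drop the surplus entries.
    lengths = {len(columns[key]) for key in keys}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return [dict(zip(keys, row)) for row in zip(*(columns[key] for key in keys))]


# Placeholder data shaped like example["nodes"]; the values are invented.
nodes_as_columns = {
    "text": ["Claire Cooper : Even if some children", "some proposition"],
    "type": ["L", "I"],
}
nodes_as_rows = dict_of_lists_to_list_of_dicts(nodes_as_columns)
```

With the real data, the same call would presumably be applied to example["nodes"] and to the other sequential entries of the nodeset.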

The PIE script dataset_builders/pie/dialam2024/dialam2024.py

The PIE dataset loading script wraps the HF script with a conversion to typed documents (text+annotation layers). As above, get the dataset via (note the different script location):

from datasets import load_dataset
from src.utils.nodeset2document import SimplifiedDialAM2024Document

ds = load_dataset("dataset_builders/pie/dialam2024", split="train")
# here, the entries are documents
document = ds[0]
assert isinstance(document, SimplifiedDialAM2024Document)

# this is a span referencing document.text ...
first_l_node = document.l_nodes[0]
# ... with start and end offsets ...
assert first_l_node.start == 0
assert first_l_node.end == 214
# ... and a label.
assert first_l_node.label == "L"
# Get the respective text span by converting it to a string:
assert str(first_l_node).startswith("Claire Cooper : Even if some children")
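
To illustrate how such a span relates to the document text, here is a simplified, self-contained sketch. ToySpan is a hypothetical stand-in written only for this example. It is not the PIE annotation class, and the text is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySpan:
    """Hypothetical stand-in for a labeled span; NOT the actual PIE class."""

    text: str   # the full document text the span points into
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str

    def __str__(self) -> str:
        # Resolving the span means slicing the referenced text.
        return self.text[self.start:self.end]


# Invented two-turn text; offsets are derived from the turn lengths.
first_turn = "Speaker A : first turn."
second_turn = "Speaker B : second turn."
text = first_turn + " " + second_turn

first_span = ToySpan(text=text, start=0, end=len(first_turn), label="L")
second_span = ToySpan(text=text, start=len(first_turn) + 1, end=len(text), label="L")
```

The span stores only offsets and a label. The covered string is recovered on demand from the text it points into, which mirrors the str(first_l_node) call above.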

Other Notes

Follow-Up