add `dialam2024` dataset loading script

This adds the Hugging Face (HF) and PyTorch-IE (PIE) dataset loading scripts for the DialAM-2024 shared task dataset, named dialam2024.

The HF script `dataset_builders/hf/dialam2024/dialam2024.py`

The HF script downloads the data from the website of the shared task and removes the nodesets listed in our compiled blacklist. The script is also available at https://huggingface.co/datasets/ArneBinder/dialam2024 and can be used via:

from datasets import load_dataset

ds = load_dataset("dataset_builders/hf/dialam2024", split="train")
# or directly from the hub; this works from anywhere as long as the "datasets" package is installed
# ds = load_dataset("ArneBinder/dialam2024", split="train")
example = ds[0]
assert example["id"] == "17918"
assert example["nodes"]["text"][0].startswith("Claire Cooper : Even if some children")
assert example["nodes"]["type"][0] == "L"

Note that `example` contains the data from "nodeset17918.json" as a nested dictionary. Be aware that, because Hugging Face datasets are backed by tabular data storage, the sequential entries of the nodeset (nodes / edges / locution values) are converted to a dictionary of lists! You may want to convert them back to lists of dicts to use our previous code.
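The conversion back to a list of dicts could look roughly like the sketch below. The helper name dict_of_lists_to_list_of_dicts is hypothetical (it is not part of this PR), the toy input is made up and only mirrors the "text" and "type" keys used above, and the sketch assumes all lists of one entry have the same length.

```python
def dict_of_lists_to_list_of_dicts(columns):
    """Turn {"a": [1, 2], "b": [3, 4]} into [{"a": 1, "b": 3}, {"a": 2, "b": 4}].

    Hypothetical helper, not part of the loading scripts. Assumes all lists
    have the same length (zip would silently truncate to the shortest one).
    """
    keys = list(columns)
    rows = zip(*(columns[key] for key in keys))
    return [dict(zip(keys, row)) for row in rows]


# Made-up stand-in for example["nodes"]; real nodesets carry more fields.
nodes_columns = {
    "text": ["Speaker A : first locution", "first proposition"],
    "type": ["L", "I"],
}
nodes = dict_of_lists_to_list_of_dicts(nodes_columns)
```

With a loaded example, the same call would be applied to example["nodes"], and likewise to the edges and locution entries.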

The PIE script `dataset_builders/pie/dialam2024/dialam2024.py`

The PIE dataset loading script wraps the HF script with a conversion to typed documents (text+annotation layers). As above, get the dataset via (note the different script location):

from datasets import load_dataset
from src.utils.nodeset2document import SimplifiedDialAM2024Document

ds = load_dataset("dataset_builders/pie/dialam2024", split="train")
# but the entries are documents
document = ds[0]
assert isinstance(document, SimplifiedDialAM2024Document)

# this is a span referencing document.text ...
first_l_node = document.l_nodes[0]
# ... with start and end offsets ...
assert first_l_node.start == 0
assert first_l_node.end == 214
# ... and a label.
assert first_l_node.label == "L"
# Get the respective text span by converting it to a string:
assert str(first_l_node).startswith("Claire Cooper : Even if some children")

Other Notes

Passing data_dir to load_dataset makes the scripts use local data instead of the official zip file. This works for both the PIE and the HF script. Note that the blacklist is still applied in both cases! If you want to try out blacklisted nodesets, you need to remove them from the list in the HF script (the change is forwarded to the PIE script). IMPORTANT: Changing data_dir or the blacklist may produce caching issues, so if you expect something to change but it does not, try deleting the Hugging Face dataset cache (usually at $HOME/.cache/huggingface/datasets) and load the dataset again.
This also moves the method convert_to_document into the PIE script. nodeset2document.py can still be used as before to check the conversion, but I would recommend that we switch to real tests for any checks regarding the data.
This requires: #9
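For the caching issue mentioned above, a small cleanup helper might be handy. This is only a sketch: the function clear_dataset_cache is hypothetical, and it assumes that the cache entries of this dataset are subdirectories of the cache root whose names contain "dialam2024". Please check the contents of your cache directory before deleting anything.

```python
import shutil
from pathlib import Path


def clear_dataset_cache(cache_root, name_fragment="dialam2024"):
    """Delete subdirectories of cache_root whose name contains name_fragment.

    Hypothetical helper; the naming of the cache entries is an assumption.
    Returns the sorted names of the removed directories.
    """
    root = Path(cache_root).expanduser()
    removed = []
    if not root.is_dir():
        # Nothing to clean up if the cache directory does not exist.
        return removed
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and name_fragment in entry.name:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed
```

Calling clear_dataset_cache("~/.cache/huggingface/datasets") and then load_dataset again should rebuild the dataset from scratch. Adjust the path if your cache lives somewhere else.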

Follow-Up

[ ] simple training with taskmodule.relation_annotation=ya_i2l_nodes
[ ] add converter(s) that create nary_relations layer (1) from ya_i2l_nodes + ya_s2ta_nodes, or (2) from ya_i2l_nodes + s_nodes

ArneBinder / dialam-2024-shared-task