How to use the script for checking node alignments:

(1) Install SentenceTransformers with pip install sentence-transformers (see requirements.txt)

(2) Install textdistance with pip install textdistance (see requirements.txt)

(3) Run the script as follows:

$ python3 src/utils/align_i2l_nodes.py path_to_nodesets similarity_measure nodeset_id

For example:

$ python3 src/utils/align_i2l_nodes.py data cossim 17940

data is the path to the dataset with the nodesets in JSON format

cossim is the similarity measure to use. The similarity measure can take the following values:

cossim: cosine similarity with SentenceTransformers (embedding-based)
jaccard: Jaccard index (token-based)
tversky: Tversky index (token-based)
sorensen: Sorensen-Dice coefficient (token-based)
tanimoto: Tanimoto distance (token-based)
overlap: Overlap coefficient (token-based)
bag: Bag distance (token-based)
lcsstr: Longest common substring similarity (sequence-based)
ratcliff_obershelp: Ratcliff-Obershelp similarity (sequence-based)

For more details on textdistance metrics see: https://pypi.org/project/textdistance/

For more details on SentenceTransformers models see: https://www.sbert.net/docs/pretrained_models.html
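To give a feel for how the token-based measures differ, here is a minimal pure-Python sketch of three of them on token sets. It is an illustration only: the whitespace tokenizer, the function names and the two example strings are assumptions, and textdistance's own implementations may tokenize and count differently.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def sorensen_dice(a, b):
    """Twice the intersection size divided by the sum of both set sizes."""
    ta, tb = tokenize(a), tokenize(b)
    total = len(ta) + len(tb)
    return 2 * len(ta & tb) / total if total else 1.0


def overlap(a, b):
    """Intersection size divided by the size of the smaller set."""
    ta, tb = tokenize(a), tokenize(b)
    smallest = min(len(ta), len(tb))
    return len(ta & tb) / smallest if smallest else 1.0


# Hypothetical node texts: an I-node and an L-node that contains it.
i_node = "the budget was cut"
l_node = "Bob : the budget was cut last year"

print(jaccard(i_node, l_node))        # 4 shared tokens / 8 in the union
print(sorensen_dice(i_node, l_node))  # 2 * 4 / (4 + 8)
print(overlap(i_node, l_node))        # 4 / min(4, 8)
```

In this example the overlap coefficient is 1.0 as soon as the shorter text is fully contained in the longer one, while Jaccard and Sorensen-Dice also penalize the extra tokens.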

17940 is the nodeset id (in this example, for nodeset17940.json).

Note: If no nodeset id is provided, the script will compute the matches and mismatches between the I- and L-nodes for all nodesets in the directory.
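The optional nodeset id could be handled roughly as in the sketch below. The helper name select_nodeset_files is hypothetical and not taken from the script; only the nodeset<id>.json file naming comes from the description above.

```python
from pathlib import Path


def select_nodeset_files(data_dir, nodeset_id=None):
    """Return the nodeset files to process (hypothetical helper).

    With an id, only nodeset<id>.json is returned; without one,
    every nodeset*.json file in the directory is returned, sorted by name.
    """
    data_dir = Path(data_dir)
    if nodeset_id is not None:
        return [data_dir / f"nodeset{nodeset_id}.json"]
    return sorted(data_dir.glob("nodeset*.json"))
```

For instance, select_nodeset_files("data", 17940) would yield only data/nodeset17940.json, while select_nodeset_files("data") would yield every nodeset file in data.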

Note: This also adds src/nodeset_utils.py with some helper methods.

First results

Usage:

$ python3 src/utils/align_i2l_nodes.py path_to_nodesets similarity_measure (nodeset_id)

Example:

$ python3 src/utils/align_i2l_nodes.py data overlap

EDIT: Here are the results of comparing different methods with linear_sum_assignment suggested by Arne:

Metric	Matched I-L	Accuracy (%)
Overlap	14956	72.89
Bag	18440	89.87
Jaccard	18706	91.16
Tversky	18707	91.17
Sorensen	18902	92.12
Tanimoto	18921	92.21
all-mpnet-base-v2	19715	96.08
all-MiniLM-L6-v2	19747	96.24
all-distilroberta-v1	19802	96.51
Ratcliff-Obershelp	19980	97.37
LongestSubstring	20104	97.98

Note that in total we have 20519 I-L pairings; the accuracy is the percentage of these that were matched. I tried to align I-to-L and not the other way around because some L-nodes do not have any I-node alignments, but every I-node must have a corresponding L-node (as far as I understand).

To simplify the task, we can assume that each I-node must be connected to a single L-node via a YA-node. The gold data contain 44 L-nodes that are aligned to multiple I-nodes. I am not sure whether this is a bug or a feature, but I think we can ignore such cases for now.
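As a rough illustration of the assignment step, the sketch below runs SciPy's linear_sum_assignment on a small made-up similarity matrix and scores the result against a made-up gold alignment. The matrix values, the gold mapping and the variable names are assumptions for illustration; the script may build its similarity matrix and compute accuracy differently.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy similarity matrix: rows are I-nodes, columns are L-nodes.
# There are more L-nodes than I-nodes, since some L-nodes have no I-node.
similarity = np.array([
    [0.9, 0.2, 0.1, 0.3],
    [0.8, 0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6, 0.2],
])

# Find the one-to-one assignment that maximizes the total similarity.
rows, cols = linear_sum_assignment(similarity, maximize=True)
predicted = {int(i): int(l) for i, l in zip(rows, cols)}

# Made-up gold alignment (I-node index -> L-node index).
gold = {0: 0, 1: 1, 2: 3}

matched = sum(predicted[i] == gold[i] for i in gold)
accuracy = 100 * matched / len(gold)
print(predicted, matched, round(accuracy, 2))
```

Because the assignment is one-to-one, an L-node can be chosen for at most one I-node. Gold cases where one L-node is aligned to multiple I-nodes therefore cannot all be recovered this way, which fits the simplification described above.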

ArneBinder / dialam-2024-shared-task

Node alignment experiments #6
