ArneBinder / dialam-2024-shared-task

see http://dialam.arg.tech/

Node alignment experiments #6

Closed ArneBinder closed 3 months ago

ArneBinder commented 3 months ago

How to use the script for checking node alignments:

(1) Install SentenceTransformers with pip install sentence-transformers (see requirements.txt)

(2) Install textdistance with pip install textdistance (see requirements.txt)

(3) Run the script as follows:

$ python3 src/utils/align_i2l_nodes.py path_to_nodesets similarity_measure nodeset_id

For example:

$ python3 src/utils/align_i2l_nodes.py data cossim 17940

data is the path to the dataset with the nodesets in JSON format

cossim is the similarity measure to use, similarity measure can have the following values:

17940 is the nodeset id (in this example, for nodeset17940.json).

Note: If no nodeset id is provided, the script computes the matches and mismatches between the I- and L-nodes for all nodesets in the directory.

Note: This also adds src/nodeset_utils.py with some helper methods.
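For readers who want to see what "I-to-L alignment" means at the data level, here is a minimal sketch of reading a nodeset and extracting the gold I-L pairings. It is not the code from src/utils/align_i2l_nodes.py or src/nodeset_utils.py. It assumes an AIF-style JSON layout with "nodes" (fields nodeID, text, type) and "edges" (fields fromID, toID), and the tiny nodeset and the helper names nodes_of_type and gold_i2l_alignments are made up for illustration.

```python
import json

# Hypothetical, hand-made nodeset in the AIF-style JSON layout assumed here.
EXAMPLE_NODESET = json.loads("""
{
  "nodes": [
    {"nodeID": "1", "text": "Bob: We should leave now", "type": "L"},
    {"nodeID": "2", "text": "Asserting", "type": "YA"},
    {"nodeID": "3", "text": "we should leave now", "type": "I"},
    {"nodeID": "4", "text": "Ann: Why?", "type": "L"}
  ],
  "edges": [
    {"fromID": "1", "toID": "2"},
    {"fromID": "2", "toID": "3"}
  ]
}
""")


def nodes_of_type(nodeset, node_type):
    """Return {nodeID: text} for all nodes of the given type."""
    return {n["nodeID"]: n["text"] for n in nodeset["nodes"] if n["type"] == node_type}


def gold_i2l_alignments(nodeset):
    """Return {I-node id: L-node id} for every L -> YA -> I path."""
    types = {n["nodeID"]: n["type"] for n in nodeset["nodes"]}
    # Collect outgoing edges per source node.
    targets = {}
    for edge in nodeset["edges"]:
        targets.setdefault(edge["fromID"], []).append(edge["toID"])
    gold = {}
    for l_id, l_type in types.items():
        if l_type != "L":
            continue
        for ya_id in targets.get(l_id, []):
            if types.get(ya_id) != "YA":
                continue
            for i_id in targets.get(ya_id, []):
                if types.get(i_id) == "I":
                    gold[i_id] = l_id
    return gold


i_nodes = nodes_of_type(EXAMPLE_NODESET, "I")
l_nodes = nodes_of_type(EXAMPLE_NODESET, "L")
gold = gold_i2l_alignments(EXAMPLE_NODESET)
print(i_nodes, l_nodes, gold)
```

In this toy nodeset, L-node 4 has no I-node counterpart, which is the situation that motivates aligning from the I side.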

ArneBinder commented 3 months ago

First results

Usage:

$ python3 src/utils/align_i2l_nodes.py path_to_nodesets similarity_measure (nodeset_id)

Example:

$ python3 src/utils/align_i2l_nodes.py data overlap

EDIT: Here are the results of comparing different methods with linear_sum_assignment suggested by Arne:

Metric                 Matched I-L   Accuracy (%)
Overlap                14956         72.89
Bag                    18440         89.87
Jaccard                18706         91.16
Tversky                18707         91.17
Sorensen               18902         92.12
Tanimoto               18921         92.21
all-mpnet-base-v2      19715         96.08
all-MiniLM-L6-v2       19747         96.24
all-distilroberta-v1   19802         96.51
Ratcliff-Obershelp     19980         97.37
LongestSubstring       20104         97.98
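To make a few of the measures in the table more concrete, here is a rough, stdlib-only sketch of the overlap, Jaccard and Sorensen coefficients, plus a sequence-matching ratio in the spirit of Ratcliff-Obershelp. These are textbook definitions over lowercased whitespace tokens, which is an assumption of this sketch. The script relies on textdistance, whose tokenisation and exact definitions may differ, so the numbers here are not expected to reproduce the table.

```python
from difflib import SequenceMatcher


def tokens(text):
    """Lowercased whitespace tokens as a set (a simplifying assumption)."""
    return set(text.lower().split())


def overlap(a, b):
    """Overlap coefficient: |A & B| / min(|A|, |B|)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def jaccard(a, b):
    """Jaccard index: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def sorensen(a, b):
    """Sorensen-Dice coefficient: 2 |A & B| / (|A| + |B|)."""
    ta, tb = tokens(a), tokens(b)
    total = len(ta) + len(tb)
    return 2 * len(ta & tb) / total if total else 0.0


def gestalt(a, b):
    """difflib's matching ratio, similar in spirit to Ratcliff-Obershelp."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


i_text = "we should leave now"
l_text = "Bob: We should leave now"
scores = {
    "overlap": overlap(i_text, l_text),
    "jaccard": jaccard(i_text, l_text),
    "sorensen": sorensen(i_text, l_text),
    "gestalt": gestalt(i_text, l_text),
}
print(scores)
```

All four functions return a value between 0 and 1, so any of them can be plugged in as the pairwise score when comparing an I-node text with an L-node text.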

Note that in total we have 20519 I-L pairings; accuracy is the percentage of these that were matched. I aligned I-to-L rather than L-to-I because some L-nodes have no I-node alignment, whereas every I-node must have a corresponding L-node (as far as I understand). To simplify the task, we can assume that each I-node is connected to exactly one L-node via a YA-node. The gold data contain 44 L-nodes that are aligned to multiple I-nodes. I am not sure whether this is a bug or a feature, but I think we can ignore such cases for now.
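As an illustration of the one-to-one I-to-L alignment with linear_sum_assignment, here is a minimal sketch using scipy.optimize.linear_sum_assignment on a pairwise similarity matrix. It is not the script's actual implementation: the function name align_i2l, the difflib-based similarity and the toy node texts are assumptions made for this example.

```python
from difflib import SequenceMatcher

import numpy as np
from scipy.optimize import linear_sum_assignment


def similarity(a, b):
    """Stand-in similarity in [0, 1]; any measure from the table could be used."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_i2l(i_nodes, l_nodes, sim=similarity):
    """Assign each I-node to a distinct L-node, maximising total similarity."""
    i_ids, l_ids = list(i_nodes), list(l_nodes)
    if not i_ids or not l_ids:
        return {}
    # Rows are I-nodes, columns are L-nodes.
    scores = np.array([[sim(i_nodes[i], l_nodes[l]) for l in l_ids] for i in i_ids])
    rows, cols = linear_sum_assignment(scores, maximize=True)
    return {i_ids[r]: l_ids[c] for r, c in zip(rows, cols)}


# Hypothetical toy data: two I-nodes, three L-nodes (one L-node stays unaligned).
i_nodes = {"i1": "we should leave now", "i2": "the train departs at noon"}
l_nodes = {
    "l1": "Ann: The train departs at noon",
    "l2": "Bob: We should leave now",
    "l3": "Ann: Why?",
}
gold = {"i1": "l2", "i2": "l1"}

predicted = align_i2l(i_nodes, l_nodes)
matched = sum(predicted.get(i_id) == l_id for i_id, l_id in gold.items())
accuracy = 100 * matched / len(gold)
print(predicted, matched, accuracy)
```

Because the assignment is one-to-one, this setup cannot reproduce the gold cases where one L-node is aligned to multiple I-nodes, and if a nodeset had more I-nodes than L-nodes, some I-nodes would be left unassigned.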