Usage: $ python3 src/utils/align_i2l_nodes.py path_to_nodesets similarity_measure (nodeset_id)
Example: $ python3 src/utils/align_i2l_nodes.py data overlap
EDIT:
Here are the results of comparing the different similarity measures using `linear_sum_assignment`, as suggested by Arne:
| Metric | Matched I-L | Accuracy (%) |
|---|---|---|
| Overlap | 14956 | 72.89 |
| Bag | 18440 | 89.87 |
| Jaccard | 18706 | 91.16 |
| Tversky | 18707 | 91.17 |
| Sorensen | 18902 | 92.12 |
| Tanimoto | 18921 | 92.21 |
| all-mpnet-base-v2 | 19715 | 96.08 |
| all-MiniLM-L6-v2 | 19747 | 96.24 |
| all-distilroberta-v1 | 19802 | 96.51 |
| Ratcliff-Obershelp | 19980 | 97.37 |
| LongestSubstring | 20104 | 97.98 |
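For orientation, a few of the string-based measures in the table can be approximated with the standard library alone. This is only a rough sketch, not the code used in the script (which relies on `textdistance`): the whitespace tokenisation, the normalisation by the shorter text, and the function names are my own assumptions, and `difflib` is used as a stand-in for the Ratcliff-Obershelp and longest-substring measures.

```python
from difflib import SequenceMatcher


def tokens(text):
    """Lowercased whitespace tokens as a set (assumed tokenisation)."""
    return set(text.lower().split())


def overlap(a, b):
    """Overlap coefficient: |A & B| / min(|A|, |B|)."""
    A, B = tokens(a), tokens(b)
    return len(A & B) / min(len(A), len(B)) if A and B else 0.0


def jaccard(a, b):
    """Jaccard index: |A & B| / |A | B|."""
    A, B = tokens(a), tokens(b)
    return len(A & B) / len(A | B) if A | B else 0.0


def sorensen(a, b):
    """Sorensen-Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    A, B = tokens(a), tokens(b)
    return 2 * len(A & B) / (len(A) + len(B)) if A or B else 0.0


def ratcliff_obershelp(a, b):
    """Character-level similarity via difflib's gestalt-style matcher."""
    return SequenceMatcher(None, a, b).ratio()


def longest_substring(a, b):
    """Longest common substring length, divided by the shorter text's length
    (this normalisation is an assumption)."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / min(len(a), len(b))


# Made-up I-node and L-node texts, for illustration only.
i_text = "the plan is too expensive"
l_text = "Bob : I think the plan is too expensive"
for measure in (overlap, jaccard, sorensen, ratcliff_obershelp, longest_substring):
    print(measure.__name__, round(measure(i_text, l_text), 3))
```

L-node texts typically wrap the I-node content in extra material such as speaker names, so measures that reward a long shared character run may tolerate that extra text better than set-based ones. That would be in line with the ranking in the table.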
Note that there are 20519 I-L pairings in total; the accuracy is the number of matched pairings divided by this total. I aligned I-to-L rather than the other way around because some L-nodes have no I-node alignment, whereas every I-node must have a corresponding L-node (as far as I understand). To simplify the task, we can assume that each I-node must be connected to a single L-node via a YA-node. The gold data contain 44 L-nodes that are aligned to multiple I-nodes. I am not sure whether this is a bug or a feature, but I think we can ignore such cases for now.
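The assignment step described above could look roughly like the following. This is a minimal sketch under my own assumptions, not the script itself: the node ids and texts are made up, the `{id: text}` input format is hypothetical, and `difflib` stands in for whichever similarity measure is selected. Only `scipy.optimize.linear_sum_assignment` is taken from the description.

```python
from difflib import SequenceMatcher

import numpy as np
from scipy.optimize import linear_sum_assignment


def similarity(a, b):
    """Stand-in similarity; any measure from the table could be plugged in."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_i2l(i_nodes, l_nodes):
    """Map each I-node id to one L-node id, using every L-node at most once.

    Both arguments are hypothetical {node_id: text} dictionaries.
    """
    i_ids, l_ids = list(i_nodes), list(l_nodes)
    # Rows are I-nodes, columns are L-nodes.
    sim = np.array(
        [[similarity(i_nodes[i], l_nodes[l]) for l in l_ids] for i in i_ids]
    )
    # Maximise the total similarity of the one-to-one assignment.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return {i_ids[r]: l_ids[c] for r, c in zip(rows, cols)}


def accuracy(predicted, gold):
    """Percentage of gold I-L pairings that were reproduced."""
    hits = sum(predicted.get(i) == l for i, l in gold.items())
    return 100.0 * hits / len(gold)


# Made-up example: two I-nodes, three L-nodes (one L-node stays unmatched).
i_nodes = {"I1": "the plan is too expensive", "I2": "taxes should be lower"}
l_nodes = {
    "L1": "Alice : taxes should be lower",
    "L2": "Bob : the plan is too expensive",
    "L3": "Alice : thank you",
}
gold = {"I1": "L2", "I2": "L1"}

predicted = align_i2l(i_nodes, l_nodes)
print(predicted)
print(accuracy(predicted, gold))
```

Because the assignment is one-to-one, a sketch like this cannot reproduce the 44 gold cases where one L-node is aligned to several I-nodes, and if a nodeset had more I-nodes than L-nodes, some I-nodes would stay unmatched.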
How to use the script for checking node alignments:
(1) Install SentenceTransformers (see requirements.txt) with
pip install sentence-transformers
(2) Install textdistance (see requirements.txt) with
pip install textdistance
(3) Run the script as follows:
$ python3 src/utils/align_i2l_nodes.py path_to_nodesets similarity_measure nodeset_id
For example:
$ python3 src/utils/align_i2l_nodes.py data cossim 17940
- `data` is the path to the dataset with the nodesets in JSON format
- `cossim` is the similarity measure to use; the examples here use `cossim` and `overlap`
- `17940` is the nodeset id (in this example, for `nodeset17940.json`)

Note: If no nodeset id is provided, the script computes the matches and mismatches between the I- and L-nodes for all nodesets in the directory.
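To make the optional nodeset id concrete, here is a hedged sketch of how such a command line could be parsed. It is not the script's actual argument handling: `build_parser` and `select_files` are hypothetical names, and the `nodeset*.json` glob simply assumes that all files follow the `nodeset<id>.json` pattern mentioned above.

```python
import argparse
from pathlib import Path


def build_parser():
    """Three positional arguments; the last one is optional."""
    parser = argparse.ArgumentParser(description="Align I-nodes to L-nodes (sketch).")
    parser.add_argument("path_to_nodesets", type=Path,
                        help="directory with the nodeset JSON files")
    parser.add_argument("similarity_measure",
                        help="name of the similarity measure, e.g. cossim")
    parser.add_argument("nodeset_id", nargs="?", default=None,
                        help="optional nodeset id; omit to process all nodesets")
    return parser


def select_files(directory, nodeset_id=None):
    """Return the single requested nodeset file, or all nodesets in the directory."""
    if nodeset_id is not None:
        return [directory / f"nodeset{nodeset_id}.json"]
    return sorted(directory.glob("nodeset*.json"))


# Parse an explicit argument list here instead of sys.argv, for illustration.
args = build_parser().parse_args(["data", "cossim", "17940"])
print(select_files(args.path_to_nodesets, args.nodeset_id))
```

With `nargs="?"`, leaving out the third argument sets `nodeset_id` to `None`, which in this sketch triggers the "all nodesets" branch.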
Note: This also adds `src/nodeset_utils.py` with some helper methods.