[description] nodeset pre-processing pipeline

Nodeset Pre-processing Pipeline

To prepare nodesets, we follow the steps specified in src/utils/prepare_nodeset.py:

1) Clean up the nodeset by removing isolated nodes and invalid transitions: get_valid_src_trg_and_node_ids_from_relations()
   Only the following transitions are allowed (see get_relations() for more detail):

   I > {MA, RA, CA} > I
   L > TA > L
   {L, TA} > YA > {I, L, MA, RA, CA}

2) Remove S- and YA-nodes with edges: remove_s_and_ya_nodes_with_edges()
3) Add dummy S- and YA-nodes with edges by matching L- and I-nodes based on the similarity measure: add_s_and_ya_nodes_with_edges()
   a. Align I and L nodes based on the similarity of their texts: align_i_and_l_nodes()
   b. Create S nodes and align them with TA nodes by mirroring TA relations between L nodes to the aligned I nodes (see 3a): create_s_relations_and_nodes_from_ta_nodes_and_il_alignment()
   c. Create YA nodes and relations from I-L and S-TA alignments: add_s_and_ya_nodes_with_edges()
4) Optionally, add cleaned gold data for training (from the output of step 1):
   a. Normalize the direction of the RA-relation nodes: normalize_ra_relation_direction()
   b. Update the text and type of the result relation nodes with matching gold data: get_node_matching()
   c. Add remaining nodes and edges from the gold data that were not matched: merge_other_into_nodeset()
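To make steps 1 and 3a/3b more concrete, here is a minimal, self-contained sketch. It is not the code in src/utils/prepare_nodeset.py: the function names (valid_triples, align_i_and_l, mirror_ta_relations), the greedy matching, the 0.5 threshold and the use of difflib.SequenceMatcher as the similarity measure are all assumptions made for illustration. It also assumes nodes and edges are dicts with nodeID/type/text and fromID/toID keys.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product

# Step 1: allowed (source type, relation type, target type) triples,
# transcribed from the transition list in the description.
ALLOWED = (
    set(product({"I"}, {"MA", "RA", "CA"}, {"I"}))
    | set(product({"L"}, {"TA"}, {"L"}))
    | set(product({"L", "TA"}, {"YA"}, {"I", "L", "MA", "RA", "CA"}))
)
RELATION_TYPES = {rel for _, rel, _ in ALLOWED}


def valid_triples(nodes, edges):
    """Return (src_id, rel_id, trg_id) triples that form an allowed transition."""
    node_type = {n["nodeID"]: n["type"] for n in nodes}
    incoming, outgoing = defaultdict(list), defaultdict(list)
    for e in edges:
        outgoing[e["fromID"]].append(e["toID"])
        incoming[e["toID"]].append(e["fromID"])
    triples = []
    for rel_id, rel_type in node_type.items():
        if rel_type not in RELATION_TYPES:
            continue
        for src in incoming.get(rel_id, []):
            for trg in outgoing.get(rel_id, []):
                if (node_type.get(src), rel_type, node_type.get(trg)) in ALLOWED:
                    triples.append((src, rel_id, trg))
    return triples


def align_i_and_l(i_texts, l_texts, threshold=0.5):
    """Step 3a (assumed variant): map each I-node id to its most similar L-node id."""
    alignment = {}
    for i_id, i_text in i_texts.items():
        best_l, best_score = None, threshold
        for l_id, l_text in l_texts.items():
            score = SequenceMatcher(None, i_text.lower(), l_text.lower()).ratio()
            if score > best_score:
                best_l, best_score = l_id, score
        if best_l is not None:
            alignment[i_id] = best_l
    return alignment


def mirror_ta_relations(ta_triples, alignment):
    """Step 3b (assumed variant): for each L > TA > L triple, propose an
    S relation between the I nodes aligned with the two L nodes."""
    l2i = defaultdict(list)
    for i_id, l_id in alignment.items():
        l2i[l_id].append(i_id)
    s_relations = []
    for l_src, ta_id, l_trg in ta_triples:
        for i_src in l2i.get(l_src, []):
            for i_trg in l2i.get(l_trg, []):
                # Keep the TA id so the new S node can be aligned with it.
                s_relations.append((i_src, ta_id, i_trg))
    return s_relations
```

In this sketch, a node that appears in no returned triple is effectively dropped, which covers both isolated nodes and nodes that are only reachable through invalid transitions. The mirrored S relations simply copy the direction of the TA relation; whether the real pipeline keeps that direction is not stated here.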

Graphical representation of the nodeset pre-processing pipeline:

The same drawing with clickable links (to see the corresponding methods) can be found here

After these pre-processing steps, we prepare a SimplifiedDialAM2024Document for each nodeset in convert_to_document():

1) Create document text and L-node-spans: link
2) Encode YA relations between I and L nodes (ya_i2l_nodes NaryRelation): link
3) Encode S relations between I nodes (s_nodes NaryRelation): link
4) Encode YA relations between S and TA nodes (ya_s2ta_nodes NaryRelation): link
5) Add all original data as metadata: link
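The first conversion step (document text plus L-node spans) can be pictured roughly as below. This is only a sketch: the Span dataclass, the build_text_and_spans helper and the newline separator are hypothetical and do not reflect the actual fields of SimplifiedDialAM2024Document or the body of convert_to_document().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Character span of one L node inside the document text (hypothetical type)."""
    start: int
    end: int
    node_id: str


def build_text_and_spans(l_nodes, separator="\n"):
    """Concatenate L-node texts and record where each node's text ends up."""
    parts, spans, offset = [], [], 0
    for node in l_nodes:
        text = node["text"]
        if parts:
            # Account for the separator inserted before every non-first part.
            offset += len(separator)
        spans.append(Span(offset, offset + len(text), node["nodeID"]))
        parts.append(text)
        offset += len(text)
    return separator.join(parts), spans
```

With spans like these, each relation in the later steps can refer to L nodes by their position in the text, and slicing the text with a span's start and end gives back the original L-node text.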

ArneBinder / dialam-2024-shared-task

[description] nodeset pre-processing pipeline #38
