ArneBinder / dialam-2024-shared-task

see http://dialam.arg.tech/

Nodesets with loops #15

Closed (tanikina closed 2 months ago)

tanikina commented 2 months ago

This adds a check for loops, i.e., cases where a node can be reached from itself via a path A -> ... -> A. For example, nodeset 18321 contains such a loop with multiple L-nodes.

Because of such loops, the DFS never adds the affected nodes to the stack (because of the check for unvisited children). As a result, these nodes are missing from the final nodeset after calling sort_nodes_by_hierarchy(). This version collects all such nodes and adds them to the stack, so that they are processed once all the leaves have been processed.
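
The failure mode and the fix can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: `find_loop_nodes` and `sort_leaves_first` are hypothetical names, and the traversal is a simplified leaves-first ordering that only assumes a parent is pushed once all of its children have been visited.

```python
from collections import defaultdict


def find_loop_nodes(edges):
    """Return all nodes that can reach themselves (A -> ... -> A), incl. self loops."""
    children = defaultdict(set)
    nodes = set()
    for src, trg in edges:
        children[src].add(trg)
        nodes.update((src, trg))

    loop_nodes = set()
    for start in nodes:
        # Iterative DFS from the children of `start`; reaching `start` again means a loop.
        stack = list(children.get(start, ()))
        seen = set()
        while stack:
            node = stack.pop()
            if node == start:
                loop_nodes.add(start)
                break
            if node not in seen:
                seen.add(node)
                stack.extend(children.get(node, ()))
    return loop_nodes


def sort_leaves_first(edges, seed_loop_nodes):
    """Order nodes from the leaves upwards (hypothetical, simplified traversal)."""
    children, parents = defaultdict(set), defaultdict(set)
    nodes = set()
    for src, trg in edges:
        children[src].add(trg)
        parents[trg].add(src)
        nodes.update((src, trg))

    # Start with the leaves, i.e. nodes without outgoing edges.
    stack = sorted(n for n in nodes if not children.get(n))
    if seed_loop_nodes:
        # Loop nodes go to the bottom of the stack, so they are popped after the leaves.
        stack = sorted(find_loop_nodes(edges)) + stack

    visited, order = set(), []
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        for parent in sorted(parents.get(node, ())):
            # A parent is only pushed once none of its children is unvisited.
            if parent not in visited and children[parent] <= visited:
                stack.append(parent)
    return order


# Toy graph: A -> B, B <-> C (loop), C -> D (leaf).
edges = [("A", "B"), ("B", "C"), ("C", "B"), ("C", "D")]
print(sorted(find_loop_nodes(edges)))                 # ['B', 'C']
print(sort_leaves_first(edges, seed_loop_nodes=False))  # ['D']
print(sort_leaves_first(edges, seed_loop_nodes=True))   # ['D', 'C', 'B', 'A']
```

Without seeding, C always has the unvisited child B (and vice versa), so neither is ever pushed and A is never reached either. Seeding the stack with the detected loop nodes breaks this deadlock.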

 $ python3 src/utils/nodeset2document.py --input_dir=data/train_dialam --nodeset_blacklist="24255, 24807, 24808, 24809, 24903, 24905, 24992, 25045, 25441, 25442, 25443, 25444, 25445, 25452, 25461, 25462, 25463, 25465, 25468, 25472, 25473, 25474, 25475, 21083, 18888, 23701, 18484, 17938, 19319, 25411, 25510, 25516, 25901, 25902, 25904, 25906, 25907, 25936, 25937, 25938, 25940, 26066, 26067, 26068, 26087, 17964, 18459, 19091, 19146, 19149, 19757, 19761, 19908, 21449, 23749, 25552, 19165, 22969, 21342, 25400, 21681, 23710, 19059, 19217, 19878, 20479, 20507, 20510, 20766, 20844, 20888, 20992, 21401, 21477, 21588, 23114, 23766, 23891, 19911"

Output before fix:

INFO:src.utils.nodeset_utils:Successfully processed 1366 nodesets (0 blacklisted).
Failed to process the following nodesets (34):
[('23837', KeyError('827426')), ('23892', KeyError('832523')), ('23799', KeyError('823663')), ('23809', KeyError('824537')), 
('19918', KeyError('642964')), ('25511', KeyError('605086')), ('19773', KeyError('633797')), ('18321', KeyError('543183')), 
('23391', KeyError('839247')), ('18877', KeyError('571892')), ('23688', KeyError('775764')), ('21275', KeyError('704307')), 
('23517', KeyError('776007')), ('19173', KeyError('595414')), ('19897', KeyError('641117')), ('21279', KeyError('704681')), 
('21022', KeyError('690585')), ('23849', KeyError('828412')), ('19174', KeyError('595605')), ('25528', KeyError('600711')), 
('23552', KeyError('691976')), ('21039', KeyError('693222')), ('23959', KeyError('840484')), ('20894', KeyError('685104')), 
('25691', KeyError('1027384')), ('23551', KeyError('767679')), ('20729', KeyError('674582')), ('23120', KeyError('665292')), 
('23560', KeyError('768013')), ('23144', KeyError('692819')), ('18874', KeyError('571629')), ('21023', KeyError('691976')), 
('18795', KeyError('566010')), ('23599', KeyError('766010'))]

EDIT: We get 34 failed nodesets if we don't check for any loops at all. If we only check for self loops (as in the linked code), we get 19 failed nodesets (and 1381 successfully processed nodesets).

Output after fix:

INFO:src.utils.nodeset_utils:Successfully processed 1400 nodesets (0 blacklisted).
Failed to process the following nodesets (0): []

visualize_arg_map.py now also displays nodesets with loops correctly. Below is an example for nodeset 25511.

Visualization before fix: nodeset25511 gv

$ python src/visualization/visualize_arg_map.py data/train/ data/visualizations 25511

Console output before fix:
```console
nodeset=25511: Missed I-nodes: {'605136', '605093', '605088', '605123', '605107', '605112', '605102', '605132', '605097', '605127'}
nodeset=25511: Missed L-site nodes: {'605126', '1020390', '605101', '605111', '605095', '1020387', '1020405', '1020403', '1020401', '605105', '1020392', '605130', '1020400', '1020404', '1020394', '605086', '1020412', '605121', '605091', '605135', '1020388', '1020386'}
nodeset=25511: Missed I-site nodes: {'605136', '1020413', '605093', '605088', '605123', '605132', '605102', '1020419', '1020414', '1020418', '1020420', '1020425', '605107', '605112', '1020421', '1020415', '1020416', '605097', '1020422', '1020417', '1020423', '605127'}
nodeset=25511: Missed YA nodes: {'1020443', '1020426', '1020448', '1020445', '1020430', '1020435', '1020446', '1020440', '1020439', '1020450', '1020441', '1020438', '1020442', '1020427', '1020447', '1020434', '1020432', '1020444', '1020429', '1020428', '1020433', '1020431'}
nodeset=25511: Missed edges: {('605126', '1020433'), ('605112', '1020417'), ('1020425', '605127'), ('1020390', '1020441'), ('1020417', '605107'), ('605105', '1020392'), ('605097', '1020413'), ('1020414', '605088'), ('1020418', '605102'), ('1020440', '1020415'), ('1020386', '605098'), ('1020438', '1020413'), ('1020446', '1020421'), ('605130', '1020405'), ('1020413', '605093'), ('1020387', '605091'), ('1020450', '1020425'), ('605126', '1020403'), ('1020388', '605101'), ('1020419', '605112'), ('1020421', '605136'), ('605130', '1020434'), ('605126', '1020412'), ('1020428', '605097'), ('1020426', '605088'), ('1020434', '605132'), ('1020394', '1020443'), ('1020421', '605127'), ('605111', '1020400'), ('1020392', '605111'), ('605093', '1020414'), ('1020412', '605144'), ('1020405', '605135'), ('1020448', '1020423'), ('1020441', '1020416'), ('1020442', '1020417'), ('605121', '1020401'), ('1020443', '1020418'), ('605095', '1020388'), ('1020400', '1020444'), ('1020423', '605132'), ('605107', '1020416'), ('605086', '1020387'), ('605132', '1020421'), ('1020420', '605123'), ('1020401', '605126'), ('605095', '1020428'), ('605091', '1020386'), ('605102', '1020415'), ('1020439', '1020414'), ('605101', '1020429'), ('1020415', '605097'), ('1020429', '605102'), ('1020447', '1020422'), ('1020400', '605121'), ('605127', '1020420'), ('1020386', '605095'), ('605086', '1020426'), ('1020416', '605102'), ('605135', '1020435'), ('1020435', '605136'), ('1020392', '1020442'), ('605135', '1020403'), ('1020403', '605130'), ('1020387', '1020439'), ('1020422', '605136'), ('1020386', '1020438'), ('1020412', '1020450'), ('1020427', '605093'), ('605091', '1020427'), ('605136', '1020423'), ('1020432', '605123'), ('605101', '1020390'), ('605112', '1020418'), ('1020405', '1020448'), ('1020404', '1020447'), ('605101', '1020394'), ('605145', '1020425'), ('605135', '1020404'), ('1020388', '1020440'), ('1020390', '605105'), ('1020404', '605138'), ('1020401', '1020445'), ('1020430', '605107'), ('1020444', '1020419'), ('605121', '1020432'), ('1020445', '1020420'), ('605111', '1020431'), ('1020431', '605112'), ('1020433', '605127'), ('605105', '1020430'), ('605123', '1020419'), ('1020403', '1020446'), ('1020394', '605111'), ('605139', '1020422')}
```

Visualization after fix: nodeset25511 gv

Console output after fix:
```console
nodeset=25511: I-nodes order mismatch: ['605132', '605145', '605088', '605112', '605097', '605093', '605127', '605139', '605123', '605102', '605107', '605136'] != ['605088', '605093', '605097', '605102', '605107', '605112', '605123', '605127', '605132', '605136', '605139', '605145']
```

Note that the size of the training set has changed from 1381 to 1400. The blacklist and the test in tests/dataset_builders/pie/test_dialam2024.py have been updated accordingly.

ArneBinder commented 2 months ago

Remaining warnings regarding TA-loops:

nodeset_id=18321: Detected loop nodes: {'543218', '543226', '543222'}
nodeset_id=18795: Detected loop nodes: {'566024'}
nodeset_id=18874: Detected loop nodes: {'571641', '571658', '571679', '571673', '571714', '571683', '571653', '571646', '571688', '571663', '571668', '571636', '571631', '571629'}
nodeset_id=18877: Detected loop nodes: {'571917', '571930', '571923'}
nodeset_id=19173: Detected loop nodes: {'595568', '595573'}
nodeset_id=19174: Detected loop nodes: {'595636', '595696', '595648', '595640'}
nodeset_id=19773: Detected loop nodes: {'633816'}
nodeset_id=19897: Detected loop nodes: {'641279', '641301', '641289', '641269', '641257', '641317', '641250'}
nodeset_id=19918: Detected loop nodes: {'642982', '642991', '642987'}
nodeset_id=20729: Detected loop nodes: {'674593', '674586', '674582'}
nodeset_id=20894: Detected loop nodes: {'685146'}
nodeset_id=21022: Detected loop nodes: {'690585'}
nodeset_id=21023: Detected loop nodes: {'692003', '691998', '691979', '691976', '691991'}
nodeset_id=21039: Detected loop nodes: {'693222'}
nodeset_id=21275: Detected loop nodes: {'704348'}
nodeset_id=21279: Detected loop nodes: {'704689'}
nodeset_id=23120: Detected loop nodes: {'665292', '665296'}
nodeset_id=23144: Detected loop nodes: {'693005', '693036', '693021', '692990', '692968', '692947'}
nodeset_id=23391: Detected loop nodes: {'839308'}
nodeset_id=23479: Detected loop nodes: {'794844'}
nodeset_id=23517: Detected loop nodes: {'775891', '776007', '798681'}
nodeset_id=23533: Detected loop nodes: {'799381'}
nodeset_id=23551: Detected loop nodes: {'801243'}
nodeset_id=23552: Detected loop nodes: {'767719', '801330', '801302'}
nodeset_id=23560: Detected loop nodes: {'802482', '802475'}
nodeset_id=23599: Detected loop nodes: {'806770'}
nodeset_id=23688: Detected loop nodes: {'860200'}
nodeset_id=23696: Detected loop nodes: {'817050'}
nodeset_id=23789: Detected loop nodes: {'820709'}
nodeset_id=23799: Detected loop nodes: {'823663', '823659'}
nodeset_id=23809: Detected loop nodes: {'824652'}
nodeset_id=23837: Detected loop nodes: {'827426'}
nodeset_id=23849: Detected loop nodes: {'828443'}
nodeset_id=23853: Detected loop nodes: {'828982'}
nodeset_id=23878: Detected loop nodes: {'831523'}
nodeset_id=23892: Detected loop nodes: {'832595', '832589'}
nodeset_id=23959: Detected loop nodes: {'840574', '840569', '840579'}
nodeset_id=25511: Detected loop nodes: {'605130', '605135'}
nodeset_id=25526: Detected loop nodes: {'601506'}
nodeset_id=25528: Detected loop nodes: {'600819'}
nodeset_id=25691: Detected loop nodes: {'1027474', '1027462', '1027457', '775927', '775933', '1027467'}
nodeset_id=25723: Detected loop nodes: {'816292'}