Closed oepen closed 5 years ago
during our call today, @danielhers came up with an alternative idea to making the MCES search over UCCA graphs more efficient, viz. to take better advantage of anchoring information when determining the initial solution. as i understand it, one could transitively project upwards in the graph the yields of all primary children, which would effectively associate each node with a set of character positions into the input string that the node dominates. using that information, one could then initialize the search (in initial_node_correspondences()
) with a node-to-node correspondence maximizing character overlap across all nodes. i am thinking sets of character positions here for robustness to both tokenization mismatches but also to mis-attached or un-attached children.
Before I did anything, I got these times on my computer:
$ time python3 -u ./main.py --n 100 --trace --trace --read mrp --score mces --gold data/score/ucca/ewt.gold.mrp data/score/ucca/ewt.tupa.mrp > tupa.log
real 9m29.177s
user 9m28.505s
sys 0m0.220s
$ grep all\": tupa.log
"all": {"g": 3235, "s": 3061, "c": 1678, "p": 0.548186867036916, "r": 0.5187017001545595, "f": 0.5330368487928843},
$ grep 500001 tupa.log |wc -l
46
And after some minor optimizations:
$ time python3 -u ./main.py --n 100 --trace --trace --read mrp --score mces --gold data/score/ucca/ewt.gold.mrp data/score/ucca/ewt.tupa.mrp > tupa.log
real 6m33.618s
user 6m33.307s
sys 0m0.145s
$ grep all\": tupa.log
"all": {"g": 3235, "s": 3061, "c": 1678, "p": 0.548186867036916, "r": 0.5187017001545595, "f": 0.5330368487928843},
$ grep 500001 tupa.log |wc -l
46
And with some anchoring heuristics:
$ time python3 -u ./main.py --n 100 --trace --trace --read mrp --score mces --gold data/score/ucca/ewt.gold.mrp data/score/ucca/ewt.tupa.mrp > tupa.log
real 6m1.426s
user 6m1.090s
sys 0m0.140s
$ grep all\": tupa.log
"all": {"g": 3235, "s": 3061, "c": 2052, "p": 0.6703691604050964, "r": 0.6343122102009273, "f": 0.6518424396442186},
$ grep 500001 tupa.log |wc -l
43
i am watching this thread with great interest :-).
the anchoring heuristics have yet to prove their value then? i was expecting i potentially dramatic improvement here. from the tracing of the search, does it (a) find its initial solution immediately (it should)? and does (b) it find additional solutions after that, within the limit?
This was the situation this morning:
$ time python3 -u ./main.py --n 100 --trace --trace --read mrp --score mces --gold data/score/ucca/ewt.gold.mrp data/score/ucca/ewt.tupa.mrp > tupa.log
real 5m35.491s
user 5m35.327s
sys 0m0.090s
$ grep all\": tupa.log
"all": {"g": 3235, "s": 3061, "c": 2052, "p": 0.6703691604050964, "r": 0.6343122102009273, "f": 0.6518424396442186},
$ grep 500001 tupa.log |wc -l
43
After implementing the first suggestion in this thread,
$ time python3 -u ./main.py --n 100 --trace --trace --read mrp --score mces --gold data/score/ucca/ewt.gold.mrp data/score/ucca/ewt.tupa.mrp > tupa.log
real 5m45.722s
user 5m45.599s
sys 0m0.101s
$ grep all\": tupa.log
"all": {"g": 3235, "s": 3061, "c": 2055, "p": 0.6713492322770337, "r": 0.6352395672333848, "f": 0.6527954256670903},
$ grep 500001 tupa.log |wc -l
40
so I commented it out for now.
hmm, that looks a bit underwhelming in terms of improved speed. i looked over the implementation, and it seems to do just what i had in mind. i find it very unintuitive, though, there should be no visible effect on pruning. have you confirmed that the trst actually does knock out some paths?
I haven't confirmed it yet. Still need to learn how to figure that out from the logs. tupa.log
i print()ed the dictionaries of transitive dominance relations (aka children) return from identify() (in 'ucca.py') and think that revealed a bug, which i believe i have fixed. i have not tested thoroughly, but on the first 20 items (with a limit of 100000 pairing attempts), i see a reduction in inexact solutions and a moderate speed-up, hence have tentatively turned on the dominance-based pruning (but remain surprised the gains seem comparatively limited).
from the in-depth analysis in #32, another structural constraint one could consider is about the special status of leaves, e.g. disallow correspondences of two nodes either based on incompatible out-degrees (both should be zero, or both non-zero) or presence of anchoring (both should be anchored, or both unanchored)?
Yes, it does seem to help. Still incremental, though.
$ time python3 -u ./main.py --n 100 --trace --trace --read mrp --score mces --gold data/score/ucca/ewt.gold.mrp data/score/ucca/ewt.tupa.mrp > tupa.log
real 5m30.150s
user 5m30.005s
sys 0m0.032s
$ grep all\": tupa.log
"all": {"g": 3235, "s": 3061, "c": 2090, "p": 0.6827834041163019, "r": 0.6460587326120556, "f": 0.6639135959339264},
$ grep 500001 tupa.log |wc -l
28
i believe we have come to view the dominance- and leaf-based structural constraints a desirable property in the MRP metric (and its search for node-to-node correspondences).
the MCES search for node-to-node correspondences is comparatively slow for UCCA, presumably because there is less node-local information to find a good initial estimate. #15 helped a lot for the bi-lexical graphs, so i am wondering whether one could do something similar for UCCA: for example, not pursue correspondences of nodes x and y in case there is a node i dominated by x whose correspondence is not dominated by y?
we have arguably made some progress in the past 24 hours, though my latest optimization appears to have reduced the number of inexact solutions but lowered the total number of matches:
Similarly, for my standard (87-item) EDS and (200-item) DM benchmarks: