Check if UCCA alignments are off by one

alexanderkoller commented 5 years ago

In some UCCA sentences (e.g. 291046-0001), we suspect that the alignments may be off by one. Check this and fix if needed.

mariomgmn commented 5 years ago

Yes, some of the alignments are off by one. This doesn't appear to be systematic. I'll fix it as soon as possible.

alexanderkoller commented 5 years ago

From @mariomgmn's email: "I think the problem with the weird alignments is coming from the function node_to_token_index which can be found in a_star_mrp.py which is then called in both convert_training_into_alto_corpus.py and convert_training_into_alto_corpus_w_edgeraising.py."

The following example exhibits the off-by-one problem (graph attached).

291046-0001
1
ucca
0.9
2019-05-18 (06:33)
0:4 5:7 8:16 17:20 21:24
Hams on Friendly ... RIP
hams on friendly ... rip
2!||3||1.0 5|1!||2||1.0 0!||1||1.0 7|4!||0||1.0 3!||4||1.0 
[5/Non-Terminal -A-> 2/friendly; 5 -S-> 1/on; 5 -A-> 0/hams; 7<root>/Non-Terminal -A-> 5; 7<root> -S-> 4/rip; 7<root> -U-> 3/...]

I think what may be happening here is that the first node in the graph is anchored in two spans, "ham" and "s", whereas the first token in the tokenized string is the single token "hams". The Python code therefore won't create an entry in node_to_index for node 0 in line 54 of a_star_mrp.py, because no token matches the first span of the node. I don't quite understand the rest of the Python code, but it seems plausible that this would throw off the alignments.

Later on in the same function, you merge the anchoring spans of a node with each other (lines 63 ff.). Why don't you also do that in the first iteration?
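
To make the suggestion concrete, here is a minimal sketch of merging a node's anchoring spans before matching them against the token spans. This is not the code in a_star_mrp.py: the names merge_spans and align_node are hypothetical, and the two anchors for node 0 ((0, 3) for "ham" and (3, 4) for "s") are my assumption about what the graph contains. Only the token spans are taken from the example above.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge spans that touch or overlap, e.g. (0, 3) and (3, 4) -> (0, 4)."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def align_node(anchors: List[Span], token_spans: List[Span]) -> Optional[int]:
    """Return the index of the token whose span equals the merged anchor, or None."""
    merged = merge_spans(anchors)
    if len(merged) != 1:
        return None  # the node still covers several separate stretches of text
    try:
        return token_spans.index(merged[0])
    except ValueError:
        return None  # no token has exactly this span


# Token spans from the example above; the anchors per node are assumed.
token_spans = [(0, 4), (5, 7), (8, 16), (17, 20), (21, 24)]
node_anchors = {0: [(0, 3), (3, 4)], 1: [(5, 7)], 2: [(8, 16)]}

alignment = {node: align_node(a, token_spans) for node, a in node_anchors.items()}
print(alignment)  # {0: 0, 1: 1, 2: 2}
```

Without the merge step, neither (0, 3) nor (3, 4) equals the token span (0, 4), so node 0 would be left without an entry.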

Also, the function should probably be checked for places that assume that (a) all leaf nodes have been aligned to tokens in the first iteration (in this example, they were not) and (b) each leaf node is anchored in a single span (which node 0 is not).
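
One way to find such places might be a small sanity check that runs after the first iteration and reports every leaf node that violates (a) or (b). The sketch below is only an illustration: check_leaf_alignments and its two dictionary arguments are hypothetical and do not correspond to the actual data structures in a_star_mrp.py, and the anchors for node 0 are again assumed.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def check_leaf_alignments(
    leaf_anchors: Dict[int, List[Span]],
    node_to_token: Dict[int, int],
) -> List[str]:
    """Collect violations of assumptions (a) and (b) instead of failing silently."""
    problems = []
    for node, anchors in sorted(leaf_anchors.items()):
        # (a) every leaf node should have been aligned to a token
        if node not in node_to_token:
            problems.append(f"node {node}: leaf not aligned to any token")
        # (b) every leaf node should be anchored in exactly one span
        if len(anchors) != 1:
            problems.append(f"node {node}: anchored in {len(anchors)} spans")
    return problems


# Hypothetical state after the first iteration for the example above.
leaf_anchors = {0: [(0, 3), (3, 4)], 1: [(5, 7)]}
node_to_token = {1: 1}

for problem in check_leaf_alignments(leaf_anchors, node_to_token):
    print(problem)
```

For this input the check reports node 0 twice, once for each assumption, and stays silent about node 1.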

tmpv1mvbv4qpdf.pdf

alexanderkoller commented 5 years ago

This works now, and the fix improved decomposability to 85%.

coli-saar / am-parser

Check if UCCA alignments are off by one #65