oepen closed this issue 5 years ago
milan straka points out that there is also room for normalization of anchors in UCCA. because the framework in principle allows discontinuous anchors (unlike EDS), all units with multi-token anchors (e.g. a complex name like ‘Pierre Vinken’) are currently represented with multiple anchors. we should either collapse sequences of adjacent anchors (modulo whitespace) into one continuous span during conversion, or treat the two representations as equivalent in evaluation, i.e. compare anchors somewhat robustly.
I think the solution should be to normalize this in evaluation, that is, treat
"anchors": [{"from": 0, "to": 6}, {"from": 7, "to": 13}]}
the same as
"anchors": [{"from": 0, "to": 13}]}
agreed; i have sketched a candidate solution in #20, hence closing this issue (which started out primarily about EDS).
actually, now that we can represent normalized anchors as sets of character positions (active in the UCCA scorer as of tonight), we should use the same representation in EDM (and possibly optimize it to use a bit vector instead of a frozenset()).
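to make the idea concrete, a rough sketch of that representation (the helper names are illustrative, not the actual scorer API): each anchor is expanded into the set of character positions it covers, optionally skipping whitespace so that split and collapsed spans compare equal; the bit-vector variant packs the same information into a plain Python int.

```python
def anchor_positions(anchors, text=None):
    # expand anchors into the set of character positions they cover;
    # skipping whitespace makes adjacent and collapsed spans compare equal
    positions = set()
    for a in anchors:
        for i in range(a["from"], a["to"]):
            if text is None or not text[i].isspace():
                positions.add(i)
    return frozenset(positions)

def anchor_bits(anchors):
    # alternative: a bit vector packed into an int, one bit per character offset
    bits = 0
    for a in anchors:
        bits |= ((1 << (a["to"] - a["from"])) - 1) << a["from"]
    return bits

text = "Pierre Vinken, 61 years old"
assert anchor_positions([{"from": 0, "to": 6}, {"from": 7, "to": 13}], text) \
       == anchor_positions([{"from": 0, "to": 13}], text)
```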
anchors in EDS reflect ERG tokenization assumptions, such that punctuation marks (treated as pseudo-affixes in the ERG) are often included. conversely, hyphenated tokens are internally split, but anchoring in those cases still betrays the underlying PTB-style tokenization. for increased robustness in evaluation, the tool should provide a ‘--trim’ option or the like, to normalize towards common assumptions. in principle, we might then want to re-release the EDS training data with normalized anchors.
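as a rough illustration of what such a trimming option could do (a hypothetical helper, not an existing flag in the tool), each anchor is shrunk until it neither starts nor ends on whitespace or punctuation in the underlying string:

```python
import string

def trim_anchor(anchor, text):
    # shrink the anchor so it neither starts nor ends on whitespace or
    # punctuation, e.g. dropping an ERG-style trailing comma pseudo-affix
    start, end = anchor["from"], anchor["to"]
    junk = set(string.whitespace + string.punctuation)
    while start < end and text[start] in junk:
        start += 1
    while end > start and text[end - 1] in junk:
        end -= 1
    return {"from": start, "to": end}

trim_anchor({"from": 7, "to": 14}, "Pierre Vinken, 61 years old")
# -> {"from": 7, "to": 13}
```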