cfmrp / mtool

Software to Manipulate Different Flavors of Semantic Graphs
http://mrp.nlpl.eu
GNU Lesser General Public License v3.0

normalize EDS anchoring #4

Closed oepen closed 5 years ago

oepen commented 5 years ago

anchors in EDS reflect ERG tokenization assumptions, such that punctuation marks (treated as pseudo-affixes in the ERG) are often included. conversely, hyphenated tokens are internally split, but anchoring in those cases still betrays the underlying PTB-style tokenization. for increased robustness in evaluation, the tool should provide a ‘--trim’ option or the like, to normalize towards common assumptions. in principle, we might then want to re-release the EDS training data with normalized anchors.
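
For illustration, the kind of trimming proposed here could look roughly like the sketch below. It only covers the punctuation case, not the hyphenation case. The function name `trim_anchor`, the punctuation inventory, and the example sentence are assumptions made for this sketch and are not taken from mtool.

```python
import string

# Characters stripped from anchor edges. This inventory is an assumption
# made for illustration, not the set that mtool actually uses.
PUNCTUATION = frozenset(string.punctuation + "“”‘’…")


def trim_anchor(text, start, end):
    """Shrink the span [start, end) until neither edge is whitespace or punctuation."""
    while start < end and (text[start].isspace() or text[start] in PUNCTUATION):
        start += 1
    while end > start and (text[end - 1].isspace() or text[end - 1] in PUNCTUATION):
        end -= 1
    return start, end


# Hypothetical input: an ERG-style anchor that covers "Vinken," with the comma.
text = "Pierre Vinken, 61 years old, will join the board."
start, end = trim_anchor(text, 7, 14)
print(text[start:end])
```

A span that consists only of punctuation collapses to an empty span under this sketch; whether such anchors should be dropped or kept is a separate design decision.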

oepen commented 5 years ago

milan straka points out that there is also room for normalization of anchors in UCCA. because the framework in principle allows discontinuous anchors (unlike EDS), currently all units with multi-token anchors (e.g. a complex name like ‘Pierre Vinken’) are represented with multiple anchors. we should either collapse sequences of adjacent anchors (modulo whitespace) into one continuous span during conversion, or treat the two representations as equivalent in evaluation, i.e. compare anchors somewhat robustly.
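
The first option, collapsing adjacent anchors during conversion, might be sketched as follows. The helper name `collapse_anchors` is hypothetical; the anchor dictionaries follow the `from`/`to` layout used elsewhere in this thread, and the input string is needed to check that only whitespace separates two anchors.

```python
def collapse_anchors(text, anchors):
    """Merge anchors that are separated only by whitespace into continuous spans."""
    merged = []
    for anchor in sorted(anchors, key=lambda a: (a["from"], a["to"])):
        if merged:
            last = merged[-1]
            # Overlapping or touching anchors yield an empty gap, which also merges.
            gap = text[last["to"]:anchor["from"]]
            if gap.strip() == "":
                last["to"] = max(last["to"], anchor["to"])
                continue
        # Copy the anchor so that the caller's data is left untouched.
        merged.append(dict(anchor))
    return merged


text = "Pierre Vinken"
print(collapse_anchors(text, [{"from": 0, "to": 6}, {"from": 7, "to": 13}]))
```

Anchors with intervening non-whitespace material stay separate, so genuinely discontinuous units are preserved.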

danielhers commented 5 years ago

I think the solution should be to normalize this in evaluation, that is, treat

"anchors": [{"from": 0, "to": 6}, {"from": 7, "to": 13}]

the same as

"anchors": [{"from": 0, "to": 13}]

oepen commented 5 years ago

agreed; i have sketched a candidate solution in #20, hence closing this issue (which started out primarily about EDS).

oepen commented 5 years ago

actually, now that we can represent normalized anchors as sets of character positions (active in the UCCA scorer as of tonight), we should use the same representation in EDM (and possibly optimize it to use a bit vector instead of a frozenset()).
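
A minimal sketch of that representation, under stated assumptions: the function names are hypothetical, whitespace positions are excluded so that split and joined anchors compare equal, and a plain Python `int` stands in for the bit vector. None of this is taken from the mtool scorer itself.

```python
def anchor_positions(text, anchors):
    """Return the non-whitespace character positions covered by the anchors."""
    return frozenset(
        i
        for anchor in anchors
        # Clamp to the text length so that malformed anchors cannot raise.
        for i in range(anchor["from"], min(anchor["to"], len(text)))
        if not text[i].isspace()
    )


def anchor_bits(text, anchors):
    """Pack the same positions into an int, with bit i set for position i."""
    bits = 0
    for i in anchor_positions(text, anchors):
        bits |= 1 << i
    return bits


text = "Pierre Vinken"
split = [{"from": 0, "to": 6}, {"from": 7, "to": 13}]
joined = [{"from": 0, "to": 13}]
print(anchor_positions(text, split) == anchor_positions(text, joined))
print(anchor_bits(text, split) == anchor_bits(text, joined))
```

With the integer encoding, anchor overlap reduces to a bitwise `&` and equality to a plain integer comparison, which is the kind of optimization hinted at above.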