Tokenization divergences between UCCA and UD

UniversalConceptualCognitiveAnnotation / UCCA_English-EWT

http://bit.ly/UCCA_English-EWT


Tokenization divergences between UCCA and UD #2

Closed jakpra closed 5 years ago

jakpra commented 5 years ago

Some tokenizations don't match between UCCA and UD. E.g., in doc 020851, "Jack-s" (paragraph_position 18 in UCCA; sentence 2, tokens 13-14 in UD) is split into two tokens in UD but is a single token in UCCA. The same applies to doc 020992, "#2", paragraph_position 25; doc 059005, "Max's", paragraph_position 3; and doc 059416, "Fraiser's", paragraph_position 18.

This might be more complicated to fix, since it affects the annotations, but it would be nice if there were some way to make them match.
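
For anyone who wants to locate such mismatches automatically, here is a minimal sketch of one possible check. It assumes both tokenizations cover exactly the same text once whitespace is dropped, and it walks the two token lists in parallel by character length. The function name `find_divergences` and the example tokens are hypothetical, made up for illustration; they are not taken from the corpus files or from any UCCA or UD tooling.

```python
def find_divergences(tokens_a, tokens_b):
    """Return (group_a, group_b) pairs where two tokenizations split the text differently.

    Assumption: both token lists concatenate to the same string.
    """
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("tokenizations do not cover the same text")

    divergences = []
    i = j = 0
    while i < len(tokens_a) and j < len(tokens_b):
        # Start a new aligned group with one token from each side.
        group_a, group_b = [tokens_a[i]], [tokens_b[j]]
        len_a, len_b = len(tokens_a[i]), len(tokens_b[j])
        i += 1
        j += 1
        # Extend the shorter side until both groups cover the same characters.
        while len_a != len_b:
            if len_a < len_b:
                group_a.append(tokens_a[i])
                len_a += len(tokens_a[i])
                i += 1
            else:
                group_b.append(tokens_b[j])
                len_b += len(tokens_b[j])
                j += 1
        # A group with more than one token on either side is a mismatch.
        if len(group_a) > 1 or len(group_b) > 1:
            divergences.append((group_a, group_b))
    return divergences


# Hypothetical example: one token on the UCCA side, two on the UD side.
ucca_tokens = ["Max's", "dog"]
ud_tokens = ["Max", "'s", "dog"]
print(find_divergences(ucca_tokens, ud_tokens))
```

A check like this only reports where the splits differ. Deciding which side to change, and updating the affected annotations, would still have to be done separately.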

danielhers commented 5 years ago

@omriabnd wasn't the tokenization copied exactly from the STREUSLE files? For example, doc 020851 (task 1848 on UCCA-App) indeed treats "Jack-s" as one token, but there are two in UD. The old platform passage ID is 26238, and there, too, the token is "Jack-s".

danielhers commented 5 years ago

Fixed some of the differences. The following documents still require tokenization fixes:

While 057644 and 059386 are in the dev set, all others are in the train set. They will all be corrected once annotation is finished.

danielhers commented 5 years ago

Remaining mismatches are only in train passages: