UniversalConceptualCognitiveAnnotation / UCCA_English-EWT

http://bit.ly/UCCA_English-EWT

Tokenization divergences between UCCA and UD #2

Closed jakpra closed 5 years ago

jakpra commented 5 years ago

Some tokenizations don't match between UCCA and UD. E.g., in doc 020851, "Jack-s" (paragraph_position 18 in UCCA; sentence 2, tokens 13-14 in UD) is split into two tokens in UD but is a single token in UCCA. The same holds for doc 020992, "#2", paragraph_position 25; doc 059005, "Max's", paragraph_position 3; and doc 059416, "Fraiser's", paragraph_position 18.

This may be harder to fix because it affects the annotations, but it would be nice if there were some way to make them match.
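
For illustration, a minimal sketch of how divergences like these could be located automatically, assuming the UCCA and UD tokens of a passage are already available as two plain Python lists. The function name `token_divergences` and the example sentence are hypothetical and are not taken from the corpus or from any existing UCCA or UD tooling; the alignment uses `difflib.SequenceMatcher` from the standard library.

```python
from difflib import SequenceMatcher


def token_divergences(ucca_tokens, ud_tokens):
    """Return one tuple per span where the two tokenizations differ.

    Each tuple is (ucca_start, ucca_span, ud_start, ud_span), with
    0-based start indices into the respective token lists.
    """
    matcher = SequenceMatcher(a=ucca_tokens, b=ud_tokens, autojunk=False)
    divergences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "equal" blocks are tokens both tokenizations agree on; skip them.
        if tag != "equal":
            divergences.append((i1, ucca_tokens[i1:i2], j1, ud_tokens[j1:j2]))
    return divergences


# Hypothetical token lists, made up to mimic the "Jack-s" case:
# one token on the UCCA side, two tokens on the UD side.
ucca_example = ["went", "to", "Jack-s", "place"]
ud_example = ["went", "to", "Jack", "-s", "place"]

for ucca_start, ucca_span, ud_start, ud_span in token_divergences(ucca_example, ud_example):
    print(f"UCCA[{ucca_start}] {ucca_span} vs. UD[{ud_start}] {ud_span}")
```

Note that the reported indices are 0-based list offsets, so they would still need to be mapped to paragraph_position values or UD token IDs.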

danielhers commented 5 years ago

@omriabnd wasn't the tokenization copied exactly from the STREUSLE files? For example, doc 020851 (task 1848 on UCCA-App) indeed treats "Jack-s" as one token, but there are two in UD. The old platform passage ID is 26238, and there, too, the token is "Jack-s".

danielhers commented 5 years ago

Fixed some of the differences. The following documents still require tokenization fixes:

020851
020992
057644
059005
059386
059416
200957
210066
211797
216456
217359
360937
399348

While 057644 and 059386 are in the dev set, all others are in the train set. They will all be corrected once annotation is finished.

danielhers commented 5 years ago

Remaining mismatches are only in train passages:

020851
020992
059005
059416
200957
210066
211797
216456
217359
360937
399348