UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

xpos GW is on first token #394

Closed amir-zeldes closed 1 year ago

amir-zeldes commented 1 year ago

I noticed EWT has the tag GW for the first token of a goeswith chain. I wouldn't normally advocate changing xpos tags, but this seems to be an idiosyncrasy of EWT, since GW does not exist in other LDC corpora (ON, PTB, etc.). It leads to an odd inconsistency: the first token carries the correct lemma, upos and FEATS as expected, but the 'real' xpos (RB, NN, etc.) of the combination is on the second item (which has lemma '_').

Should we swap all GW tags to be on the second item, with the real tag being moved back to the first item, like the lemma?
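For illustration only, a minimal sketch of what such a swap could look like on one CoNLL-U sentence. The function name `swap_gw_xpos` and the example rows are hypothetical (the rows are made up, not taken from EWT), and the sketch assumes that every goeswith dependent attaches to the first token of its chain and that the last dependent carries the 'real' xpos.

```python
# CoNLL-U column indices (0-based): ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
ID, XPOS, HEAD, DEPREL = 0, 4, 6, 7


def swap_gw_xpos(sentence_lines):
    """Move the 'real' XPOS of each goeswith chain onto its first token.

    Hypothetical helper: assumes the chain head is the first token and that
    the last goeswith dependent currently holds the real tag.
    """
    rows = [line.split("\t") for line in sentence_lines
            if line and not line.startswith("#")]
    by_id = {row[ID]: row for row in rows}

    # Group goeswith dependents by the ID of the token they attach to.
    chains = {}
    for row in rows:
        if row[DEPREL] == "goeswith":
            chains.setdefault(row[HEAD], []).append(row)

    for head_id, dependents in chains.items():
        head = by_id[head_id]
        last = dependents[-1]  # rows are in file order, so this is the final item
        if head[XPOS] == "GW" and last[XPOS] != "GW":
            head[XPOS], last[XPOS] = last[XPOS], "GW"

    return ["\t".join(row) for row in rows]


# Made-up two-token chain: "with out" standing for "without".
example = [
    "\t".join(["1", "with", "without", "ADP", "GW", "_", "3", "case", "_", "_"]),
    "\t".join(["2", "out", "_", "X", "IN", "_", "1", "goeswith", "_", "_"]),
]
swapped = swap_gw_xpos(example)
for line in swapped:
    print(line)
```

After the swap, the first token would carry the lemma, upos and xpos together, and every later item in the chain would be tagged GW.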

nschneid commented 1 year ago

I understood XPOS to follow the PTB standard, and that should include treating the last goeswith item as the "real" one, right? I agree the mismatch is a bit awkward, but changing it would create an incompatibility with other PTB resources.

amir-zeldes commented 1 year ago

I don't see GW in PTB or OntoNotes, so I think it might just be EWT. Do you know another corpus that uses it?

nschneid commented 1 year ago

Switchboard, apparently: p. 10 of https://catalog.ldc.upenn.edu/docs/LDC2012T13/WebtextTBAnnotationGuidelines.pdf

amir-zeldes commented 1 year ago

Cool, I didn't know about that! Although, what does GW actually mean for Switchboard? Isn't it all spoken data?

nschneid commented 1 year ago

No idea. But presumably GW could appear in future LDC datasets as well.

amir-zeldes commented 1 year ago

OK, let's leave it alone then.