Closed by amir-zeldes 1 year ago
I understood XPOS to follow the PTB standard, and that should include treating the last goeswith item as the "real" one, right? I agree the mismatch is a bit awkward but changing it would create an incompatibility with other PTB resources.
I don't see GW in PTB or OntoNotes, so I think it might just be EWT. Do you know another corpus that uses it?
Switchboard, apparently: p. 10 of https://catalog.ldc.upenn.edu/docs/LDC2012T13/WebtextTBAnnotationGuidelines.pdf
Cool, I didn't know about that! Although, what does GW actually mean for Switchboard? Isn't it all spoken data?
No idea. But presumably GW could appear in future LDC datasets as well.
OK, let's leave it alone then.
I noticed EWT has the tag GW for the first token of a goeswith chain. I wouldn't normally advocate changing xpos tags, but it seems to me that this is an idiosyncrasy of EWT, since GW does not exist in other LDC corpora (ON, PTB etc.). This leads to an odd inconsistency: the first token carries the lemma, upos and FEATS as expected, but the 'real' xpos (RB, NN, etc.) of the combination is on the second item (which has lemma '_'). Should we swap all GW tags to be on the second item, with the real tag being moved back to the first item, like the lemma?
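For concreteness, here is a rough sketch of what the proposed swap could look like on CoNLL-U token lines. It is only an illustration under assumptions: the function name `swap_gw_xpos` is made up, the two example rows are hypothetical and not copied from EWT, and the input is assumed to contain only plain token lines (no comment lines, multiword token ranges, or empty nodes) for a single sentence.

```python
def swap_gw_xpos(sentence_lines):
    """Move the 'real' XPOS onto the head of each goeswith chain.

    Expects 10-column, tab-separated CoNLL-U token lines for one sentence.
    """
    rows = [line.split("\t") for line in sentence_lines]
    by_id = {row[0]: row for row in rows}

    # Map each chain head ID to its last goeswith dependent.
    # Rows are in token order, so later dependents overwrite earlier ones.
    last_part = {}
    for row in rows:
        if row[7] == "goeswith":       # column 8: DEPREL
            last_part[row[6]] = row    # column 7: HEAD

    for head_id, part in last_part.items():
        head = by_id.get(head_id)
        # Only swap when the head carries GW and the last part carries a real tag.
        if head is not None and head[4] == "GW" and part[4] != "GW":
            head[4], part[4] = part[4], head[4]   # column 5: XPOS

    return ["\t".join(row) for row in rows]


# Hypothetical rows illustrating the layout described above (not real corpus data).
example = [
    "1\tany\tanymore\tADV\tGW\t_\t0\troot\t_\t_",
    "2\tmore\t_\tX\tRB\t_\t1\tgoeswith\t_\t_",
]

for line in swap_gw_xpos(example):
    print(line)
```

The sketch takes the last goeswith dependent as the carrier of the real tag, following the PTB-style convention mentioned in the comments; chains whose head is not tagged GW are left untouched.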