Closed by nschneid 8 months ago
I guess the `goeswith` one is actually correct. Fixed the others in EWT.
OK, fixed GUM except for the VBGs, which are intentional; but that shouldn't matter if we're no longer doing Voice=Act.
I thought we decided the VBGs should be tagged VERB (but the deprel of `mark` is correct).
Oh I see, I guess that makes sense. I'll do that for GUM & co. too.
@amir-zeldes There's still a slight divergence between GUM and EWT here: EWT uses ADJ for "such" in the fixed expression "such as". (The expression as a whole functions as `mark`, so `ExtPos=SCONJ` would be appropriate.)
I see, OK, I can still change this. I guess the simplest solution for absolute parity would be to use the same upos script on both datasets, but I think there are some manual upos edits in EWT, right?
I don't have a UPOS generation script for EWT—I have been editing whatever was in the .conllu over the years. If you wanted to run the GUM one on EWT that would make for an interesting comparison.
Sure, it's mostly just depedit, though it's possible there's some other tinkering going on in the buildbot, I'd have to look. It seems like a hassle to maintain both upos and xpos, and since I take it xpos is pretty high quality, I like to correct just that, and project to upos using the tree.
BTW GENTLE also has "such that", I guess we want that to be ADJ SCONJ? Currently they're annotated as mark sisters, since there is no fixed "such that" on the fixed list.
Hmm, EWT just has 'are such that' which is not quite the same. When it acts as `mark`, I would probably include "such that" on the fixed list if "such as" is there. I don't think it makes sense to say that "such" can independently be `mark`.
"so that" is also on the fixed list BTW. Cf. #400
Yeah, factually that all makes sense to me, I'm just very cautious about deviations from the list. So you want to canonize "such that" which only appears in GENTLE as fixed? In fairness, it does appear 7 times in two documents there (mathematical proofs)
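To audit how a candidate expression like "such that" is currently annotated, one could scan the CoNLL-U files for adjacent token pairs where the second token is not attached to the first as `fixed`. The following is only a minimal sketch: the function names and the two sample sentences are made up for illustration, and it reads only plain word lines (multiword token ranges and empty nodes are skipped).

```python
def parse_conllu(text):
    """Yield sentences as lists of token dicts (plain word lines only)."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip ranges like 1-2 and empty nodes like 1.1
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


def unfixed_pairs(text, first="such", second="that"):
    """Return (id, deprel of first, deprel of second) for pairs not joined by `fixed`."""
    hits = []
    for sent in parse_conllu(text):
        for a, b in zip(sent, sent[1:]):
            if a["form"].lower() == first and b["form"].lower() == second:
                is_fixed = b["deprel"] == "fixed" and b["head"] == a["id"]
                if not is_fixed:
                    hits.append((a["id"], a["deprel"], b["deprel"]))
    return hits


# Hypothetical sample: "such" and "that" annotated as mark sisters.
SISTERS = "\n".join([
    "1\tsuch\tsuch\tADJ\tJJ\t_\t4\tmark\t_\t_",
    "2\tthat\tthat\tSCONJ\tIN\t_\t4\tmark\t_\t_",
    "3\tx\tx\tNOUN\tNN\t_\t4\tnsubj\t_\t_",
    "4\tholds\thold\tVERB\tVBZ\t_\t0\troot\t_\t_",
])

# Same hypothetical sentence with "that" attached to "such" as fixed.
FIXED = SISTERS.replace("2\tthat\tthat\tSCONJ\tIN\t_\t4\tmark",
                        "2\tthat\tthat\tSCONJ\tIN\t_\t1\tfixed")

print(unfixed_pairs(SISTERS))  # [(1, 'mark', 'mark')]
print(unfixed_pairs(FIXED))    # []
```

Passing `second="as"` would run the same check for "such as".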
Yeah. There are even a few fixed expressions in EWT that never got added to the list for some reason. At some point we should discuss those too.
(Implementation: UniversalDependencies/UD_English-GENTLE@2103015, UniversalDependencies/docs@b696742)
OK, I changed GENTLE trees to match that, and added it to the fixed list in the UD guidelines and GUM wiki. I feel a bit bad for parsers testing on GENTLE, which is meant to be an OOD test set, since we just introduced a fixed expression that is totally absent from GUM/EWT, meaning it would be unreasonable to expect it to be predicted correctly... But such is life I suppose!
Thanks for adding the commit hashes - I think this is good to close!
(This is relevant to adding `Voice=Pass` (#290), as the validator doesn't allow SCONJ to have a `Voice` feature.)
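Tokens that would trip this constraint can be listed ahead of time with a small script. This is a rough sketch and not the validator's own logic: it only assumes the rule as stated here (no `Voice` feature on a token whose UPOS is SCONJ), and the function names and sample lines are invented for illustration.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'A=1|B=2' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|") if "=" in pair)


def sconj_with_voice(text):
    """Return (id, form, voice) for every SCONJ token carrying a Voice feature."""
    hits = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a token line
        feats = parse_feats(cols[5])
        if cols[3] == "SCONJ" and "Voice" in feats:
            hits.append((cols[0], cols[1], feats["Voice"]))
    return hits


# Hypothetical token lines: the first would be flagged, the second would not.
SAMPLE = "\n".join([
    "1\tgiven\tgive\tSCONJ\tVBN\tTense=Past|VerbForm=Part|Voice=Pass\t3\tmark\t_\t_",
    "2\ttaken\ttake\tVERB\tVBN\tTense=Past|VerbForm=Part|Voice=Pass\t0\troot\t_\t_",
])

print(sconj_with_voice(SAMPLE))  # [('1', 'given', 'Pass')]
```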