UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

SCONJ should correspond to xpos=IN #457

Closed by nschneid 8 months ago

nschneid commented 8 months ago

(This is relevant to adding Voice=Pass #290 as the validator doesn't allow SCONJ to have a Voice feature.)
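For illustration only (not the validator's code): a minimal sketch of the check implied by the issue title, flagging tokens whose UPOS is SCONJ but whose XPOS is not IN. The function name `find_sconj_mismatches` and the two sample tokens are hypothetical, and the sketch ignores any exceptions discussed in this thread.

```python
def find_sconj_mismatches(conllu_text):
    """Return (id, form, xpos) for tokens with UPOS=SCONJ and XPOS != IN."""
    mismatches = []
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # CoNLL-U token lines have exactly 10 tab-separated columns.
        if len(cols) != 10:
            continue
        tok_id, form, upos, xpos = cols[0], cols[1], cols[3], cols[4]
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if not tok_id.isdigit():
            continue
        if upos == "SCONJ" and xpos != "IN":
            mismatches.append((tok_id, form, xpos))
    return mismatches


# Invented sample tokens (not taken from EWT or GUM).
SAMPLE = "\n".join([
    "# text = (hypothetical fragment)",
    "\t".join(["1", "if", "if", "SCONJ", "IN", "_", "3", "mark", "_", "_"]),
    "\t".join(["2", "following", "follow", "SCONJ", "VBG", "_", "3", "mark", "_", "_"]),
    "\t".join(["3", "goes", "go", "VERB", "VBZ", "_", "0", "root", "_", "_"]),
])

print(find_sconj_mismatches(SAMPLE))
```

On the invented sample, only the VBG token is reported; the IN token passes.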

nschneid commented 8 months ago

I guess the goeswith one is actually correct. Fixed the others in EWT.

amir-zeldes commented 8 months ago

OK, fixed GUM except for the VBGs, which are intentional; but that shouldn't matter if we're now not doing Voice=Act.

nschneid commented 8 months ago

I thought we decided the VBGs should be tagged VERB (but the deprel of mark is correct).

amir-zeldes commented 8 months ago

Oh I see, I guess that makes sense. I'll do that for GUM & co. too

nschneid commented 8 months ago

@amir-zeldes There's still a slight divergence between GUM and EWT here: EWT uses ADJ for "such" in the fixed expression "such as". (The expression as a whole functions as mark, so ExtPos=SCONJ would be appropriate.)

amir-zeldes commented 8 months ago

I see, OK, I can still change this. I guess the simplest solution for absolute parity would be to use the same upos script on both datasets, but I think there are some manual upos edits in EWT, right?

nschneid commented 8 months ago

I don't have a UPOS generation script for EWT—I have been editing whatever was in the .conllu over the years. If you wanted to run the GUM one on EWT that would make for an interesting comparison.

amir-zeldes commented 8 months ago

Sure, it's mostly just depedit, though it's possible there's some other tinkering going on in the buildbot; I'd have to look. It seems like a hassle to maintain both upos and xpos, and since I take it xpos is pretty high quality, I prefer to correct just that and project to upos using the tree.

BTW GENTLE also has "such that", I guess we want that to be ADJ SCONJ? Currently they're annotated as mark sisters, since there is no fixed "such that" on the fixed list.
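For illustration only: a rough sketch of how one might locate the "mark sisters" pattern described above, i.e. adjacent "such" + "that" tokens that both attach as mark to the same head rather than forming a fixed expression. The function name `find_mark_sisters` and the sample trees are hypothetical; this is not the depedit configuration used for GUM or GENTLE.

```python
def find_mark_sisters(conllu_text, first="such", second="that"):
    """Return (id_first, id_second) for adjacent tokens that are both
    `mark` dependents of the same head."""
    hits = []
    prev = None
    for line in conllu_text.splitlines():
        if not line.strip():
            prev = None  # sentence boundary: reset adjacency
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep only regular token lines (10 columns, plain integer ID).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        cur = (cols[0], cols[1].lower(), cols[6], cols[7])  # id, form, head, deprel
        if (
            prev is not None
            and prev[1] == first
            and cur[1] == second
            and prev[3] == "mark"
            and cur[3] == "mark"
            and prev[2] == cur[2]
        ):
            hits.append((prev[0], cur[0]))
        prev = cur
    return hits


def make_sample(that_head, that_deprel):
    """Build an invented four-token fragment: 'such that x holds'."""
    rows = [
        ["1", "such", "such", "ADJ", "JJ", "_", "4", "mark", "_", "_"],
        ["2", "that", "that", "SCONJ", "IN", "_", that_head, that_deprel, "_", "_"],
        ["3", "x", "x", "NOUN", "NN", "_", "4", "nsubj", "_", "_"],
        ["4", "holds", "hold", "VERB", "VBZ", "_", "0", "root", "_", "_"],
    ]
    return "\n".join("\t".join(r) for r in rows)


SISTERS = make_sample("4", "mark")   # both tokens attach to "holds" as mark
FIXED = make_sample("1", "fixed")    # "that" attaches to "such" as fixed

print(find_mark_sisters(SISTERS))
print(find_mark_sisters(FIXED))
```

The sisters variant is reported as a pair, while the fixed variant yields no hits.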

nschneid commented 8 months ago

Hmm, EWT just has 'are such that' which is not quite the same. When it acts as mark, I would probably include "such that" on the fixed list if "such as" is there. I don't think it makes sense to say that "such" can independently be mark.

nschneid commented 8 months ago

"so that" is also on the fixed list BTW. Cf. #400

amir-zeldes commented 8 months ago

Yeah, factually that all makes sense to me, I'm just very cautious about deviations from the list. So you want to canonize "such that", which only appears in GENTLE, as fixed? In fairness, it does appear 7 times in two documents there (mathematical proofs).

nschneid commented 8 months ago

Yeah. There are even a few fixed expressions in EWT that never got added to the list for some reason. At some point we should discuss those too.

nschneid commented 8 months ago

(Implementation: UniversalDependencies/UD_English-GENTLE@2103015 UniversalDependencies/docs@b696742 )

amir-zeldes commented 8 months ago

OK, I changed GENTLE trees to match that, and added it to the fixed list in the UD guidelines and GUM wiki. I feel a bit bad for parsers testing on GENTLE, which is meant to be an OOD test set, since we just introduced a fixed expression that is totally absent from GUM/EWT, meaning it would be unreasonable to expect it to be predicted correctly... But such is life I suppose!

amir-zeldes commented 8 months ago

Thanks for adding the commit hashes - I think this is good to close!