UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

"as opposed to" #434

Open nschneid opened 11 months ago

nschneid commented 11 months ago

This is documented as fixed: https://universaldependencies.org/en/dep/fixed.html

What should be the UPOS of "as"? The data are inconsistent between ADP, ADV, SCONJ:

amir-zeldes commented 11 months ago

Should be IN/ADP+VBN/VERB+IN/ADP, since POS is tokenwise (so breaks down the same as "like contrasted with" IMO). Will fix GUM.

nschneid commented 11 months ago

Since "as" and "like" can in general be either ADP or SCONJ, could you clarify why you think ADP is better here?

nschneid commented 11 months ago

And also, should it depend on whether the fixed expression is functioning as case vs. mark?

amir-zeldes commented 11 months ago

Same as all English prepositions, ADP+case for adnominal, SCONJ+mark for clausal, no?

nschneid commented 11 months ago

(after some offline discussion with @amir-zeldes) To be clear, the issue is the word-level UPOS of "as" given that we mark the whole thing as fixed, which can function as a whole either as case or mark. Plain "as" can also function as case/ADP or mark/SCONJ. The PTB tagset which we use for XPOS doesn't distinguish these (IN = ADP+SCONJ) so it is a question of how the context should be taken into account for the first word of a fixed expression.
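To see how the first word of a fixed expression is currently tagged, one could tally its UPOS and DEPREL across a treebank. The following is only a minimal sketch: `read_conllu` and `fixed_head_tags` are hypothetical helper names, the parser handles just the ten standard CoNLL-U columns, and the sample sentence is an invented toy example, not a line from EWT or GUM.

```python
from collections import Counter


def read_conllu(text):
    """Return sentences as lists of token dicts parsed from CoNLL-U text."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Keep basic words only: skip multiword-token ranges (1-2),
        # empty nodes (1.1), and rows without a numeric HEAD.
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        current.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


def fixed_head_tags(sentences, phrase):
    """Count (UPOS, DEPREL) of the first word of each fixed expression equal to phrase."""
    target = phrase.lower().split()
    counts = Counter()
    for sent in sentences:
        for tok in sent:
            deps = [t for t in sent
                    if t["head"] == tok["id"] and t["deprel"] == "fixed"]
            if not deps:
                continue
            deps.sort(key=lambda t: t["id"])
            words = [tok["form"].lower()] + [t["form"].lower() for t in deps]
            if words == target:
                counts[(tok["upos"], tok["deprel"])] += 1
    return counts


# Invented toy sentence ("Tea as opposed to coffee"), for illustration only.
ROWS = [
    ("1", "Tea", "tea", "NOUN", "NN", "_", "0", "root", "_", "_"),
    ("2", "as", "as", "ADP", "IN", "_", "5", "case", "_", "_"),
    ("3", "opposed", "oppose", "VERB", "VBN", "_", "2", "fixed", "_", "_"),
    ("4", "to", "to", "ADP", "IN", "_", "2", "fixed", "_", "_"),
    ("5", "coffee", "coffee", "NOUN", "NN", "_", "1", "nmod", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n"

counts = fixed_head_tags(read_conllu(SAMPLE), "as opposed to")
print(counts)
```

Run over a real treebank file, a tally with more than one UPOS per DEPREL would point to the kind of inconsistency raised here.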

@dan-zeman any thoughts?

dan-zeman commented 11 months ago

If a node has a fixed dependent, it means that the node's UPOS does not (necessarily) reflect the word's position in the sentence. The UPOS that would correspond to the fixed expression as a whole may be different and it is not annotated in UD (except for optional MWEPOS or ExtPOS in MISC). The validator knows about this anomaly and skips most UPOS-incoming relation compatibility tests if it sees fixed among the outgoing relations. So I think you should not modify the UPOS of the first node of a fixed expression based on its DEPREL.
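The skip described above can be illustrated with a small sketch. This is not the validator's actual code: `EXPECTED_UPOS` is a hypothetical table that is deliberately stricter than anything described in this thread, chosen only so that the effect of the skip is visible, and `compatibility_warnings` is an invented function name.

```python
# Hypothetical UPOS expectations per incoming relation, for illustration only.
# This is NOT the real validator's table.
EXPECTED_UPOS = {"case": {"ADP"}, "mark": {"SCONJ"}}


def compatibility_warnings(sentence):
    """Return (id, form, upos, deprel) for tokens whose UPOS does not fit their DEPREL.

    Tokens that have an outgoing `fixed` relation are skipped, because their
    UPOS need not reflect the function of the whole expression.
    """
    heads_of_fixed = {t["head"] for t in sentence if t["deprel"] == "fixed"}
    warnings = []
    for tok in sentence:
        if tok["id"] in heads_of_fixed:
            continue  # first word of a fixed expression: no compatibility test
        expected = EXPECTED_UPOS.get(tok["deprel"])
        if expected is not None and tok["upos"] not in expected:
            warnings.append((tok["id"], tok["form"], tok["upos"], tok["deprel"]))
    return warnings


# Toy token lists (invented): "as" tagged SCONJ but attached as case.
with_fixed = [
    {"id": 1, "form": "as", "upos": "SCONJ", "head": 4, "deprel": "case"},
    {"id": 2, "form": "opposed", "upos": "VERB", "head": 1, "deprel": "fixed"},
    {"id": 3, "form": "to", "upos": "ADP", "head": 1, "deprel": "fixed"},
    {"id": 4, "form": "coffee", "upos": "NOUN", "head": 0, "deprel": "root"},
]
plain = [
    {"id": 1, "form": "as", "upos": "SCONJ", "head": 2, "deprel": "case"},
    {"id": 2, "form": "coffee", "upos": "NOUN", "head": 0, "deprel": "root"},
]

print(compatibility_warnings(with_fixed))
print(compatibility_warnings(plain))
```

Under these assumptions, the same SCONJ/case combination is reported only when "as" has no fixed dependent.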

Independently of the above, I also think that a word that is prototypically an adposition can keep the ADP tag even if it occurs as a mark dependent. The validator should digest the opposite situation, too: "as" is perhaps prototypically SCONJ (at least for me) but it should be possible to attach it as case if needed. So I would probably choose only one UPOS category for "as" even outside fixed expressions.

nschneid commented 11 months ago

So, there's a good reason that for English, PTB merges prepositions and subordinators under one tag, IN: there is heavy lexical overlap between the more traditional ADP and SCONJ categories. Thus far we have been choosing UPOS based on context. We could go in a different direction, for example, with the goal of minimizing UPOS ambiguity per word, and allowing ADP/mark (perhaps also SCONJ/case). Not sure this is a high priority though.

If in general we resolve UPOS based on context, it leaves the UPOS of the first word of a fixed expression underspecified. We could just default to ADP for words that can be prepositions.
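As a rough sketch of what such a default could look like in a conversion script: the function below sets the UPOS of the first word of a fixed expression to ADP when the word is in a lexicon of preposition-capable forms. The name `default_fixed_heads_to_adp` and the lexicon `PREPOSITION_CAPABLE` are hypothetical; the lexicon lists only "as" and "like", the two words mentioned in this thread, and a real list would need to be decided separately.

```python
# Hypothetical lexicon: only the two words mentioned in this thread.
PREPOSITION_CAPABLE = {"as", "like"}


def default_fixed_heads_to_adp(sentence, lexicon=PREPOSITION_CAPABLE):
    """Return a copy of the sentence where preposition-capable first words
    of fixed expressions are tagged ADP, regardless of their DEPREL."""
    fixed_heads = {t["head"] for t in sentence if t["deprel"] == "fixed"}
    result = []
    for tok in sentence:
        new_tok = dict(tok)  # copy so the input is left untouched
        if tok["id"] in fixed_heads and tok["form"].lower() in lexicon:
            new_tok["upos"] = "ADP"
        result.append(new_tok)
    return result


# Toy token list (invented): "as opposed to" functioning as mark.
sentence = [
    {"id": 1, "form": "As", "upos": "SCONJ", "head": 4, "deprel": "mark"},
    {"id": 2, "form": "opposed", "upos": "VERB", "head": 1, "deprel": "fixed"},
    {"id": 3, "form": "to", "upos": "ADP", "head": 1, "deprel": "fixed"},
    {"id": 4, "form": "waiting", "upos": "VERB", "head": 0, "deprel": "root"},
]

normalized = default_fixed_heads_to_adp(sentence)
print([(t["form"], t["upos"], t["deprel"]) for t in normalized])
```

The DEPREL is left alone, so the expression as a whole can still function as case or mark.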