UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Lexical refinement on edeprels #287

Open nschneid opened 2 years ago

nschneid commented 2 years ago

The inventory of possible lexical markers on nmod, obl, acl, advcl, and conj Enhanced relations is now specified for the validator at: https://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/langspec/specify_edeprel.pl?lcode=en

There are a number of errors in EWT and GUM, some of which require tweaking of the inventory, and others of which should be changed in the data. Let's use this issue to track the discussion.
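For readers who want to look for such errors locally, here is a minimal sketch of the kind of check involved. It assumes plain CoNLL-U input and a hand-written inventory: `ALLOWED` and `SUBTYPES` below are illustrative placeholders, not the registered English list, and the function names are hypothetical rather than part of the validator.

```python
# Hypothetical, illustrative inventory: relation -> allowed lexical markers.
# This is NOT the list registered for the validator.
ALLOWED = {
    "nmod": {"of", "in", "versus"},
    "obl": {"in", "till"},
    "acl": {"to"},
    "advcl": {"if", "in_order"},
    "conj": {"and", "or", "as_well_as"},
}

# Assumed non-lexical subtypes that should not be treated as markers.
SUBTYPES = {"poss", "tmod", "npmod", "relcl", "agent"}


def parse_deps(deps):
    """Split a DEPS value such as '4:nmod:of|7:conj:and' into (head, edeprel) pairs."""
    if deps == "_":
        return []
    pairs = []
    for item in deps.split("|"):
        head, edeprel = item.split(":", 1)
        pairs.append((head, edeprel))
    return pairs


def unregistered_markers(conllu_text):
    """Return (token id, edeprel) for every lexical marker missing from ALLOWED."""
    problems = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 10:
            continue  # skip comments, blank lines, and malformed rows
        for _head, edeprel in parse_deps(cols[8]):
            parts = edeprel.split(":")
            base = parts[0]
            lexical = [p for p in parts[1:] if p not in SUBTYPES]
            if base in ALLOWED and lexical and lexical[0] not in ALLOWED[base]:
                problems.append((cols[0], edeprel))
    return problems
```

Token ids are sentence-local, so a real script would also want to track the `# sent_id` comment to report where each problem occurs.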

nschneid commented 2 years ago

@dan-zeman When I try to allow conj:rather_than it says I need to specify a coordinating function. Can "instead of something" be considered a (sometimes) coordinating function? (see #182)

nschneid commented 2 years ago

For "in order {to, for...to, that}", fixed only covers the "in order" part. Should the edeprel be advcl:in_order (as currently in EWT), or should it incorporate the marker? GUM has advcl:in_order_to and advcl:in_order_for, or the latter could be advcl:in_order_for_to.

I see that for_to is included in the list, and used in GUM but not EWT.

dan-zeman commented 2 years ago

I am wondering whether this issue should reside under docs, as it is not just about EWT.

dan-zeman commented 2 years ago

@dan-zeman When I try to allow conj:rather_than it says I need to specify a coordinating function. Can "instead of something" be considered a (sometimes) coordinating function? (see #182)

I actually answered there (https://github.com/UniversalDependencies/UD_English-EWT/issues/182#issuecomment-1007948541) and now I see that you moved the question here. So here is a copy:

When I try to allow conj:rather_than it says I need to specify a coordinating function. Can "instead of something" be considered a (sometimes) coordinating function?

The last set of functions are intended for use with conj:

Paratactic relation

Conjunction (“and”)
Negative conjunction (“neither … nor”)
Disjunction (“or”)
Adversative (“but, yet”)
Inferential-reason (“for”)
Inferential-consequence (“so”)

Maybe we can say it's a disjunction? I thought it was a subordinator and using it with conj was an error. But I was not aware of this issue.

dan-zeman commented 2 years ago

I see that @nschneid has added where with the example "I know [where you live]" but I would argue that at least in this sentence, where is an adverb and it should be attached via advmod to live, so it should not be propagated to the edeprel. I recall that the analysis of where has been discussed somewhere but I did not find the issue now (I don't know whether it is under UD_English-EWT, under docs, or somewhere else). EDIT: Found it here: https://github.com/UniversalDependencies/UD_English-EWT/issues/88.

IMHO the same problem applies to wherever (2 occurrences in EWT, one as ADV, one as SCONJ, without any actual difference; 2 occurrences in GUM, both in wherever possible, both treated as SCONJ; I believe all of them should be ADV).

IMHO the same problem applies to whither (1 occurrence in GUM, should be ADV).
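Counts like these can be gathered with a short script. The sketch below is only an illustration that assumes plain CoNLL-U input; `upos_counts` is a hypothetical helper name, and it matches on the LEMMA column, lowercased.

```python
from collections import Counter


def upos_counts(conllu_text, lemmas):
    """Count (lemma, UPOS) pairs for the given lemmas in CoNLL-U text."""
    wanted = {lemma.lower() for lemma in lemmas}
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 10:
            continue  # skip comments, blank lines, and malformed rows
        lemma = cols[2].lower()
        if lemma in wanted:
            counts[(lemma, cols[3])] += 1
    return counts
```

Calling `upos_counts(text, {"wherever", "whither"})` on a treebank file's contents returns a `Counter` keyed by `(lemma, UPOS)`.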

dan-zeman commented 2 years ago

In the pattern no choice but to do something, there is an acl attached to choice, with two markers, but and to. There are 2 occurrences in EWT (http://hdl.handle.net/11346/PMLTQ-QQAW) and one in GUM (http://hdl.handle.net/11346/PMLTQ-7RRP). For the enhanced relation, GUM uses acl:but_to, which is the mechanical default when multiple markers are present. However, here I actually like the approach of EWT, which uses only acl:to — IMHO a better indication that there is an infinitival adnominal clause. In any case, it should be harmonized (right now acl:but_to is not registered, leading to an error in GUM).
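To make the "mechanical default" concrete, here is a small sketch of how a multi-marker label such as acl:but_to could arise. It reflects one reading of that default (join the marker lemmas in surface order with underscores); the function name and input shape are assumptions, and this is not the converter actually used for GUM or EWT.

```python
def default_edeprel(base, markers):
    """Append marker lemmas, in surface order, to a base relation.

    markers: list of (token_position, lemma) pairs for the case/mark
    dependents of the node (hypothetical input format).
    """
    lemmas = [lemma.lower() for _pos, lemma in sorted(markers)]
    if not lemmas:
        return base
    return base + ":" + "_".join(lemmas)


# "no choice but to do something": two markers on the acl
print(default_edeprel("acl", [(4, "to"), (3, "but")]))  # acl:but_to
# Keeping only the infinitival marker, as EWT does
print(default_edeprel("acl", [(4, "to")]))  # acl:to
```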

dan-zeman commented 2 years ago

@amir-zeldes : In GUM, M = 7.64 ± 1.12 is analyzed so that ± 1.12 is nmod of 7.64. Shouldn't binary mathematical operations be conj? (That would mean that we need enhanced conj:plus_minus rather than nmod:plus_minus.)

dan-zeman commented 2 years ago

Is it grammatical in English to omit the second as from as well as? GUM has one example: ... this project study gives solution to the problem of the society concerning environment, health and safety as well energy conservation ... To me it sounds like there should be as well as but the author forgot to complete it. If that's true, then the enhanced relation should be conj:as_well_as (which is already registered), not conj:as_well.

dan-zeman commented 2 years ago

Is it a good idea to augment enhanced deprels with foreign case markers in code-switched data? Example: GUM uses nmod:de and nmod:a in a sentence that is completely French: “J' ai besoin de tout mon courage pour mourir à vingt ans!” I think it would be quite sufficient to stay with plain nmod in such cases.

Alternatively, the validator could be modified to also observe MISC Lang=fr for edeprels, as it currently does for auxiliaries and features. (But Lang=fr would have to be added to that sentence, it is not there at present.)
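A rough sketch of what observing Lang could look like, assuming the language code is read from the MISC column of the dependent and used to pick a per-language marker list. The helper names and the `inventories` mapping are hypothetical, and this does not describe how the validator is implemented.

```python
def misc_lang(misc, default="en"):
    """Return the Lang= value from a CoNLL-U MISC field, or the treebank default."""
    if misc == "_":
        return default
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        if key == "Lang" and sep:
            return value
    return default


def marker_registered(marker, misc, inventories, default="en"):
    """Check a marker against the inventory of the token's language.

    inventories: dict mapping language code -> set of registered markers
    (illustrative data supplied by the caller).
    """
    lang = misc_lang(misc, default)
    return marker in inventories.get(lang, set())
```

With this design, a marker such as de on a token carrying Lang=fr would be looked up in the French list instead of the English one, so that list would still need entries for the check to pass.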

dan-zeman commented 2 years ago

Any criteria for deciding whether versus should be preposition or coordinator in English? All examples in EWT and GUM result in nmod:versus, except for one in GUM, which is conj:versus and probably is just an error that should be fixed. (Nevertheless, Reynolds and Pullum (2013) argue in 4.3 that the function of versus has shifted towards coordinator.)

dan-zeman commented 2 years ago

Is til an acceptable alternative spelling of till, or is it a typo that should be normalized to till? There is one obl:til in EWT and one in GUM; if it was obl:till, it wouldn't be reported as an error.

dan-zeman commented 2 years ago

There are two instances of aka in EWT (http://hdl.handle.net/11346/PMLTQ-FRJ0). They are tagged as ADV, which seems suspicious to me. Nevertheless, the noun they modify is attached to the preceding nominal as appos (in the first case; in the second the antecedent is missing), which I agree with.

GUM has six instances (http://hdl.handle.net/11346/PMLTQ-FAAK). They are tagged as ADP, which might be okay (either that or a conjunction). But the edeprel of the nominal is nmod:aka while I think it should be appos.

amir-zeldes commented 2 years ago

Thanks for finding all of these! My take on these is:

nschneid commented 2 years ago

dan-zeman commented 2 years ago

  • conj:rather_than is correct under the current dependency analysis of "rather than"

@nschneid has added it.

  • if the infinitive "to" is generally included, I think the correct edeprel is advcl:in_order_to for consistency

I will leave this one for the two of you to sort out (advcl:in_order_to had been registered but was later replaced with advcl:in_order by @nschneid, to also accommodate advcl:in_order_for).

  • If we had an advcl marked by "where" (specifying location of a predicate as an adjunct clause), it should behave like "if", and should be mark; "where" as advmod is a Stanford Dependencies thing which was phased out in UD AFAIK (this was actually one of the GUM <6 SD to UD rules); "wherever possible" etc. are correct as SCONJ, just like "if possible". "whither" would be advmod in a question, but otherwise not.

I never heard of phasing it out in UD but maybe it was some English-internal discussion. FWIW, the equivalents in Czech are treated as adverbs (and it is the same in the Prague treebanks, i.e., without any connection to Stanford Dependencies). I am convinced that a wh-adverb stays an adverb and occupies an adverbial position regardless of whether it is a question, a complement clause, or an adverbial clause.

  • nmod:plus_minus -> conj:plus_minus - this seems reasonable, I can change it

OK, conj:plus_minus is now registered.

  • de etc - oh, this is a funny one! I guess the student here went above and beyond the call of duty and did a whole French tree! I feel a bit sad throwing it away, but it should probably just all be flat no? Or what do you think? I agree an English edep with de is not sensible and could just remove it, but I'm not sure what is the best overall solution.

I would definitely not remove the sentence because that would break the integrity of the document, but you probably did not mean that. I would also not necessarily flatten the tree; I think using UPOS X and flat:foreign is an option for annotators who cannot or do not want to annotate the foreign language, but actually annotating it following the foreign language guidelines is possible and some treebanks do it. But I would only use nmod here so that we do not have to register the foreign prepositions (if we relied on Lang=fr, we would still have to register it in the French list, which is now empty).

In contrast, I did register Latin et as an English conjunction because I thought et al. has been naturalized in English.

  • I think "aka" can either be analyzed fully, as replacing a clause headed by "known" (then it should be acl), or we can view it as a preposition-like thing replacing "as", in which case nmod. I'm not sure about appos... I guess I'm convinceable. What do you think @nschneid ?

Yeah, also known as could be a fixed multi-word preposition or conjunction (but would you treat it as such if it occurred in the corpus?) I don't think it disqualifies the nominal from being an apposition (semantically it indeed sounds like one). Actually, I had the same feeling about such as, that's why I did not include it in the first round of porting English edeprels. So maybe the two should have the same solution. But if you guys believe it has to be nmod, we can add nmod:aka to the list.

nschneid commented 2 years ago

In contrast, I did register Latin et as an English conjunction because I thought et al. has been naturalized in English.

This is subject to debate. @amir-zeldes thinks of et as a conjunction even in English, whereas I think of "et al." as a fixed phrase. Let's revisit after resolving "etc.".

nschneid commented 2 years ago

I would only use nmod here so that we do not have to register the foreign prepositions

What about nmod:FOREIGN? So that scripts will know the lack of a lexical refinement is not an error.

dan-zeman commented 2 years ago

  • An interesting case is if they quote something in another language and then give a translation, suggesting that the reader may not speak the other language.

That was actually the case of the French sentence I showed (the English translation came two sentences later).

dan-zeman commented 2 years ago

I would only use nmod here so that we do not have to register the foreign prepositions

What about nmod:FOREIGN? So that scripts will know the lack of a lexical refinement is not an error.

Lexical refinement is not present everywhere, so I don't think this is necessary. (And deprels are all-lowercase, so it would have to be nmod:foreign.)

nschneid commented 2 years ago

I would only use nmod here so that we do not have to register the foreign prepositions

What about nmod:FOREIGN? So that scripts will know the lack of a lexical refinement is not an error.

Lexical refinement is not present everywhere, so I don't think this is necessary. (And deprels are all-lowercase, so it would have to be nmod:foreign.)

There should almost always be a lexical refinement if the dependent has a case or mark dependent, right?

dan-zeman commented 2 years ago

There should almost always be a lexical refinement if the dependent has a case or mark dependent, right?

But if the scripts already check the presence of a case/mark dependent, then they can check whether it has Foreign=Yes in the features.
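A minimal sketch of such a script-side check, assuming tokens have already been read into dictionaries; the input format and function names are hypothetical. It flags a node with a case/mark dependent and a bare enhanced relation unless every such marker carries Foreign=Yes.

```python
# Relations assumed to take a lexical refinement when a marker is present.
LEXICALIZED = {"nmod", "obl", "acl", "advcl"}


def parse_feats(feats):
    """Turn a FEATS value like 'Foreign=Yes|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def unexplained_bare_edeprels(tokens):
    """Return ids of tokens whose bare edeprel is not explained by Foreign=Yes.

    tokens: list of dicts with keys id, head, deprel, feats, edeprels
    (edeprels is a list of enhanced relation labels; hypothetical format).
    """
    flagged = []
    for tok in tokens:
        markers = [t for t in tokens
                   if t["head"] == tok["id"] and t["deprel"] in ("case", "mark")]
        if not markers:
            continue
        bare = [e for e in tok["edeprels"] if e in LEXICALIZED]
        all_foreign = all(parse_feats(m["feats"]).get("Foreign") == "Yes"
                          for m in markers)
        if bare and not all_foreign:
            flagged.append(tok["id"])
    return flagged
```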

dan-zeman commented 2 years ago

  • It is inelegant to list but_to, from_under, from_above, etc., but it would remove information to simplify it to just one word. What about putting + in the edeprel in such cases, and the validator will accept any + combination of listed lexical items? This would also work for in_order+that, in_order+to, in_order+for+to.

I feel quite strongly against adding any more complexity to the internal logic of the deprels. In fact, I hope that in the distant future, we will be able to replace all these lexical labels with some semantic tags that will be portable across languages.

I'm actually quite fine with from_under because some Northeast-Caucasian languages have a morphological case with the same meaning. But in other cases I tend to think that only one of the function words directly relates to the nmod relation. "But to" was one of such cases but I'm not sure I can specify cross-linguistic decision criteria where this should be done. (In Czech, I treated all combinations of jako 'as' + another preposition as if it was only jako; the same for než 'than'.)
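For reference, the "+" proposal quoted at the top of this comment would amount to a check roughly like the one below. This is only an illustration of the idea under discussion, with a made-up `listed` set and a hypothetical function name; it is not something the validator does.

```python
def plus_combination_ok(edeprel, listed):
    """Accept an edeprel whose lexical part is any '+' combination of listed items."""
    base, sep, lexical = edeprel.partition(":")
    if not sep:
        return True  # bare relation, nothing to check
    return all(part in listed for part in lexical.split("+"))


# Illustrative list of individually registered items (an assumption).
listed = {"in_order", "to", "for", "that", "but", "from", "under", "above"}

print(plus_combination_ok("advcl:in_order+for+to", listed))  # True
print(plus_combination_ok("acl:but+too", listed))            # False
```

Note that such a check accepts every combination of listed items, including ones that never occur, which is part of the extra complexity being objected to here.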

amir-zeldes commented 2 years ago

I would opt for simplicity as well - edeps are a work in progress from my perspective, and messing with them too much right now may be premature optimization. I am happy with "from_above" for right now, and if infinitive "to" is in then it is in, meaning "but_to" (in the sense "except to") is also in.

I also don't think ":foreign" is necessary since there is Foreign, and this could lead to conflicts. And in any case, we have bare enhanced things like conj for zero coordination etc.

nschneid commented 2 years ago

The above changes (and recent additions to the validator list) result in EWT being VALID! A couple of items to note for future investigation:

The other changes were fairly straightforward.

amir-zeldes commented 2 years ago

Just one thing about comparative correlatives ("the more the merrier") - I'm all for advcl:the here, but that does open the question of what deprel and POS it should have. Currently it has det and DET because PTB tags DT (albeit sticking "the more" under a phrase node X, whatever that means). I think logically it should probably be IN/SCONJ/mark, but I'm not sure it's worth disrupting the PTB xpos ecosystem.

Options include:

  1. DT/DET/det/advcl:the
  2. DT/SCONJ/mark/advcl:the
  3. DT/DET/mark/advcl:the
  4. Other permutations?
  5. Just give up on advcl:the since this is a bit messy

Maybe option 2 is the best for preserving 'status quo' while expressing linguistic structure faithfully, though it does create an xpos/upos disparity (but an automatable one, since no other 'the' is deprel mark).
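As a sketch of why the disparity is automatable, the following assumes plain CoNLL-U input and relies on the claim above that only the correlative 'the' has deprel mark. The function name is hypothetical; XPOS is left untouched and only UPOS is rewritten.

```python
def retag_correlative_the(conllu_text):
    """Set UPOS to SCONJ for tokens with lemma 'the' and deprel 'mark' (option 2)."""
    out = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        is_token = len(cols) == 10 and not line.startswith("#")
        if is_token and cols[2].lower() == "the" and cols[7] == "mark":
            cols[3] = "SCONJ"  # UPOS changes; XPOS (cols[4]) stays DT
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)
```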

dan-zeman commented 2 years ago

Option 2 sort of makes sense to me. The only other viable option seems to be 5 (perhaps I'd even prefer that one). Because if we put the in the edeprel, we are saying that it functions as a marker rather than a determiner.

amir-zeldes commented 2 years ago

Because if we put the in the edeprel, we are saying that it functions as a marker rather than a determiner.

Yes, I think it is - historically it's a separate case form, distinct from the regular article (also compare the German form "desto", which is not the same as the regular article). It's only coincidentally a homonym of the article at this point, but really it's a totally different word morphosyntactically - it's labeled mark currently and I think that's correct, so it should be SCONJ too IMO.