UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Way to silence fixed-gap validator warning? #1003

Open nschneid opened 9 months ago

nschneid commented 9 months ago

The validator issues a warning if there are words intervening between elements of a fixed expression (https://github.com/UniversalDependencies/tools/blob/cf9d1ae087e01a0a8646d0352315528fcbfc3ab8/validate.py#L1868).

This is just a warning because there are some legitimate cases, either due to a systematic construction in a language or due to an exceptional sentence. Could there be a way to indicate this in the data so as to remove the warning? E.g. FixedGap=Yes in MISC.
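To make the proposal concrete, here is a minimal sketch of how a gap check with an opt-out might look. It is not the code in `validate.py`; the function names are hypothetical, `FixedGap=Yes` is only the attribute suggested above (it is not an existing UD feature), and placing it on the first word of the expression is an assumption.

```python
def parse_misc(misc):
    """Turn a CoNLL-U MISC string such as 'SpaceAfter=No|FixedGap=Yes' into a dict."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def fixed_gap_warnings(sentence_lines):
    """Return (head_id, member_ids) for each fixed expression that has a gap.

    Expressions whose head carries FixedGap=Yes in MISC are skipped.
    FixedGap is the hypothetical attribute proposed in this issue.
    """
    rows = []
    for line in sentence_lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        rows.append(cols)

    members = {}     # head id -> ids of its 'fixed' dependents
    misc_by_id = {}  # token id -> parsed MISC attributes
    for cols in rows:
        token_id = int(cols[0])
        misc_by_id[token_id] = parse_misc(cols[9])
        if cols[7].split(":")[0] == "fixed":
            members.setdefault(int(cols[6]), []).append(token_id)

    warnings = []
    for head, children in sorted(members.items()):
        ids = sorted([head] + children)
        # The expression is contiguous if its ids form an unbroken range.
        if ids == list(range(ids[0], ids[-1] + 1)):
            continue
        if misc_by_id.get(head, {}).get("FixedGap") == "Yes":
            continue  # the annotator has declared the gap intentional
        warnings.append((head, ids))
    return warnings
```

With a check shaped like this, an annotator would add `FixedGap=Yes` to the MISC column of the first word of an intentionally interrupted expression, and every other gapped `fixed` would still be reported.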

Stormur commented 8 months ago

Why? I do not see how a gap inside a fixed expression can be justified. Following these warnings for Latin treebanks I just found loads of annotation errors or bad practices.

Do you have some examples? If something can come in between, then is this not a sign that the syntax is not "frozen" (as per guidelines) and so has to be made explicit?

amir-zeldes commented 8 months ago

In languages with Wackernagel particles, such as ⲇⲉ in Classical Greek or Coptic, fixed expressions can often be interrupted if they happen to stand in the first position in the sentence, simply because the enclitic particle has to appear in the second position. It would be strange to consider such expressions fixed except when they happen to begin a sentence which has such a particle. The placement of the particle in those cases is fully automatic and does not respect syntactic phrasal constructions as a constraint.

nschneid commented 8 months ago

Wackernagel particles are a clear case where the exception is needed. The one case in EWT is "due largely to". It's a bit borderline, but when we last discussed this the consensus was that "due to" is sufficiently frozen to annotate it as such even if there are occasional internal modifiers.

jnivre commented 8 months ago

A Swedish example is ”för … sedan”, meaning ”… ago”, as in “för 20 år sedan”, meaning ”20 years ago”. You can insert any time expression between “för” (“for”) and “sedan” (“then”), but the combination of ”för” and ”sedan” is completely frozen and syntactically anomalous.

Joakim


Stormur commented 8 months ago

With regard to Wackernagel particles:

So something is wrong with such a fixed.

The placement of the particle in those cases is fully automatic and does not respect syntactic phrasal constructions as a constraint.

I would strongly argue against that (although I have no quick references at hand, I am sorry), given that we observe such particles also sometimes considering whole phrases, not only single elements. So they do interact with (or "respect") syntax.

A Swedish example is ”för … sedan”, meaning ”… ago”, as in “för 20 år sedan”, meaning ”20 years ago”. You can insert any time expression between “för” (“for”) and “sedan” (“then”), but the combination of ”för” and ”sedan” is completely frozen and syntactically anomalous.

This is an interesting and tricky case. But could a lexicocentric approach not favour tying both members to the head år instead of making them depend on each other? Is this not one of those cases that should be shifted to a different MWE level?

amir-zeldes commented 8 months ago

With regard to Wackernagel particles:... So something is wrong with such a fixed. ... should not be used for multiword expressions that are morphosyntactically flexible

I disagree with this - such cases are not morphosyntactically flexible, they retain both the same valency structure and the same constituent words exactly, with no change. The behavior of Wackernagel particles inserted in the middle of a fixed phrase can be explained on purely phonological grounds, when the first part of the fixed expression is stressed.

Else, the second-position rule would make the Wackernagel particle appear after the whole block (and this, in fact, happens).

Well, if it happens only some of the time, and we consider it a fixed expression there, shouldn't we want it to have the same structure even when a particle interrupts it? It means exactly the same thing, and the particle isn't dominated by either part of the fixed expression, indicating the "disrespect" of syntax I was referring to. For example, Greek εἰμή is generally considered a lexicon entry meaning "unless" and is often tokenized as one word. But historically it has two parts (<if not), and we can find both uninterrupted cases, annotated in UD as fixed, and rare interrupted ones (but only sentence initially, due to Wackernagel's Law). Here's such a case:

[Image] "But unless you repent, I will come to you and remove..." (Rev. 2:5). Lit.: If-but-not come.1.sg you.dat and remove.1.sg.fut ...

Notice that the intervening particle is dominated by the root from outside the fixed expression - it is not properly part of the phrase that it projects. I understand why the annotators would want such cases to be annotated in a way that is consistent with the much more common non-interrupted cases.

Stormur commented 8 months ago

It might not be the most common behaviour for such particles, but if it does happen, then it is a sign that the presupposed fixed morphosyntax is actually not really fixed, but is just a very frequent linear co-occurrence of words. Otherwise we are just identifying some very frequent linear co-occurrences, retroactively declaring them fixed in perpetuo, and then treating phenomena that go against this as "exceptions" (as for due largely to), even complicating annotation as in these cases. In other words, the analysis is biased. It is similar in a way to contextual annotation.

Let's go to the specific Greek example you bring forth.

[Image] "But unless you repent, I will come to you and remove..." (Rev. 2:5). Lit.: If-but-not come.1.sg you.dat and remove.1.sg.fut ...

You say it yourself: what is orthographically a unit εἰμή consists of two syntactic words: εἰ + μή. Why fixed? One is acting as SCONJ/mark and the other one as PART/advmod:neg, each depending on the predicate. We might translate it as a single word "unless", but Greek by all means uses this combination "if not". The fact that we see δὲ appearing in between just strengthens the case against fixed. Why would we want to force a fixed, also non-projective annotation even against such evidence?! Then of course there are many reasons why they often appear adjacent (they are both words tending to appear at the left margin of the phrase) and why because of this at some point they might be represented as a single orthographic word... but that is just spelling. Let's choose the representation that better treats all the possible cases.

The same line of reasoning can be followed for 99.9% of the "gapped fixed" cases.

By the way, it does not seem that dictionaries converge on εἰμή being an autonomous lexical entry (it is not for the Lewis & Short, for example). Also the Perseus link is not producing anything.

Stormur commented 8 months ago

Another thought: it seems that fixed is taken to represent words which for some reason are spelt as separated elements, e.g. each other. Then it is difficult to admit that these units can be interrupted by other elements, if we take uninterruptibility as a reasonable sign of wordhood.

So, for example in Latin, we observe

but never ever

even if we can clearly identify a root dic and a TAM-person affix unt, the stress goes on dic-, etc.

Conversely we find:

where enim intervenes only after the phrase ADP+NOUN ad verba (in the second place with respect to the co-ordinated blocks), but this is of course not an argument to say that the ADP forms a fixed block, because among other things we can find ad primam mulierem 'to the first woman', with some material in between ADP and NOUN. In this case it is an ADJ, but a discourse PART is equally valid for this argument, all the more since we can observe the non-occurrence of cases like (2).

amir-zeldes commented 8 months ago

εἰμή consists of two syntactic words: εἰ + μή

@Stormur are you saying that etymology is always paramount? If so, how do we know that we should not divide unless (<on+less), or whatever (what+ever)? It's true that they are spelled together, but so is εἰ μή, at least in post-classical Greek, and "whatever" is even interruptible in "whatsoever". I am not a Greek lexicographer, but if Autenrieth treats it as a word I think he must have had a reason. Some possibilities that come to mind are the frequent omission of the verb next to it, the limitation of the meaning to a specific subset in that construction, the gradual death of μή as a negation despite the survival of this construction, and more. There is a whole discussion on it here for example.

I agree that dicunt is different, because "unt" is an inflectional suffix, and not a 'syntactic word' as discussed in #1006 . But UD Latin also has interrupted fixed expressions, for example si forte "perhaps", is interrupted in this example, here too due to enclisis, and is still annotated as fixed:

[Image]

I think the view that any interruption, incl. by a structure not dominated by the fixed expression itself, disqualifies it as fixed, is rather Anglo (Euro?) centric - many languages have much freer word orders than English, and there the situation we see here can easily arise, where what is otherwise considered a fixed expression in similar or even related languages, can be separated by material solely due to phonological reasons, such as enclisis. I think languages should have leeway in choosing their own set of fixed expressions, and I don't think that a single split occurrence of what is otherwise a paradigm example of a fixed expression should prevent its use in the entire language as fixed.

nschneid commented 8 months ago

The fixed guidelines were recently revised taking into account input from MWE folks at Dagstuhl. If further substantive changes are to be considered I think it would have to go through UniDive.

The question for this issue is whether, given the current guidelines, it would make sense to tell the validator when a (generally rare kind of) annotation is intentional and shouldn't trigger a warning.

Stormur commented 8 months ago

εἰμή consists of two syntactic words: εἰ + μή

@Stormur are you saying that etymology is always paramount? If so, how do we know that we should not divide unless (<on+less), or whatever (what+ever)? It's true that they are spelled together, but so is εἰ μή, at least in post-classical Greek, and "whatever" is even interruptible in "whatsoever".

Etymology is not paramount, but it surely can be decisive in choosing some annotation strategies. In this case, though, I do not think it is even etymology, it is simply a composition of words. It is not at least in the sense in which we see that non 'not' in Latin is derived from ne unum 'not one', and I would never propose to split something similar (probably the same goes for unless). The interruptibility of whatever and the identification of so as an independent word is in fact an argument for splitting: it seems at least that whatever is not as much a word as dicunt.

A lexicographic entry like that by Autenrieth is not necessarily taking a stance that impacts on a UD-style annotation. I do not even think it is implying something towards wordhood, but just identifying a very common co-occurrence of two terms. We can surely observe an evolution of an expression like εἰμή, but if a unitary treatment of its possible (and very often just supposed...) nuances still makes sense at a morphosyntactic level, I do not see much reason to let other factors interfere.

But UD Latin also has interrupted fixed expressions, for example si forte "perhaps", is interrupted in this example, here too due to enclisis, and is still annotated as fixed:

[Image]

Hm, this was either left out from the reannotation of fixed we did in Latin treebanks, or lies in PROIEL (which we do not manage). But the whole syntax here is spurious, as is seen from quid depending as obj from SCONJ si, a nonsense. si forte is nothing, it does not even appear in dictionaries and it is simply a sequence meaning 'if perhaps'. You can put anything after a connective. The analysis as fixed is totally unwarranted and yes, the interruptibility by another element is just a confirmation of that.

I think the view that any interruption, incl. by a structure not dominated by the fixed expression itself, disqualifies it as fixed, is rather Anglo (Euro?) centric - many languages have much freer word orders than English, and there the situation we see here can easily arise, where what is otherwise considered a fixed expression in similar or even related languages, can be separated by material solely due to phonological reasons, such as enclisis. I think languages should have leeway in choosing their own set of fixed expressions, and I don't think that a single split occurrence of what is otherwise a paradigm example of a fixed expression should prevent its use in the entire language as fixed.

I could point out that it could be seen as the contrary: the tying together of word co-occurrences as fixed depending on regular, sometimes unchanging word orders is "Euro-centric" with respect to languages with supposedly freer word orders. That is, fixed runs in the opposite direction to free word order, because if order is free, then maybe we want to recognise the independent syntactic behaviours of those words. But I do not see any X-centrism, maybe only a "dictionary-centrism" at times.

Rather than giving languages too much leeway, I would like to see some suggestions in the guidelines on how to favour an annotation style that avoids fixed.


The fixed guidelines were recently revised taking into account input from MWE folks at Dagstuhl. If further substantive changes are to be considered I think it would have to go through UniDive.

The question for this issue is whether, given the current guidelines, it would make sense to tell the validator when a (generally rare kind of) annotation is intentional and shouldn't trigger a warning.

I do not know how much they coincide, but for example PARSEME (as far as I understand from the papers) is also pushing towards a reanalysis of fixed shifting the "relation" between involved words to another, the MWE, level of annotation.

In any case I would not eliminate the warning, as it points to many factual non-ideal annotations (as I think I have shown in the previous cases). Maybe I could envision an option for the validator to suppress warnings in general? A kind of "less strict validation"? But I would leave them there somewhere as possible reminders that, very probably, some intervention needs to be done.
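As a rough illustration of such a "less strict" mode, the sketch below hides individual warnings but still leaves a one-line reminder per warning type, so they stay visible somewhere. Everything here is an assumption for illustration: the `report` function, the `lenient` switch, the `(level, test_id, text)` message shape and the test ids are hypothetical and do not describe the validator's actual options.

```python
from collections import Counter


def report(messages, lenient=False):
    """Format validation messages as printable lines.

    messages: iterable of (level, test_id, text) tuples (hypothetical shape).
    In lenient mode, warnings are not listed one by one; they are counted
    and summarised at the end so that they remain as reminders.
    """
    lines = []
    suppressed = Counter()
    for level, test_id, text in messages:
        if lenient and level == "Warning":
            suppressed[test_id] += 1
            continue
        lines.append(f"[{level} {test_id}] {text}")
    for test_id, count in sorted(suppressed.items()):
        lines.append(f"({count} warning(s) of type {test_id} not shown)")
    return lines
```

Errors are always printed in full; only the warning level is affected by the switch.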

amir-zeldes commented 8 months ago

si forte is nothing... A lexicographic entry like that by Autenrieth is not necessarily taking a stance ... whatever is not as much a word as dicunt.

I am not an expert on Latin, and it's very possible "si forte" is not a good candidate for fixed (indeed, Lewis has no such entry). I know Greek better and can see why this was done for εἰμή, especially when there is no verb, but I agree it's arguable, like many of the guideline decisions we make. Autenrieth is certainly not making UD decisions, but the existence of a dictionary entry is, in my view, definitely a relevant argument when debating fixed expressions, though of course not the only one.

Ultimately, it's about consistency and knowing the language in question and its UD annotations in detail. I'm not really involved in those decisions for Greek or Latin, but I am for English, and I definitely don't want to split up "whatever", which is quite lexicalized and equivalent to a single 'syntactic word' in every function I can think of. Other English corpus designers have seen it the same way, so that's the English-specific decision - the Greek one can be similar or different, but it's not trivial to distinguish that "unless" is different from "εἰμή", or how many words "nevertheless", or "whatsoever" or "gonna" should all be.