Closed nschneid closed 4 days ago
N.B. Currently I have an EWT instance that is triggering a validator error: it is for the adpositional expression "due to" attaching as `case`, only the "to" is omitted so there is no `fixed` dependency. It would make sense to tag "due" as ADJ and ExtPos=ADP, but the validator needs to be updated to recognize the latter because it is not allowing an ADJ to attach as `case`.
I sincerely do not see much utility in this, as for `fixed` and most other cases this is already determined by the deprel: that is, ExtPos = expected POS of the deprel.
While I would be interested in discussing something similar when it is tied to an effective morphological strategy, e.g. in relation to `VerbForm`.
> for `fixed` and most other cases this is already determined by the deprel: that is, ExtPos = expected POS of the deprel.

Indeed, the ExtPos information is already implied if the deprel is correct and is a functional relation (`cc`, `case`, `mark`, or `advmod`). But there are cases where `fixed` is used and the deprel is something less specific (e.g. pronouns can attach in a variety of ways), and in general, making ExtPos explicit highlights on the same line as the first word the fact that its UPOS does not control its deprel.
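To make the "already implied" case concrete, here is a minimal sketch, not taken from the validator or any treebank tooling. The table `FUNCTIONAL_DEPREL_TO_UPOS` and the function `implied_extpos` are hypothetical names introduced only for illustration; the mapping covers just the four functional relations named above and returns nothing for less specific deprels, which is where an explicit ExtPos carries information.

```python
# Hypothetical mapping, for illustration only: the four functional relations
# named above and the UPOS each one usually implies.
FUNCTIONAL_DEPREL_TO_UPOS = {
    "cc": "CCONJ",
    "case": "ADP",
    "mark": "SCONJ",
    "advmod": "ADV",
}


def implied_extpos(deprel):
    """Return the UPOS implied by a functional deprel, or None otherwise."""
    base = deprel.split(":", 1)[0]  # ignore subtypes such as "advmod:emph"
    return FUNCTIONAL_DEPREL_TO_UPOS.get(base)


print(implied_extpos("case"))   # a functional relation
print(implied_extpos("nsubj"))  # a less specific relation
```

For a deprel such as `nsubj` or `conj` the function has nothing to say, so the external category of a `fixed` expression would have to be recorded some other way.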
> While I would be interested in discussing something similar when it is tied to an effective morphological strategy, e.g. in relation to `VerbForm`.

Could you elaborate?
> But there are cases where fixed is used and the deprel is something less specific (e.g. pronouns can attach in a variety of ways)

Exactly, for example "each other" has `ExtPos=PRON` but a variety of deprels.
Annotation practices of course interfere with what would be the "expected" POS (`ExpPos` :grimacing: ) for a dependency relation. But let's take each other as a specific example.
If I am not mistaken, this sequence would be labelled with `ExtPos=PRON` because it is considered a MWE behaving as a whole as a reciprocal pronoun. This means that we expect it to get relations obj, nsubj, obl, iobj: all of these entail a nominal part of speech, so either `NOUN` (+ `PROPN`) or `PRON`. The fact that this MWE is ascribable to PRON rather than NOUN derives from the fact that its "head" (and actually both elements) are of a synsemantic nature. But anyway, this would be an internal distinction to the fact of "behaving nominally". There are also other possible relations like `conj`, `orphan`, `parataxis`... which are neutral with respect to parts of speech, so they are not relevant here.
From the data, there appear to be just these relations in English treebanks for each other. Now, imagine that it were annotated with relation `advmod`. I am quite confident that in this case `ExtPos` would be set to ADV; if not, the correctness of `advmod` would be very doubtful (and in fact I think it would not be correct). This goes to show that ExtPos is a case of contextual annotation, as it is mechanically determined by the dependency relation: it is redundant and not useful. (Incidentally, I am very much against enforcing warnings from the annotator if this feature is to be annotated under `MISC`.)
UPDATE: I recognise the following interpretation is faulty, I am sorry for this. I am toning it down but I am leaving it here for the more general points.
Now, still more specifically to each other is why it should be annotated as `fixed`. It seems transparent: you have a contrastive element other modified by a distributive each, and this is a determinantal (or it might be argued, pronominal) phrase which behaves as any other nominal argument. I see that in some treebanks the "head" each gets the feature PronType=Rcp, which is problematic: if annotated at all, this should also go into `MISC`, exactly as it has been proposed for `ExtPos`. I think however that here we need to refer to a MWE annotation level and not let it percolate onto the morphosyntactic one.
By the way, the English case is quite different from the more or less corresponding Latin one, where we have a reciprocal element invicem: while this has transparent etymology in + vicem 'in [smb's/the other's] turn', it really looks crystallised and it does not appear where you would expect an oblique nominal phrase: you have it used as an `obj`, or you have things like ab invicem 'from each+other', ad invicem 'to each+other', etc. (i.e., here you would have two adpositions). No reason to split it to have it again annotated as `fixed`: this might appear on a derivational annotation layer, but it does not seem appropriate to the morphosyntactic one anymore.
So really I cannot see what ExtPos would add.
> > While I would be interested in discussing something similar when it is tied to an effective morphological strategy, e.g. in relation to `VerbForm`.
>
> Could you elaborate?

Here allow me to refer to my article Formae reformandae (UDW5). Traditionally, we have labels like participle, infinitive, supine, masdar, etc. to refer to particular forms in verbal paradigms whereby a verb gets to be used as a different part of speech, as it were. So, the participle is a verbal adjective whereby I can say (examples in Latin):
The form scriptura behaves in all respects like an adjective: inflection for gender/number/case, possibility of degree (scripturior, scripturissima), possibility of adverbialisation (scripture); but then also as a verb, in that it can have the same argument structure: scriptura librum 'going to write a book', with accusative, instead of a nominal strategy like genitive *scriptura libri.
So, in the end using `VerbForm=Part` would be equivalent to `ExtPos=ADJ` (in fact, I have proposed a notation like `Transposed=ADJ`), but in this case this is tied effectively to morphology, and not to an invisible "global property" of a MWE.
> Now, still more specifically to each other is why it [each other] should be annotated as `fixed`.

Discussions of English pronouns are at UniversalDependencies/docs#517, and docs at https://universaldependencies.org/en/pos/PRON.html. While it might be nice to show the historical origin of the expression with a relation other than `fixed`, it seemed our best option to express the reciprocal slot of the pronoun paradigm was to use `fixed` and treat the whole thing as PRON.
> So really I cannot see what ExtPos would add.

Without ExtPos, how would one search a treebank for all expressions acting as pronouns? The rule would need to specify individual lexical items like "one another". But with ExtPos, it is easy to find the ones that are not PRON at the individual word level.
You mentioned `conj` etc.: these are cases where it is not always trivial to detect the UPOS from the deprel. From English-GUM: "husbands are likely to laugh at jokes about wives and vice versa". Here ExtPos is necessary to express that "vice versa" functions as an ADV (coordinated with an ADJ).
There also may be languages with fixed expressions functioning as PART, for example. PART is idiosyncratic and not necessarily predictable from the deprel.
> So, in the end using VerbForm=Part would be equivalent to ExtPos=ADJ (in fact, I have proposed a notation like Transposed=ADJ), but in this case this is tied effectively to morphology, and not to an invisible "global property" of a MWE.

The line between VERB and ADJ can be tricky and I don't know enough about Latin to weigh in here (VerbForm=Part as used in English is NOT equivalent to occurring in ADJ-like environments), but yes, there may be many good uses of ExtPos beyond fixed expressions.
UPDATE: I know that, in the haste of writing, I put forth a faulty interpretation of English each other; I am sorry (but I am leaving it there). This however does not invalidate the other points.
Anyway, this is yet another case where, if each other is indeed a unique word like Latin invicem, written separately just for the vagaries of orthography, I think a token with spaces could be welcome.
> Without ExtPos, how would one search a treebank for all expressions acting as pronouns? The rule would need to specify individual lexical items like "one another". But with ExtPos, it is easy to find the ones that are not PRON at the individual word level.

One would look for all elements with nominal relations (`nsubj`, `obj`, `nmod`, ...) and select those whose head falls into a synsemantic word class. If the head is not synsemantic, I would put in doubt the pronominality of the expression. Conversely, I don't think that we want to assign `ExtPos=ADV` to phrases like German pro Kopf ~ 'each', lit. 'per head', or to any other oblique.
A similar thing already has to be performed to retrieve predicates: a word receiving `advcl`, `csubj`, etc. can well be a non-verb with an auxiliary. But I do not think that we want to assign `ExtPos=VERB` to those occurrences. The relation already tells us that. On the other hand, it is interesting to know if a `csubj` is headed by a verb form "mimicking" a `NOUN` or an `ADJ`.
I am somewhat worried that a feature like `ExtPos` could go out of hands and be very much misinterpreted by new annotators, as it already happens for `fixed`.
> You mentioned `conj` etc.: these are cases where it is not always trivial to detect the UPOS from the deprel. From English-GUM: "husbands are likely to laugh at jokes about wives and vice versa". Here ExtPos is necessary to express that "vice versa" functions as an ADV (coordinated with an ADJ).

This is a general problem which goes beyond the appropriateness of annotating `ExtPos`.
In this specific case, the issue has to be solved by addressing how to mark the presence of an ellipsis and/or the nature of vice versa: the annotation as `ADV` is a confusing factor here (in the sense that it does not look like the right solution, at least not to me). Annotating `ExtPos` here does not add anything; if anything, it makes it even more confusing (I would immediately go look into the data to understand what justifies this asymmetry).
> There also may be languages with fixed expressions functioning as PART, for example. PART is idiosyncratic and not necessarily predictable from the deprel.

We would need some example to discuss this. Anyhow, `PART` is rather restricted in what it can be associated with. Another point is that it is this idiosyncrasy of `PART` annotation that is the problem we have to address.
> > So, in the end using VerbForm=Part would be equivalent to ExtPos=ADJ (in fact, I have proposed a notation like Transposed=ADJ), but in this case this is tied effectively to morphology, and not to an invisible "global property" of a MWE.
>
> The line between VERB and ADJ can be tricky and I don't know enough about Latin to weigh in here (VerbForm=Part as used in English is NOT equivalent to occurring in ADJ-like environments), but yes, there may be many good uses of ExtPos beyond fixed expressions.

It really is the same in any Indo-European language (and beyond). What are non-ADJ-like environments of English `VerbForm=Part` (which should at the same time be non-VERB-like)? If it were so, could I dare to suggest that this annotation might need some revision from a typological point of view?
But the point is, transposition exists and a unified way to mark it could be useful.
> I am somewhat worried that a feature like `ExtPos` could go out of hands and be very much misinterpreted by new annotators, as it already happens for `fixed`.

Some treebanks are already using `ExtPos`. Treebanks are free to innovate with MISC attributes. As far as the validator is concerned, the only change will be for `fixed` expressions (and it will be a warning not an error). If there is enthusiasm for a broader definition of `ExtPos` down the road, that might lead to new guidelines, but I think that would be premature at this point.
I suspect requiring `ExtPos` on fixed expressions might actually encourage treebanks to reduce their use of `fixed`, because they will realize that most semantic multiword expressions can be accommodated by syntactically regular deprels (but we'll see).
It seems to me that most of your objections above are actually objections to the `fixed` analysis in the first place. I don't want to bog down this thread with debates about particular expressions, but given that the relation exists to capture grammatical words-with-spaces, it doesn't seem like there is much harm in assigning those a holistic tag (even if it is sometimes inferable from the deprel, just as ADP, ADV, CCONJ, SCONJ are usually inferable from the deprel for single words). Explicitly flagging, e.g. for "rather" in "rather than", that it is an ADV internally and part of a CCONJ expression externally (rather than some other anomaly leading to ADV/cc) seems like it would help treebank users see what is going on.
I don't even understand why there is a discussion about the relevancy of ExtPos. ExtPos is just as relevant as upos, not more, not less. @Stormur if you said that ExtPos can be inferred from the syntactic relation, the same could be said about upos. (I don't think it is true that the POS can be inferred from the syntactic relation but that's not the point.) And even if it could be inferred, what is the problem with adding ExtPos? I really don't understand the point.
One of the reasons we introduced ExtPos (apart from the fact that in SUD our syntactic relations are less redundant with upos) is that it was difficult to track down the annotation errors or to find strange constructions because we had many unexpected upos-relation pairs. It is possible with Grew-match to search for elements that have ExtPos=ADV or, if there is no ExtPos, upos=ADV, and thus to get all the ADVs consisting of one or several tokens (if you put ExtPos on all fixed expressions, as in the French treebanks).
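The same "ExtPos if present, otherwise upos" lookup can be sketched outside Grew-match. The snippet below is only an illustration in plain Python over CoNLL-U text, not the Grew-match query itself: the function names and the three-token `SAMPLE` are made up for the example, ExtPos is looked up in FEATS first and then in MISC because both placements are mentioned in this thread, and skipping `fixed` dependents is an assumption made so that each expression is counted once, at its first word.

```python
def parse_attrs(column):
    """Parse a CoNLL-U FEATS or MISC column ("A=1|B=2" or "_") into a dict."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|") if "=" in item)


def effective_upos(cols):
    """ExtPos (from FEATS, else MISC) if present, otherwise the token's UPOS."""
    feats = parse_attrs(cols[5])
    misc = parse_attrs(cols[9])
    return feats.get("ExtPos") or misc.get("ExtPos") or cols[3]


def find_by_effective_upos(conllu_text, target):
    """Return the forms of tokens whose effective UPOS equals `target`."""
    hits = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        if cols[7] == "fixed":
            continue  # assumption: non-initial words of a fixed expression are not counted
        if effective_upos(cols) == target:
            hits.append(cols[1])
    return hits


# Made-up toy sentence, only to exercise the functions above.
SAMPLE = "\n".join([
    "# text = rather than quickly",
    "1\trather\trather\tADV\t_\tExtPos=CCONJ\t3\tcc\t_\t_",
    "2\tthan\tthan\tADP\t_\t_\t1\tfixed\t_\t_",
    "3\tquickly\tquickly\tADV\t_\t_\t0\troot\t_\t_",
])

print(find_by_effective_upos(SAMPLE, "ADV"))
print(find_by_effective_upos(SAMPLE, "CCONJ"))
```

With this lookup, "rather" is retrieved as a CCONJ rather than as an ADV, which is the kind of unexpected upos-relation pair that otherwise has to be filtered by hand.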
No problems in using it if one sees fit to do that, but only with making it more or less mandatory with warnings from the validator. I am contrary to that.
Then, my personal considerations about its utility still stand.
> if you said that ExtPos can be inferred from the syntactic relation, the same could be said about upos

I think it is slightly different in that I do not envision `ExtPos` for fixed being other than contextual, more or less by definition given its "externality".
While in general it is true we are interested to see whether, say, an `nmod` is realised by a `NOUN`/`PROPN`, `PRON`, `ADJ`, `DET`, `NUM`, `VERB` with a `VerbForm`... but in those cases, we have a syntactic word which does show characteristics of that word class.
> It seems to me that most of your objections above are actually objections to the `fixed` analysis in the first place.

This is for sure a very big problem.
> > if you said that ExtPos can be inferred from the syntactic relation, the same could be said about upos
>
> I think it is slightly different in that I do not envision `ExtPos` for fixed being other than contextual, more or less by definition given its "externality".

I think it does not have to be that way. If I have to add `ExtPos` to all `fixed` expressions in a treebank using a script, the script will not look at the context and make inferences like "the incoming deprel is `advmod`, hence `ExtPos=ADV`". Instead, the script will have a list of the fixed expressions in the language and a "dictionary" UPOS for each of them. I may discover expressions that are currently `fixed` but I do not want them on the list, so I will change their annotation. And after I apply the script, I may ask the validator whether some of them occurred in a context that is not compatible with its new `ExtPos`, and fix the annotation if it does.
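A minimal sketch of such a dictionary-driven script, under stated assumptions: the token representation (a list of dicts), the function `add_extpos`, and the small `FIXED_LEXICON` are hypothetical, with lexicon entries taken from expressions mentioned in this thread. The point it illustrates is that the incoming deprel is never consulted; expressions missing from the list are merely reported so that their annotation can be reviewed.

```python
# Hypothetical "dictionary" of fixed expressions and their UPOS as a whole.
FIXED_LEXICON = {
    "each other": "PRON",
    "rather than": "CCONJ",
    "vice versa": "ADV",
}


def add_extpos(tokens, lexicon):
    """Set ExtPos on the first word of each fixed expression found in `lexicon`.

    `tokens` is a list of dicts with keys id, form, head, deprel and feats
    (feats is a dict that is modified in place). Expressions that are not in
    the lexicon are returned so their `fixed` analysis can be reconsidered.
    """
    unknown = []
    for tok in tokens:
        fixed_children = [t for t in tokens
                          if t["head"] == tok["id"] and t["deprel"] == "fixed"]
        if not fixed_children:
            continue
        expression = " ".join(t["form"].lower() for t in [tok] + fixed_children)
        extpos = lexicon.get(expression)
        if extpos is None:
            unknown.append(expression)
        else:
            # Note: tok["deprel"] plays no role in this decision.
            tok["feats"]["ExtPos"] = extpos
    return unknown


sentence = [
    {"id": 1, "form": "help", "head": 0, "deprel": "root", "feats": {}},
    {"id": 2, "form": "each", "head": 1, "deprel": "obj", "feats": {}},
    {"id": 3, "form": "other", "head": 2, "deprel": "fixed", "feats": {}},
]
print(add_extpos(sentence, FIXED_LEXICON), sentence[1]["feats"])
```

Any compatibility check between the assigned ExtPos and the context would then be a separate, later step.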
> I may discover expressions that are currently `fixed` but I do not want them on the list, so I will change their annotation. And after I apply the script, I may ask the validator whether some of them occurred in a context that is not compatible with its new `ExtPos`, and fix the annotation if it does.

I understand, but this is independent from `ExtPos` and based just on a query for `fixed`...
Yes, there is definitely extra work required. But if the validator is modified to take `ExtPos` into account, some of its current tests can be applied. The current state is that if the validator sees a `fixed` child, it will turn off many of its UPOS-DEPREL compatibility tests.
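The difference can be pictured with a toy check. This is only a sketch and not the validator's code: the rule table `ALLOWED_UPOS` is a hypothetical, heavily simplified stand-in for the real UPOS-DEPREL tests, and the function name and return values are invented for the example.

```python
# Hypothetical, simplified rule table; the real validator's tests are more detailed.
ALLOWED_UPOS = {
    "case": {"ADP"},
    "cc": {"CCONJ"},
}


def check_upos_deprel(upos, deprel, extpos=None, has_fixed_child=False):
    """Return "ok", "error" or "skipped" for one word.

    With ExtPos, the test runs against the external category. Without it,
    the test is switched off whenever the word has a `fixed` child.
    """
    allowed = ALLOWED_UPOS.get(deprel.split(":", 1)[0])
    if allowed is None:
        return "ok"  # no rule for this deprel in the toy table
    if extpos is not None:
        return "ok" if extpos in allowed else "error"
    if has_fixed_child:
        return "skipped"
    return "ok" if upos in allowed else "error"


# "due" (ADJ) attaching as case, as in the EWT instance mentioned earlier.
print(check_upos_deprel("ADJ", "case"))
print(check_upos_deprel("ADJ", "case", extpos="ADP"))
# "rather" (ADV) heading "rather than" and attaching as cc.
print(check_upos_deprel("ADV", "cc", has_fixed_child=True))
print(check_upos_deprel("ADV", "cc", extpos="CCONJ", has_fixed_child=True))
```

In this toy version, a `fixed` head without ExtPos is never tested, whereas one with ExtPos is tested like any single word.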
But this would be an extra test created from nothing, from the addition of this feature which itself can only be added on contextual grounds as by definition it cannot depend on the characteristics of the single components. Because if it did, then why `fixed`? And so it all boils down again to just checking all `fixed` combinations, whatever their dependency relations.
There is circularity here. I also fear that making `ExtPos` de facto mandatory would lead to an increase of `fixed` expressions in new annotation endeavours, as in a sense this would justify the use of `fixed` more than it is warranted (while we actually need the opposite, I think).
Now I will sit silent because I think I have already insisted too much on these points (sorry) and I am becoming grumpy and repetitive. But do not get me wrong, I can understand the implementation of tests like the ones you describe. However, all in all, I believe that these possible benefits are extremely marginal at best and that drawbacks on the contrary are too many. I would like to see a different "angle of attack" to the issues that we are confronting here.
Today the Core Group discussed FEATS vs. MISC and voted that FEATS would be a better home for ExtPos. Most MISC attributes are optional and unregulated at the universal level; putting ExtPos in FEATS gives it greater visibility and is in keeping with existing practice by the SUD group. Another practical advantage is a clear home in the docs for universal + language-specific pages (e.g. https://universaldependencies.org/en/feat/ExtPos.html). The encouragement to document the different values of ExtPos with examples in each language may have the effect of promoting discussion of the appropriate scope of `fixed`.
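For treebanks that currently keep ExtPos in MISC, the move to FEATS is mechanical. The following is a rough sketch rather than an official migration script: the function `move_extpos_to_feats` is a hypothetical name, it works on a single tab-separated CoNLL-U token line, it lets the MISC value win if FEATS already carries an ExtPos (an assumption), and it re-sorts FEATS alphabetically without regard to case, as the CoNLL-U format expects. The sample line is made up, modelled on the "due" case mentioned earlier.

```python
def move_extpos_to_feats(line):
    """Move ExtPos=... from the MISC column to the FEATS column of one token line."""
    cols = line.split("\t")
    if len(cols) != 10:
        return line  # comments, blank lines, or malformed rows are left alone
    misc = [] if cols[9] == "_" else cols[9].split("|")
    extpos = [m for m in misc if m.startswith("ExtPos=")]
    if not extpos:
        return line
    feats = [] if cols[5] == "_" else cols[5].split("|")
    # Assumption: an ExtPos already present in FEATS is replaced by the MISC value.
    feats = [f for f in feats if not f.startswith("ExtPos=")] + extpos[:1]
    feats.sort(key=str.lower)  # FEATS are kept in case-insensitive alphabetical order
    rest = [m for m in misc if not m.startswith("ExtPos=")]
    cols[5] = "|".join(feats)
    cols[9] = "|".join(rest) if rest else "_"
    return "\t".join(cols)


# Made-up token line for illustration.
before = "4\tdue\tdue\tADJ\tJJ\tDegree=Pos\t6\tcase\t_\tExtPos=ADP"
print(move_extpos_to_feats(before))
```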
@dan-zeman has drafted a universal guidelines page: https://universaldependencies.org/u/feat/ExtPos.html
A couple of questions about French examples:
"de la" on https://universaldependencies.org/u/feat/ExtPos.html – is this consistent with the treebanks? I can't find examples in the data.
"plutôt que" on https://universaldependencies.org/u/dep/fixed.html – the treebanks are not consistent. Is this a clear example to use, and if so what should its ExtPos be for the example—ADP or CCONJ?
I took the French examples from the French documentation but I did not verify them in the French treebanks.
I switched the "plutôt que" example to a "bien que" example from one of the treebanks.
@sylvainkahane or @bguil, maybe you could confirm the "de la" example of `ExtPos=DET`? Why would that not just be an ordinary ADP + DET combination?
Here are all the values of ExtPos in the French GSD treebank: https://universal.grew.fr/?custom=66841dc7423dd. If you look at the DET value you find "de la" (and its variant "de l'"). Note that "de la" is not always an indefinite determiner, it can also be the combination of ADP "de" and the definite determiner "la".
The de la example occurs 9 times in Sequoia.
Ah I was querying for "la" as the lemma when it should be "le". OK I guess this is the partitive article construction. (Curious: Can "de la" ever be used on a subject? I mainly see it following a verb or preposition, where historically "de" might have acted as a preposition.)
Two answers to @nschneid.
1) Yes "de la" is the partitive article. I don't like this notion, in fact it is just the indefinite article for mass nouns. Note that the plural indefinite article "des" is also a portmanteau "de+les".
2) Using an indefinite article in the subject position is not very felicitous in French: https://universal.grew.fr/?custom=668424e695391. When the subject is indefinite, we have a special construction. Rather than saying S V, we prefer "il y a S qui V" 'there is S that V', especially in spoken French: https://universal.grew.fr/?custom=668429b9e5348.
In terms of implementing this in English treebanks such as PUD, are we at the point of labeling `sort of` etc, or not there yet?
> FEATS would be a better home for ExtPos

Sounds good, will implement for GUM as well
> In terms of implementing this in English treebanks such as PUD, are we at the point of labeling `sort of` etc, or not there yet?

This ExtPos policy applies to all `fixed` expressions, if that's what you're asking. If there are questions about what counts as `fixed` that should go in other issues.
Actually I just mean - are we now ready to label `fixed` expressions in PUD, or is there a reason to wait for the standard to be finalized and/or the validator to be updated?
We're ready to implement! The validator is not updated yet (once it is there will be an official announcement of the new policy), but I've already implemented in EWT.
GUM is implemented too, just moved it to FEATS, should update with the next push
Found some cases of `up to` which may need a `fixed` relation in EWT
Train section:
bundling together cheques of up to $1,000 from friends and family
but not up to the standards that I was told I should expect
the food was not up to par with the price tag
Test:
# text = I'll pay up to 200-250 for it if I have to.
Where is the line to draw for `as X as` expressions? There are some marked in EWT, such as
**as well as** the fun filled social dance evening held every Saturday evening
I will often have **as many as** one per kitten
but then many others are not marked, such as
We should know **as much as** we can
There are several `fixed` expressions marked in PUD which are not marked in EWT. Here are a few:
not marked in EWT:
after all
After all, the internet is not a luxury
as if
photographs that looked **as if** they were from the 1970s
at best
**At best** it is naive and at worst it would yet again...
close to ... similar to "approximately"
Cairo had a population of **close to** half a million
in addition ... "furthermore"
**In addition**, statute determines the election of assembly of regions
Marked in PUD but not existing in EWT:
more or less:
The working time undertaken in this first hour is more or less equal to 45 minutes.
What about `down to` in a phrase such as
# text = The horse I had posted about a couple weeks ago with the atrophied cheek muscles is down to his last resort for life.
incidentally, am happy that as a human, we have surgical options other than "shotgun" for dealing with atrophied cheek muscles
`next to` in EWT which possibly matches other `next to` `ExtPos`
If sites next to you don't have what you want
the sea next to you
I throw a treat across the floor or even right next to her paw
place it next to the couch
the fish look better next to them
right next to the ice machine
I don't see much difference with those and the following:
First room had used tissues next to the bed
It is next to Gare du Nord
although certainly there might be some subtle differences
dev, not marked:
# text = We are staying next to the airport which is located next to BARTrail.
test, not marked:
# text = Place is next to carval and walmart.
@AngledLuffa these are great questions/observations about `fixed` consistency. Could you please move them to separate issues as I'm sure some will require discussion?
Done
`ExtPos` should now also be observed by the validator, so I think this issue can be closed.
The first post says

> A description of how `ExtPos` should be used (at least for fixed, though treebanks may opt to use it for art titles, SYM, and so on). Where will this live? The MISC attributes page is a bit cluttered with treebank-specific/experimental attributes. Cf. https://universaldependencies.org/en/feat/ExtPos.html

I don't see `ExtPos` at https://universaldependencies.org/misc (or anywhere else), so I suggest to keep this issue open until `ExtPos` is properly documented.
Interestingly, MWEPOS is documented there and it is the only place where `ExtPos` is mentioned, but it says "Ideally, these two attribute names should be merged into one!"
Good point about MWEPOS in MISC. In fact, `ExtPos` has been documented in the meantime, as the core group decided it will be in FEATS.
In the French treebanks, ExtPos has been used for SYMs used as NOUNs (%, €, etc), CCONJs (&), etc.; for foreign words or combinations of letters and numbers (upos=X) used as PROPNs; for ADVs used as PRONs (a special construction of French, ADV de NOUN, where the ADV is the syntactic head and can also be used alone as a pronoun); and some other rarer examples.
The Core Group has agreed to encourage the use of `ExtPos` to specify how a word (or the expression/phrase it heads) functions with respect to its external deprel, where this may differ from the UPOS reflecting the word's morphology and dependents.
In particular, the first word of every `fixed` expression should specify `ExtPos` to reflect the UPOS that the whole expression would bear were it a single word.
Some treebanks have already been using `ExtPos` in MISC, others in FEATS. @dan-zeman has said that MISC is the appropriate place as it reflects properties broader than a single word's morphology.
I assume we need to do the following:
- A description of how `ExtPos` should be used (at least for `fixed`, though treebanks may opt to use it for art titles, SYM, and so on). Where will this live? The MISC attributes page is a bit cluttered with treebank-specific/experimental attributes. Cf. https://universaldependencies.org/en/feat/ExtPos.html
- ... `fixed` guidelines.