UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0
English nominal subtypes: merge :npmod and :tmod as :unmarked #1028

Open nschneid opened 6 months ago

nschneid commented 6 months ago

Because prepositions are so important in English, we have a well-established practice of distinguishing ordinary prepositional nmod and obl from other kinds via subtyping (nmod:poss, etc.).

In particular, nmod:tmod/obl:tmod have been used for non-prepositional temporal adjunct nominals like

in contrast to

tmod is part of the legacy of Stanford Dependencies. In light of current UD theory, it is an anomaly where the subtype reflects a semantic but not syntactic distinction (#893). Moreover, it is potentially confusing that only some temporal obliques (the prepositionless ones) receive the subtype.

Meanwhile, nmod:npmod/obl:npmod are used for OTHER non-prepositional adjunct nominals (in special constructions like "5 dollars a share" and "Shares eased a fraction"). The term "npmod" (derived from the npadvmod relation in Stanford Dependencies) has been a source of confusion and invokes a concept of NP that is not part of UD theory.

A discussion amongst the core group concluded that a subtype named :unmarked would be a less confusing way to implement the adpositional vs. non-adpositional distinction, for languages that choose to do so.

@amir-zeldes and I plan to implement this for our English corpora, by simply renaming both :tmod and :npmod to :unmarked. Perhaps English-Atis (@aslikuzgun), English-ESLSpok (@kristopherkyle), English-{LinES, Pronouns, PUD} (@AngledLuffa), English-ParTUT (@msang) would like to do so as well for consistency.

nschneid commented 6 months ago

As this is a trivial change to implement, but one that multiple treebanks may want to make in concert, is it better to update EWT/GUM before May 1 or wait until the next release?

AngledLuffa commented 6 months ago

I'm not the right @ for LinES, but I can do it in the CoreNLP converter, PUD, and Pronouns

@LarsAhrenberg I can do it if you want me to do it to LinES

AngledLuffa commented 6 months ago

Is this just literally a string replace over everything?

The only : relations marked in Pronouns are aux:pass and det:predet. Another job well done

AngledLuffa commented 6 months ago

PUD has plenty. Please confirm if there's any intelligence required to do this, or just ESC-shift-5

nschneid commented 6 months ago

Simple replacement. Since EWT lacks any entity annotation whatsoever, for the :tmod ones I think I'll add TemporalNPAdjunct=Yes in MISC to retain the semantic information for posterity. Eventually we should annotate all temporal entities.
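A minimal sketch of the replacement described above, assuming standard 10-column CoNLL-U input. The function name `rename_subtypes` is hypothetical, and rewriting the enhanced DEPS column alongside DEPREL is an assumption for illustration; this is not the script actually used for EWT.

```python
import re

# Deprecated subtypes on nmod/obl (assumption: no other relations carry them).
OLD = re.compile(r"\b(nmod|obl):(tmod|npmod)\b")


def rename_subtypes(line: str) -> str:
    """Rename :tmod/:npmod to :unmarked in a single CoNLL-U line."""
    if not line.strip() or line.startswith("#"):
        return line  # comments and sentence breaks pass through unchanged
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        return line  # not a token line; leave it alone
    was_tmod = cols[7] in ("nmod:tmod", "obl:tmod")
    cols[7] = OLD.sub(r"\1:unmarked", cols[7])  # DEPREL
    cols[8] = OLD.sub(r"\1:unmarked", cols[8])  # DEPS (enhanced graph)
    if was_tmod:
        # Keep the temporal information in MISC, as proposed in the comment.
        misc = [] if cols[9] == "_" else cols[9].split("|")
        if "TemporalNPAdjunct=Yes" not in misc:
            misc.append("TemporalNPAdjunct=Yes")
        cols[9] = "|".join(misc)
    return "\t".join(cols) + newline
```

Applied line by line to a `.conllu` file, this leaves comment lines and sentence breaks untouched and only adds the MISC attribute to tokens whose basic relation was a `:tmod` subtype.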

amir-zeldes commented 6 months ago

> is it better to update EWT/GUM before May 1 or wait until the next release?

Not sure, time is a bit tight. And it's not just English, where I can update the GUM, Reddit and GENTLE repos - I know of at least UD Coptic and Hebrew IAHLTwiki, which I maintain and which use these labels, so I could change those, but I haven't coordinated with the annotators about this. Do you know if there are other datasets using these subtypes? I wouldn't want to create differences between datasets on short notice just for a renaming.

dan-zeman commented 6 months ago

nschneid commented 6 months ago

OK let's not rush it then. Let's implement it in the 2.15 release.

mr-martian commented 6 months ago

For Ancient Hebrew the usage of obl:npmod isn't "preposition-less non-temporal obl" but rather the construction argued about in #832, so I'd need a new label for those if there is to be an effort to eliminate :npmod in general.

amir-zeldes commented 6 months ago

@mr-martian I think obl:unmarked is about as informative/appropriate as obl:npmod, so you may as well switch too (not saying it's an ideal label, but the previous one also makes no sense in the context of dependencies)

nschneid commented 5 months ago

I started to draft a new issue about this, forgetting that this one existed. :D One bit of information not included above is the alternatives that were discussed, which I'll put for posterity:

nschneid commented 5 months ago

Implemented for EWT, and created some initial docs:

Still need to update more docs pages and mark old subtypes as deprecated.

What are implementation plans for other treebanks?

LarsAhrenberg commented 5 months ago

So far UD_English-LinES has used neither :npmod nor :tmod, but it seems quite straightforward to implement :unmarked so I put it up for version 2.15.

AngledLuffa commented 5 months ago

I made a PR for PUD. I don't think it's relevant for Pronouns

LarsAhrenberg commented 4 months ago

Reviewing the outputs of my script adding :unmarked to obl and nmod tokens, I've come across a number of cases where I think the subrelation is reasonable but which are not covered in the initial docs (oblique, nmod). I would be grateful to hear the views of other people.

Multipart references to locations
at number four, Privet Drive nmod:unmarked(four, Privet)
by way of Northfield, Minnesota nmod:unmarked(Northfield, Minnesota)

Apposition like but without identity of reference
blamed for letting the quality of life (a deplorable phrase) deteriorate nmod:unmarked(quality, phrase)
Subject: The cost of enlargement nmod:unmarked(Subject, cost)
Your amendments uphold two important principles: the right of rightholders to fair remuneration and the ... nmod:unmarked(principles, right)

Personal pronoun + noun
I suppose you fellows remember... nmod:unmarked(you, fellows)
Go back to Stromboli, you dumb bastard nmod:unmarked(you, bastard)

Multi-word proper noun made adjective
a tall Puerto Rican man. nmod:unmarked(man, Puerto), flat(Puerto, Rican)

Pre-head modifier like 'a couple'
leather red with a suppleness to it that is part gift, part effort nmod:unmarked(gift, part), nmod:unmarked(effort, part)

Fronted or extraposed subject predicative
A kibbutznik seaman, he has just returned from a voyage. obl:unmarked(returned, seaman)
These grew spontaneously one out of the other, obl:unmarked(grew, one)

Sound imitations
Pop, would go one of the eight-inch guns; obl:unmarked(go, Pop) or maybe it should be obj(go, Pop)

nschneid commented 4 months ago

> Sound imitations
> Pop, would go one of the eight-inch guns; obl:unmarked(go, Pop) or maybe it should be obj(go, Pop)

"Pop" can't be omitted so it looks like obj to me (with an inverted word order; cf. 'Never!' said John).

> Pre-head modifier like 'a couple'
> leather red with a suppleness to it that is part gift, part effort nmod:unmarked(gift, part), nmod:unmarked(effort, part)

Interesting...haven't thought about this one:

> Multi-word proper noun made adjective
> a tall Puerto Rican man. nmod:unmarked(man, Puerto), flat(Puerto, Rican)

Because you can say "the man is Puerto Rican", I would lean toward treating the whole expression as an ADJ (ExtPos=ADJ). Thus: flat(Puerto/PROPN,ExtPos=ADJ Rican/ADJ) and amod(man, Puerto)

The rest have been discussed but not decided yet. See this paper for a synopsis and some proposals. If you want to contribute to the discussion: #455, UniversalDependencies/UD_English-EWT/issues/436, #751, #762, #933, #1024

amir-zeldes commented 3 months ago

OK, this change should now be done and documented for:

nschneid commented 3 months ago

Excellent!

Any updates regarding English-Atis (@aslikuzgun), English-ESLSpok (@kristopherkyle), English-ParTUT (@msang)? All of these use at least a subset of the {nmod:npmod, obl:npmod, nmod:tmod, obl:tmod} relations.

nschneid commented 2 months ago

I believe the English docs are now up to date, with mentions of :npmod and :tmod replaced with :unmarked.

I have not heard any objections to incorporating :unmarked into the remaining English corpora. @dan-zeman what is the policy regarding simple rule-based edits to other treebanks in the interest of within-language consistency?

dan-zeman commented 2 months ago

> I have not heard any objections to incorporating :unmarked into the remaining English corpora. @dan-zeman what is the policy regarding simple rule-based edits to other treebanks in the interest of within-language consistency?

It depends. If I know that a treebank is actively maintained (or was in the not-so-distant past), like EWT, I would hesitate to touch it without the current maintainer's consent. If I know that the data provider / last maintainer has been silent for a long time, I would just go and fix it. Ideally the validator should flag it as a new error and the treebanks should get their four-year grace period. But we currently have this mechanism only for the main guidelines, not for the language-specific relation subtypes.

dan-zeman commented 6 days ago

Is there a reason to keep this issue open or has everything been resolved?

nschneid commented 6 days ago

I think it's still open for Atis, ESLSpok, ParTUT.

msang commented 4 days ago

Hi, just for the record, the latest release of ParTUT includes this change

nschneid commented 3 days ago

@amir-zeldes I just discovered in GUM a few stray enhanced edges with :tmod: https://universal.grew.fr/?custom=673bf3feccc5f
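Leftovers of this kind can also be found with a quick scan of both the basic DEPREL column and the enhanced DEPS column. A rough sketch, assuming 10-column CoNLL-U with `# sent_id = ...` comments; `find_stray_subtypes` is a hypothetical helper and not part of any UD tool or of the Grew query linked above.

```python
from typing import Iterable, Iterator, Tuple

DEPRECATED = ("tmod", "npmod")  # subtypes replaced by :unmarked


def _has_deprecated(rel: str) -> bool:
    # Look only at subtype parts, i.e. everything after the universal relation.
    return any(part in DEPRECATED for part in rel.split(":")[1:])


def find_stray_subtypes(
    lines: Iterable[str],
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (sent_id, token_id, column, relation) for each deprecated label."""
    sent_id = "?"
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# sent_id"):
            sent_id = line.partition("=")[2].strip()
            continue
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 10:
            continue  # other comments, blank lines, malformed rows
        if _has_deprecated(cols[7]):
            yield (sent_id, cols[0], "DEPREL", cols[7])
        if cols[8] != "_":
            for edge in cols[8].split("|"):
                # Each enhanced edge has the form head:relation.
                rel = edge.partition(":")[2]
                if _has_deprecated(rel):
                    yield (sent_id, cols[0], "DEPS", rel)
```

Reporting the column name makes it easy to spot the case described here, where the basic tree was already converted but the enhanced graph was not.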

amir-zeldes commented 2 days ago

oh wow, whoops! Thanks for that, I'll clean them up upstream