UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Deverbal connectives ("regarding", etc.) #179

Open nschneid opened 3 years ago

nschneid commented 3 years ago

The validator is complaining about a number of tokens that are tagged as VERB but attach as mark. These are for the connectives:

Tagged as VERB and serving as case we find in EWT and GUM:

And ADP/case:

Consulting OntoNotes 5, we additionally find the following acting like prepositions:

Examples:

How to analyze these? They are transparently related to verbs but grammaticized as connectives.

E.g. "regarding" and "provided" do not lend themselves to a relative clause paraphrase, though "concerning" may:

Should these be tagged as ADP/SCONJ rather than VERB, and if so, should the full word serve as the lemma?

I suppose some of these might be analyzed as subjectless VP adjunct constructions ("a rule concerning X": acl with X as the obj of "concerning"), but for many this seems counterintuitive.

amir-zeldes commented 3 years ago

I agree it's a bit ugly, but it's been this way for a long time so I've just kind of accepted it. But before thinking about how I feel about this moving forward, where are you seeing the validation errors? AFAIK GUM is currently valid despite following this practice:

http://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/validation-report.pl

nschneid commented 3 years ago

EWT validation errors: [L3 Syntax rel-upos-mark] 'mark' should not be 'VERB'

I can go through and correct those but first I need to know what the standard is. :)
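To get a list of the affected tokens before deciding anything, something like the following could be used. It is only a rough sketch: it reads plain CoNLL-U text with no external library, the function name `find_verb_connectives` is made up for illustration, and the sample sentence is invented rather than taken from EWT.

```python
def find_verb_connectives(conllu_text, deprels=("mark", "case")):
    """Return (sent_id, token_id, form, deprel) for VERB tokens attached as mark/case."""
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword-token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        tok_id, form, upos, deprel = cols[0], cols[1], cols[3], cols[7]
        if upos == "VERB" and deprel in deprels:
            hits.append((sent_id, tok_id, form, deprel))
    return hits


# Invented example sentence, not actual treebank data.
ROWS = [
    ["1", "Questions", "question", "NOUN", "NNS", "_", "0", "root", "_", "_"],
    ["2", "regarding", "regard", "VERB", "VBG", "_", "4", "case", "_", "_"],
    ["3", "the", "the", "DET", "DT", "_", "4", "det", "_", "_"],
    ["4", "rule", "rule", "NOUN", "NN", "_", "1", "nmod", "_", "_"],
]
SAMPLE = "# sent_id = demo-1\n" + "\n".join("\t".join(r) for r in ROWS) + "\n"

for hit in find_verb_connectives(SAMPLE):
    print(hit)
```

Passing `deprels=("mark",)` would restrict the search to the cases the validator currently flags.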

nschneid commented 3 years ago

And also, should we allow VERB to attach as case? The validator doesn't complain, but that amounts to treating the connective as a preposition, so why not call it ADP?

amir-zeldes commented 3 years ago

Ooh, OK, so I just checked and it looks like GUM has these as VVG + SCONJ for that exact reason. So I guess yes, I would be on board with EWT doing the same :)

nschneid commented 3 years ago

OK then the lemma should be determined based on the UPOS, right? So "regarding", not "regard"?

amir-zeldes commented 3 years ago

I didn't say that... The xpos is still VBG, so from a PTB perspective it would be inconsistent to lemmatize to "regarding".

nschneid commented 3 years ago

Why should we take a PTB perspective in a UD corpus? :) I'm fine with having a verb XPOS but my impression was that morphology/lemma/deprel decisions should be consistent with each other and the XPOS could be a legacy thing.

Unless there are multiple SCONJes based on the same stem that we want to have the same lemma...it seems confusing to say that SCONJ "provided" is an instance of the lexical item "provide".

Some dictionaries with separate entries:

amir-zeldes commented 3 years ago

My concern is with backwards compatibility: for example, creating a diverging lemmatization standard means creating problems for concatenating training data from UD and non-UD corpora (or UD corpora you and I don't maintain) when training lemmatizers, or other tools which use lemmas as a feature.

nschneid commented 3 years ago

Right, and I'm all for backward compatibility when there are two equally valid choices. But we're already breaking backward compatibility in a sense when we tag it as SCONJ rather than VERB. Saying some of our SCONJes have verbal morphology/lemmas seems like it would surprise users relying on the UD standard. (It would surprise me, anyway, unless the SCONJ definition explained that in some languages there are subordinators that bear verbal morphology and receive a VerbForm feature/verb stem lemma.)

Looking at https://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/langspec/specify_feature.pl?lcode=en&feature=VerbForm, I do see a number of languages besides English that allow SCONJ to have a VerbForm feature, though it is certainly not most languages. And I wonder if it is by design from a UD perspective or if it is just to accommodate legacy tagging decisions. Maybe @dan-zeman can enlighten us.

If we wanted to keep XPOS-compatible lemmas as side information I suppose those could go in MISC.
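A minimal sketch of what that could look like on a single token line, assuming the token is given as its 10 CoNLL-U columns. The MISC key `XposLemma` is a hypothetical name chosen for illustration, not an existing UD convention, and `move_lemma_to_misc` is likewise made up.

```python
def move_lemma_to_misc(cols, new_lemma, key="XposLemma"):
    """Set LEMMA to new_lemma and preserve the old lemma as a MISC attribute.

    `cols` is a list of the 10 CoNLL-U columns; `key` is a hypothetical MISC key.
    """
    cols = list(cols)  # work on a copy so the caller's row is untouched
    old_lemma = cols[2]
    if old_lemma == new_lemma:
        return cols
    cols[2] = new_lemma
    attrs = [] if cols[9] == "_" else cols[9].split("|")
    attrs.append(f"{key}={old_lemma}")
    cols[9] = "|".join(attrs)
    return cols


# Invented token line for illustration.
row = ["2", "regarding", "regard", "SCONJ", "VBG", "_", "4", "mark", "_", "_"]
print("\t".join(move_lemma_to_misc(row, "regarding")))
```

A tool that needs the PTB-style lemma could then read it back from MISC instead of the LEMMA column.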

amir-zeldes commented 3 years ago

we're already breaking backward compatibility in a sense when we tag it as SCONJ rather than VERB

No, not really, since there is no pre-UD standard of what the upos for these things should be - upos is a "UD era" thing, so backwards compatibility doesn't enter into it. But I have literally billions of tokens of searchable English data in GU web interfaces, tagged with PTB tags, where the lemma of "regarding" in all uses is "regard", and I'm sure I'm not the only one. So we would certainly be creating a diverging standard if we do this, and students or other users who learn to use one corpus will be surprised when they use another coming from somewhere else. Tools which are no longer re-trained and ship with ready models will sometimes output "old" style lemmatization and surprise users and downstream applications alike.

Saying that the lemma of "regarding" is "regard" is not so shocking for me and I'm used to it, so I can live with it better than considering the potential chaos of having to guess with each corpus I run into what kind of lemmatization (or mix of lemmatization) standards it will have...

nschneid commented 3 years ago

Being conservative about the lemmas might make life easier for working with other English corpora, but it might also make life harder for crosslinguistic analyses.

So I guess I'm asking about the scope of what UD is trying to standardize. Should the decisions about morphology/lemmas be strictly tied to the UPOS? Or can they be more loosely applied to reflect the etymology of a function word, for example? The status quo in GUM would amount to calling it morphologically-a-verb-syntactically-a-subordinator. If it turns out that lots of languages have a good reason to use VerbForm on SCONJ, maybe that's OK—I'd want it to be documented under SCONJ though.

dan-zeman commented 3 years ago

a number of languages besides English that allow SCONJ to have a VerbForm feature, though it is certainly not most languages. And I wonder if it is by design from a UD perspective or if it is just to accommodate legacy tagging decisions.

The language-UPOS-feature matrix was initialized by what was actually found in the treebanks at a given point in time. The reason was that I wanted to reduce the number of errors that must be fixed before the next data freeze. So it is possible to disallow VerbForm for SCONJ in English but initially it was allowed because it occurred at least once in the data. Maybe it wasn't a legacy tagging decision but a pure annotation error. From the (universal-level) UD perspective, any feature is allowed with any UPOS tag.
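To illustrate what "initialized by what was actually found" means, here is a rough sketch that collects, per UPOS tag, the feature names observed in some CoNLL-U text. It is not the validator's actual code; the function name `observed_feature_matrix` and the sample tokens are invented for the example.

```python
from collections import defaultdict


def observed_feature_matrix(conllu_text):
    """Map each UPOS tag to the set of feature names seen with it."""
    matrix = defaultdict(set)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        upos, feats = cols[3], cols[5]
        if feats == "_":
            continue
        for pair in feats.split("|"):
            name, _, _value = pair.partition("=")
            matrix[upos].add(name)
    return dict(matrix)


# Invented tokens, not actual treebank data.
ROWS = [
    ["1", "Provided", "provide", "SCONJ", "VBN", "Tense=Past|VerbForm=Part", "3", "mark", "_", "_"],
    ["2", "you", "you", "PRON", "PRP", "Case=Nom|Person=2", "3", "nsubj", "_", "_"],
    ["3", "agree", "agree", "VERB", "VBP", "Mood=Ind|VerbForm=Fin", "0", "root", "_", "_"],
]
SAMPLE = "\n".join("\t".join(r) for r in ROWS) + "\n"

matrix = observed_feature_matrix(SAMPLE)
for upos in sorted(matrix):
    print(upos, sorted(matrix[upos]))
```

With a matrix built this way, a single token tagged SCONJ with VerbForm is enough for the combination to end up as permitted.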

dan-zeman commented 3 years ago

I do think that students (as well as any other users) should be warned that corpora may differ in the tokenization, lemmatization, etc. that they use, and I do my best to alert students to this.

And while I would not break backward compatibility just for fun, I don't think it has to block evolution forever, and I wouldn't hesitate to break it if I think that a different analysis fits UD principles better. Old-tools-that-are-no-longer-maintained are destined to vanish anyway, sooner or later.

That said, I don't think there is a UD guideline that would require, or at least recommend, that VerbForm not be used with derived SCONJ. (In fact, the guidelines specifically say that VerbForm can be used with non-verbs, but that statement concerns borderline forms such as participles (which could be ADJ instead of VERB), converbs (ADV), and verbal nouns.) Personally I would prefer the lemma "regarding" and no features when it is tagged SCONJ. We have similar cases in Czech but I have not yet investigated whether they are annotated this way.

nschneid commented 2 years ago

From offline discussion, the consensus that emerged was to allow VERB/mark for these connectives. The validator will need to be updated, probably to require lexical lists of VERBs that serve as case or mark. For now, keep verbal features but use SCONJ to satisfy the validator.
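A lexical-list check of that kind might look roughly like this. It is a sketch only: the lemma set below is illustrative (drawn from words mentioned in this thread), not an agreed list, and `check_verb_function_word` is a hypothetical helper, not part of the validator.

```python
# Illustrative lemmas only; the real list would need to be decided on.
DEVERBAL_CONNECTIVES = {"regard", "concern", "provide", "consider"}


def check_verb_function_word(lemma, upos, deprel, allowed=DEVERBAL_CONNECTIVES):
    """Return an error message for an unlisted VERB attached as mark/case, else None."""
    if upos != "VERB" or deprel not in ("mark", "case"):
        return None  # the rule only concerns VERBs in function-word relations
    if lemma in allowed:
        return None  # listed deverbal connective: VERB/mark or VERB/case is accepted
    return f"'{deprel}' should not be 'VERB' (lemma '{lemma}' is not a listed connective)"


print(check_verb_function_word("regard", "VERB", "mark"))
print(check_verb_function_word("run", "VERB", "mark"))
```

Keeping the list as data rather than hard-coding it in the rule would make it easier to extend once there is a principled criterion for membership.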

nschneid commented 2 years ago

Another thing to address here is what CGEL calls the expandable construction—"considering that...", "provided that...", etc. (pp. 971, 982-983). CGEL calls these prepositions licensing that-clauses. In our terms that would be double-mark, with the root of the that-clause attaching as advcl. We already have something similar in EWT/GUM for a few instances of "except that", "given that", "save that". But "considering (that)" currently treats "considering" as an advcl predicate with a complement.

amir-zeldes commented 2 years ago

I agree double mark sounds OK here - prepositions can introduce clauses in general, so this seems fine. As for predicate + complement or mark, that depends somewhat on the context (there are more transparent cases, and we've seen that for clausal 'including' and others as well). As long as we're consistent about which cases are analyzed as clausal, it should be good. For non-clausal 'expandable' cases, +1 for double mark.

nschneid commented 2 years ago

One sentence has double-case "combined with" in EWT. Flagged by the validator because "combined_with" is not in the enhanced deprels list. Changing to ordinary verb analysis for now to avoid the error but we should really come up with a principled way to decide what's in this list of deverbal connectives.