Open nschneid opened 3 years ago
I agree it's a bit ugly, but it's been this way for a long time so I've just kind of accepted it. But before thinking about how I feel about this moving forward, where are you seeing the validation errors? AFAIK GUM is currently valid despite following this practice:
http://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/validation-report.pl
EWT validation errors: [L3 Syntax rel-upos-mark] 'mark' should not be 'VERB'
I can go through and correct those but first I need to know what the standard is. :)
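To get a list of tokens to review, something like the following could be run over a treebank file. This is only a rough sketch and not part of the official validator: the function name `verb_connectives` and the sample sentence are made up for illustration, and it flags both `mark` and `case` attachments even though the validator currently reports only `mark`.

```python
# Illustrative sketch only; the sample sentence and its annotation are invented.
ROWS = [
    "1 I I PRON PRP _ 2 nsubj _ _",
    "2 called call VERB VBD _ 0 root _ _",
    "3 regarding regard VERB VBG _ 5 case _ _",
    "4 the the DET DT _ 5 det _ _",
    "5 order order NOUN NN _ 2 obl _ _",
    "6 . . PUNCT . _ 2 punct _ _",
]
# CoNLL-U columns are tab-separated, so convert the readable rows above.
SAMPLE = "\n".join("\t".join(row.split()) for row in ROWS)


def verb_connectives(conllu_text, deprels=("mark", "case")):
    """Yield (id, form, upos, deprel) for VERB tokens attached as mark/case."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        tok_id, form, upos, deprel = cols[0], cols[1], cols[3], cols[7]
        # Compare only the universal part of the relation (before any subtype).
        if upos == "VERB" and deprel.split(":")[0] in deprels:
            yield tok_id, form, upos, deprel


hits = list(verb_connectives(SAMPLE))
print(hits)
```

Grouping the hits by form would give the candidate list of deverbal connectives to decide on.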
And also, should we allow VERB to attach as case? The validator doesn't complain but it seems like interpreting the connective as a preposition, so why not call it ADP?
Ooh, OK, so I just checked and it looks like GUM has these as VVG + SCONJ for that exact reason. So I guess yes, I would be on board with EWT doing the same :)
OK then the lemma should be determined based on the UPOS, right? So "regarding", not "regard"?
I didn't say that... The xpos is still VBG, so from a PTB perspective it would be inconsistent to lemmatize to "regarding".
Why should we take a PTB perspective in a UD corpus? :) I'm fine with having a verb XPOS but my impression was that morphology/lemma/deprel decisions should be consistent with each other and the XPOS could be a legacy thing.
Unless there are multiple SCONJes based on the same stem that we want to have the same lemma...it seems confusing to say that SCONJ "provided" is an instance of the lexical item "provide".
Some dictionaries with separate entries:
My concern is with backwards compatibility: for example, creating a diverging lemmatization standard means creating problems for concatenating training data from UD and non-UD corpora (or UD corpora you and I don't maintain) when training lemmatizers, or other tools which use lemmas as a feature.
Right, and I'm all for backward compatibility when there are two equally valid choices. But we're already breaking backward compatibility in a sense when we tag it as SCONJ rather than VERB. Saying some of our SCONJes have verbal morphology/lemmas seems like it would surprise users relying on the UD standard. (It would surprise me, anyway, unless the SCONJ definition explained that in some languages there are subordinators that bear verbal morphology and receive a VerbForm feature/verb stem lemma.)
Looking at https://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/langspec/specify_feature.pl?lcode=en&feature=VerbForm, I do see a number of languages besides English that allow SCONJ to have a VerbForm feature, though it is certainly not most languages. And I wonder if it is by design from a UD perspective or if it is just to accommodate legacy tagging decisions. Maybe @dan-zeman can enlighten us.
If we wanted to keep XPOS-compatible lemmas as side information I suppose those could go in MISC.
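As a rough sketch of what that could look like: the snippet below moves the XPOS-compatible lemma into MISC while setting the UPOS-based lemma in the LEMMA column. The attribute name `XposLemma` is purely hypothetical (no such key has been agreed on), and the helper assumes every MISC item has the form Key=Value.

```python
def parse_misc(misc):
    """Parse a MISC column into a dict; assumes every item is Key=Value."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|"))


def format_misc(attrs):
    """Serialize a dict back into a MISC column, or '_' when it is empty."""
    return "|".join(f"{key}={value}" for key, value in sorted(attrs.items())) or "_"


def relemmatize(cols, new_lemma, key="XposLemma"):
    """Return a copy of a 10-column token with the old lemma kept in MISC.

    The key name is a placeholder for illustration, not an agreed convention.
    """
    cols = list(cols)
    misc = parse_misc(cols[9])
    misc[key] = cols[2]   # remember the XPOS-compatible lemma
    cols[2] = new_lemma   # the LEMMA column now follows the UPOS
    cols[9] = format_misc(misc)
    return cols


token = ["3", "regarding", "regard", "SCONJ", "VBG", "_", "5", "mark", "_", "SpaceAfter=No"]
updated = relemmatize(token, "regarding")
print("\t".join(updated))
```

Tools that want the PTB-style lemma could then read it back from MISC without changing what the LEMMA column means.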
we're already breaking backward compatibility in a sense when we tag it as SCONJ rather than VERB
No, not really, since there is no pre-UD standard of what the upos for these things should be - upos is a "UD era" thing, so backwards compatibility doesn't enter into it. But I have literally billions of tokens of searchable English data in GU web interfaces, tagged with PTB tags, where the lemma of "regarding" in all uses is "regard", and I'm sure I'm not the only one. So we would certainly be creating a diverging standard if we do this, and students or other users who learn to use one corpus will be surprised when they use another coming from somewhere else. Tools which are no longer re-trained and ship with ready models will sometimes output "old" style lemmatization and surprise users and downstream applications alike.
Saying that the lemma of "regarding" is "regard" is not so shocking for me and I'm used to it, so I can live with it better than considering the potential chaos of having to guess with each corpus I run into what kind of lemmatization (or mix of lemmatization) standards it will have...
Being conservative about the lemmas might make life easier for working with other English corpora, but it might also make life harder for crosslinguistic analyses.
So I guess I'm asking about the scope of what UD is trying to standardize. Should the decisions about morphology/lemmas be strictly tied to the UPOS? Or can they be more loosely applied to reflect the etymology of a function word, for example? The status quo in GUM would amount to calling it morphologically-a-verb-syntactically-a-subordinator. If it turns out that lots of languages have a good reason to use VerbForm on SCONJ, maybe that's OK; I'd want it to be documented under SCONJ, though.
a number of languages besides English that allow SCONJ to have a VerbForm feature, though it is certainly not most languages. And I wonder if it is by design from a UD perspective or if it is just to accommodate legacy tagging decisions.
The language-UPOS-feature matrix was initialized by what was actually found in the treebanks at a given point in time. The reason was that I wanted to reduce the number of errors that must be fixed before the next data freeze. So it is possible to disallow VerbForm for SCONJ in English but initially it was allowed because it occurred at least once in the data. Maybe it wasn't a legacy tagging decision but a pure annotation error. From the (universal-level) UD perspective, any feature is allowed with any UPOS tag.
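To make the initialization described above concrete, here is a minimal sketch of the idea. It is not the validator's actual code or data format; the function names and the toy token list are assumptions for illustration only.

```python
from collections import defaultdict


def observed_matrix(tokens):
    """Map each UPOS tag to the set of feature names seen with it in the data."""
    matrix = defaultdict(set)
    for upos, feats in tokens:
        if feats != "_":
            for pair in feats.split("|"):
                matrix[upos].add(pair.split("=", 1)[0])
    return matrix


def disallowed(matrix, upos, feats):
    """List the Feature=Value pairs whose feature is not permitted for this UPOS."""
    if feats == "_":
        return []
    permitted = matrix.get(upos, set())
    return [pair for pair in feats.split("|") if pair.split("=", 1)[0] not in permitted]


# Toy (UPOS, FEATS) pairs standing in for a treebank snapshot.
snapshot = [
    ("VERB", "VerbForm=Ger"),
    ("SCONJ", "_"),
    ("SCONJ", "VerbForm=Ger"),  # a single occurrence is enough to permit it
    ("ADP", "_"),
]
allowed = observed_matrix(snapshot)
before = disallowed(allowed, "SCONJ", "VerbForm=Ger")

# A language maintainer can later tighten the matrix by hand.
allowed["SCONJ"].discard("VerbForm")
after = disallowed(allowed, "SCONJ", "VerbForm=Ger")
print(before, after)
```

The point of the sketch is that one stray occurrence, whether a deliberate choice or an annotation error, is enough to make the combination permitted until someone removes it.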
I do think that students (as well as any other users) should be warned that corpora may differ in tokenization, lemmatization etc. they use, and I do my best to make the students alert about this.
And while I would not break backward compatibility just for fun, I don't think it has to block evolution forever, and I wouldn't hesitate to break it if I think that a different analysis fits UD principles better. Old-tools-that-are-no-longer-maintained are destined to vanish anyway, sooner or later.
That said, I don't think there is a UD guideline that would order or at least recommend that VerbForm not be used with derived SCONJ. (In fact, the guidelines specifically say that VerbForm can be used with non-verbs, but that statement involves borderline forms such as participles (could be ADJ instead of VERB), converbs (ADV), and verbal nouns.) Personally I would prefer lemma "regarding" and no features when it is tagged SCONJ. We have similar cases in Czech but I have not yet investigated whether they are annotated this way.
From offline discussion, the consensus that emerged was to allow VERB/mark for these connectives. The validator will need to be updated, probably to require lexical lists of VERBs that serve as case or mark. For now, keep verbal features but use SCONJ to satisfy the validator.
Another thing to address here is what CGEL calls the expandable construction: "considering that...", "provided that...", etc. (pp. 971, 982-983). CGEL calls these prepositions licensing that-clauses. In our terms that would be double-mark, with the root of the that-clause attaching as advcl. We already have something similar in EWT/GUM for a few instances of "except that", "given that", "save that". But "considering (that)" currently treats "considering" as an advcl predicate with a complement.
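For concreteness, a double-mark analysis of "provided that" might look like the sketch below. The sentence and its annotation are invented for illustration, not taken from EWT or GUM; features are omitted on all tokens except "provided", and the SCONJ tag with verbal features and the lemma "provide" are assumptions following the interim practice mentioned above.

```python
from collections import defaultdict

FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

# Invented example: "I will come provided that you call ."
ROWS = [
    "1 I I PRON PRP _ 3 nsubj _ _",
    "2 will will AUX MD _ 3 aux _ _",
    "3 come come VERB VB _ 0 root _ _",
    "4 provided provide SCONJ VBN Tense=Past|VerbForm=Part 7 mark _ _",
    "5 that that SCONJ IN _ 7 mark _ _",
    "6 you you PRON PRP _ 7 nsubj _ _",
    "7 call call VERB VBP _ 3 advcl _ _",
    "8 . . PUNCT . _ 3 punct _ _",
]
tokens = [dict(zip(FIELDS, row.split())) for row in ROWS]

# Group mark dependents by head to find clauses introduced by two markers.
marks = defaultdict(list)
for tok in tokens:
    if tok["deprel"] == "mark":
        marks[tok["head"]].append(tok["form"])
double = {head: forms for head, forms in marks.items() if len(forms) > 1}

# The clause head that carries both markers attaches to the main predicate as advcl.
clause_head = next(tok for tok in tokens if tok["id"] == "7")
print(double, clause_head["deprel"])
```

The same grouping could be used to search a treebank for existing double-mark instances such as "except that" or "given that".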
I agree double mark sounds OK here - prepositions can introduce clauses in general, so this seems fine. As for predicate + complement or mark, that depends somewhat on the context (there are more transparent cases, and we've seen that for clausal 'including' and others as well). As long as we're ideally consistent on which cases are analyzed as clausal it should be good. For non-clausal 'expandable' cases, +1 for double mark.
One sentence has double-case "combined with" in EWT. Flagged by the validator because "combined_with" is not in the enhanced deprels list. Changing to ordinary verb analysis for now to avoid the error but we should really come up with a principled way to decide what's in this list of deverbal connectives.
The validator is complaining about a number of tokens that are tagged as VERB but attach as mark. These are for the connectives:
Tagged as VERB and serving as case we find in EWT and GUM:
And ADP/case:
Consulting OntoNotes 5, we additionally find the following acting like prepositions:
Examples:
How to analyze these? They are transparently related to verbs but grammaticized as connectives.
E.g. "regarding" and "provided" do not lend themselves to a relative clause paraphrase, though "concerning" may:
Should these be tagged as ADP/SCONJ rather than VERB, and if so, should the full word serve as the lemma?
I suppose some of these might be analyzed as subjectless VP adjunct constructions ("a rule concerning X": acl with X as the obj of "concerning"), but for many this seems counterintuitive.