nschneid closed 2 years ago
I think all of these should be handled as cases of promotion with the appropriate deprel.
What is the "GUM validator" and where is it?
https://github.com/amir-zeldes/gum/blob/dev/_build/utils/validate.py - for the UD_English-GUM corpus
my port here: https://github.com/UniversalDependencies/UD_English-EWT/blob/dev/not-to-release/tools/neaten.py
That file is actually only a part of the larger GUM build bot, which includes all sorts of different validations, but most of them are in that file. The build bot is explained in some more human readable prose here:
https://gucorpling.org/gum/build.html
Some context: the reason for all this is that GUM is not built natively in CoNLL-U, but is actually created using a number of annotation interfaces, since it contains NNER, entity linking, coreference resolution, RST discourse parses and more, so a set of scripts ensures that all of the data is valid, matches across formats and, where possible, makes semantic sense across annotations. The CoNLL-U you see in the UD repository has most of these annotations merged in, and is generated by the build bot.
@amir-zeldes Any word that can be root can also in principle be ccomp if embedded in a speech predicate. In flag_dep_warnings(), should "ccomp" therefore be added as an option for the "CC", "RP", and "not" tests?
Note that EWT doesn't have sentence type information so any s_type checks are being skipped.
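To make the question concrete, here is a minimal sketch of the kind of test being discussed. It is not the actual flag_dep_warnings() code from validate.py: the function name flag_token, the USUAL_DEPRELS table and the deprels listed in it are all assumptions for illustration. The only point is that "root" and "ccomp" would be tolerated through the same list of extra options.

```python
# Hypothetical sketch, NOT the real GUM validator logic.
# The "usual" deprels below are illustrative assumptions only.
USUAL_DEPRELS = {
    "CC": {"cc", "cc:preconj"},
    "RP": {"compound:prt"},
    "not": {"advmod"},
}


def flag_token(lemma, xpos, deprel, extra_allowed=frozenset({"root"})):
    """Return a list of warning strings for one token.

    extra_allowed holds the deprels tolerated in addition to the usual
    ones; passing {"root", "ccomp"} models the change proposed here.
    """
    if xpos == "CC":
        key = "CC"
    elif xpos == "RP":
        key = "RP"
    elif lemma == "not":
        key = "not"
    else:
        return []  # token is not covered by any of the three tests

    allowed = USUAL_DEPRELS[key] | set(extra_allowed)
    if deprel in allowed:
        return []
    return [f"{key} token '{lemma}' has unexpected deprel '{deprel}'"]
```

With the default, flag_token("and", "CC", "ccomp") produces a warning; with extra_allowed={"root", "ccomp"} it does not.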
Are you saying that if someone says:
Then it's legitimate to have CC+ccomp? I'm not sure... I mean, it could be seen as ccomp, or maybe as obj.
But more generally, the GUM validator has mainly been written as a tool to ensure that GUM is clean, so in many places we haven't been thinking about what is conceivable; instead, we allow things once they have triggered a warning and turned out to be OK. In other words, it seems pretty unlikely (even if possible) that CC would be ccomp, so until that appears and triggers a warning, I'm happy enough not allowing it, because I might catch some errors before the first legitimate case occurs.
In that sense this is a little different from the UD validator, which has the objective of being authoritative, and therefore has to allow more or less anything conceivable. But I take your point that it's not impossible, that's certainly true (and I would change it if I had a legitimate case in the corpus).
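One way to check whether such a legitimate case already exists is to scan the CoNLL-U files for the combination in question. The sketch below assumes standard 10-column CoNLL-U input with `# sent_id = ...` comments; the function name find_cases is hypothetical and is not part of the GUM build bot or of neaten.py.

```python
def find_cases(conllu_text, deprel, xpos=None, lemma=None):
    """Return (sent_id, token_id, form) for tokens with the given deprel,
    optionally restricted to an XPOS tag and/or a lemma."""
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id ="):
            sent_id = line.split("=", 1)[1].strip()
            continue
        if not line.strip() or line.startswith("#"):
            continue  # blank separator line or other comment
        cols = line.split("\t")
        # Skip malformed lines, multiword ranges (1-2) and empty nodes (1.1)
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[7] != deprel:
            continue
        if xpos is not None and cols[4] != xpos:
            continue
        if lemma is not None and cols[2] != lemma:
            continue
        hits.append((sent_id, cols[0], cols[1]))
    return hits
```

For example, find_cases(text, "ccomp", xpos="CC") would list any CC token attached as ccomp, and find_cases(text, "root", lemma="not") any "not" serving as root.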
Why is CC allowed to be root? Apparently for fragmentary one-word utterances. I suppose any of those could be quoted (She started, "And—" but was cut off), which would require ccomp.
More urgent is "not", which is needed for "I say why not?".
I have no idea why RP is allowed to be root. Maybe that was a copy-paste error.
Why is CC allowed to be root? Apparently for fragmentary one-word utterances.
Yes, exactly. There are 4 cases of "and" by itself so we needed to allow that. One of them is actually a question and not an interrupted fragment:
I suppose any of those could be quoted
Yes, if that happens the buildbot would need to allow it. But see my comment above: we only allow unusual things like that after they occur for the first time, since it helps us to catch errors.
I have no idea why RP is allowed
Me either - at least currently it doesn't appear (but maybe it's a case where the tag was eventually changed). Could be copy-paste.
Note that EWT doesn't have sentence type information so any s_type checks are being skipped.
You can actually add those if you like, either using DepEdit rules (not 100% accurate but decent):
https://github.com/amir-zeldes/DepEdit/blob/master/examples/eng_sent_type.ini
Or using a stochastic classifier that ships with Amalgum:
https://github.com/gucorpling/amalgum/blob/master/nlp_modules/s_typer.py
We need a policy for where "not" is not a premodifier:
All of these are errors according to the GUM validator.
("Not to mention" and "not only" are in #350)