"not" idioms - Githubissues

nschneid commented 2 years ago

We need a policy for where "not" is not a premodifier:

[x] more often/likely than not
[x] Why not?
[x] I'm afraid not.
[x] "those affluent and those not"

All of these are errors according to the GUM validator.

("Not to mention" and "not only" are in #350)

amir-zeldes commented 2 years ago

I think all of these should be handled as cases of promotion with the appropriate deprel.

arademaker commented 2 years ago

What is the "GUM validator" and where is it?

nschneid commented 2 years ago

https://github.com/amir-zeldes/gum/blob/dev/_build/utils/validate.py - for the UD_English-GUM corpus

my port here: https://github.com/UniversalDependencies/UD_English-EWT/blob/dev/not-to-release/tools/neaten.py

amir-zeldes commented 2 years ago

That file is actually only one part of the larger GUM build bot, which includes all sorts of different validations, though most of them are in that file. The build bot is explained in somewhat more human-readable prose here:

https://gucorpling.org/gum/build.html

Some context: the reason for all this is that GUM is not built natively in CoNLL-U. It is created using a number of annotation interfaces, since it contains NNER, entity linking, coreference resolution, RST discourse parses and more. A set of scripts therefore ensures that all of the data is valid, matches across formats and, where possible, makes semantic sense across annotations. The CoNLL-U you see in the UD repository has most of these annotations merged in, and is generated by the build bot.

nschneid commented 2 years ago

@amir-zeldes Any word that can be root can also in principle be ccomp if embedded in a speech predicate. In flag_dep_warnings(), should "ccomp" therefore be added as an option for the "CC", "RP", and "not" tests?

Note that EWT doesn't have sentence type information so any s_type checks are being skipped.
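For readers following along, here is a minimal sketch of the kind of check being discussed. It is not the actual code in validate.py or neaten.py: the function name `check_token`, the allowed-relation sets, the `allow_ccomp` switch, and the `CC_ROOT_S_TYPES` rule are all assumptions made up for illustration. It shows two things from this comment: adding "ccomp" wherever "root" is allowed, and skipping any s_type-dependent check when no sentence type is available (as in EWT).

```python
from typing import Dict, List, Optional, Set

# Hypothetical allow-lists, NOT copied from the GUM validator.
# "root" is listed for all three because the thread says all three tests allow it.
ALLOWED: Dict[str, Set[str]] = {
    "CC": {"cc", "cc:preconj", "root"},
    "RP": {"compound:prt", "root"},
    "not": {"advmod", "root"},
}

# Hypothetical rule: sentence types in which a CC as root is expected.
CC_ROOT_S_TYPES: Set[str] = {"frag", "q"}


def check_token(
    lemma: str,
    xpos: str,
    deprel: str,
    s_type: Optional[str] = None,
    allow_ccomp: bool = False,
) -> List[str]:
    """Return warnings for one token; an empty list means no complaint."""
    # The "not" test is keyed on the lemma, the other two on the XPOS tag.
    key = "not" if lemma.lower() == "not" else xpos
    if key not in ALLOWED:
        return []

    allowed = set(ALLOWED[key])
    # Anything that may be root may also be ccomp when quoted under a speech verb.
    if allow_ccomp and "root" in allowed:
        allowed.add("ccomp")

    warnings: List[str] = []
    if deprel not in allowed:
        warnings.append(f"{key} should not be {deprel}")

    # Checks that depend on sentence type are skipped when s_type is unknown.
    if (
        s_type is not None
        and key == "CC"
        and deprel == "root"
        and s_type not in CC_ROOT_S_TYPES
    ):
        warnings.append(f"CC as root is unexpected in s_type {s_type}")
    return warnings
```

With `allow_ccomp=False`, the "not" in I say why not? would be flagged if attached as ccomp; with `allow_ccomp=True` it passes.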

amir-zeldes commented 2 years ago

Are you saying that if someone says:

Kim said "and"

Then it's legitimate to have CC+ccomp? I'm not sure... I mean, it could be seen as ccomp, or maybe as obj.

But more generally, the GUM validator has mainly been written as a tool to ensure that GUM is clean, so in many places we haven't been thinking about what is conceivable; instead we have allowed things once they triggered a warning and turned out to be OK. In other words, it seems pretty unlikely (even if possible) that CC would be ccomp, so until that appears and triggers a warning, I'm happy enough not allowing it, because I might catch some errors until the first legitimate case occurs.

In that sense this is a little different from the UD validator, which has the objective of being authoritative, and therefore has to allow more or less anything conceivable. But I take your point that it's not impossible, that's certainly true (and I would change it if I had a legitimate case in the corpus).
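The "allow it only once it has legitimately occurred" policy can be sketched roughly as follows. This is an illustration under stated assumptions, not how the GUM build bot is implemented: the function `unattested_pairs`, the helper `row`, and the `KNOWN_OK` set are hypothetical names. The idea is simply to report every (XPOS, deprel) combination that has not yet been reviewed and accepted, so that a human can decide whether it is an error or the first legitimate case.

```python
from collections import Counter
from typing import Set, Tuple

Pair = Tuple[str, str]


def unattested_pairs(conllu: str, known_ok: Set[Pair]) -> Counter:
    """Count (XPOS, deprel) pairs in CoNLL-U text that are not in known_ok."""
    counts: Counter = Counter()
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # blank separator or comment line
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword token ranges and empty nodes
        pair = (cols[4], cols[7])  # XPOS is column 5, DEPREL is column 8
        if pair not in known_ok:
            counts[pair] += 1
    return counts


def row(idx: int, form: str, xpos: str, head: int, deprel: str) -> str:
    """Hypothetical helper: build one CoNLL-U token line with unused columns as '_'."""
    return "\t".join(
        [str(idx), form, "_", "_", xpos, "_", str(head), deprel, "_", "_"]
    )


# Combinations already reviewed and accepted (illustrative only).
KNOWN_OK: Set[Pair] = {
    ("NNP", "nsubj"),
    ("VBD", "root"),
    ("``", "punct"),
    ("''", "punct"),
    ("CC", "cc"),
    ("CC", "root"),
}

# Kim said "and", with the quoted conjunction attached as ccomp.
SAMPLE = "\n".join(
    [
        "# text = Kim said \"and\"",
        row(1, "Kim", "NNP", 2, "nsubj"),
        row(2, "said", "VBD", 0, "root"),
        row(3, '"', "``", 4, "punct"),
        row(4, "and", "CC", 2, "ccomp"),
        row(5, '"', "''", 4, "punct"),
    ]
)

report = unattested_pairs(SAMPLE, KNOWN_OK)
```

Here `report` contains only the CC+ccomp combination; if an annotator judges it legitimate, the pair is added to `KNOWN_OK` and the warning goes away.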

nschneid commented 2 years ago

Why is CC allowed to be root? Apparently for fragmentary one-word utterances. I suppose any of those could be quoted (She started, "And—" but was cut off), which would require ccomp.

More urgent is "not", which is needed for "I say why not?".

I have no idea why RP is allowed to be root. Maybe that was a copy-paste error.

amir-zeldes commented 2 years ago

> Why is CC allowed to be root? Apparently for fragmentary one-word utterances.

Yes, exactly. There are 4 cases of "and" by itself so we needed to allow that. One of them is actually a question and not an interrupted fragment:

And?

> I suppose any of those could be quoted

Yes, if that happens the buildbot would need to allow it. But see my comment above: we only allow unusual things like that after they occur for the first time, since it helps us to catch errors.

> I have no idea why RP is allowed

Me either - at least currently it doesn't appear (but maybe it's a case where the tag was eventually changed). Could be copy-paste.

> Note that EWT doesn't have sentence type information so any s_type checks are being skipped.

You can actually add those if you like, either using DepEdit rules (not 100% accurate but decent):

https://github.com/amir-zeldes/DepEdit/blob/master/examples/eng_sent_type.ini

Or using a stochastic classifier that ships with Amalgum:

https://github.com/gucorpling/amalgum/blob/master/nlp_modules/s_typer.py

UniversalDependencies / UD_English-EWT

"not" idioms #352