UniversalDependencies / UD_English-GUM

Conjunct wrongly attached to prepositions #41

Open martinpopel opened 2 years ago

martinpopel commented 2 years ago

There are 94 sentences in GUM found by the following Udapi query:

cat *.conllu | udapy -TM util.Mark node='node.deprel=="conj" and node.parent.deprel in ("mark","case") and node.is_nonprojective()' | less -R

The conjunct depends non-projectively on a preposition (or subordinating conjunction), such as in the following example:

# sent_id = GUM_voyage_athens-2
...
   │ ╭──────────────────────┮ of ADP case
   │ ┢─╼ Classical ADJ amod │
   ┡─┶ Greece PROPN nmod    │
   │                        │ ╭─╼ , PUNCT punct
   │                        │ ┢─╼ and CCONJ cc
   │                        │ ┢─╼ therefore ADV advmod
   │                        │ ┢─╼ of ADP case
   │                        │ ┢─╼ Western ADJ amod
   │                        ╰─┶ civilization NOUN conj

It seems that in all these cases the same preposition is repeated ("of X and of Y"). The restriction to non-projective constructions is needed to filter out phrases such as "before and after", which are parsed correctly.

These cases can be fixed automatically (after confirming my expectation that all are errors of this type) using:

udapy -s util.Eval node='if node.deprel=="conj" and node.parent.deprel in ("mark","case") and node.is_nonprojective(): node.parent = node.parent.parent' < old.conllu > fixed.conllu
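To make the effect of that one-liner concrete, here is a minimal sketch in plain Python that replays it on the GUM_voyage_athens-2 fragment above. It does not use Udapi: the `Node` class, `is_descendant`, `is_nonprojective`, `reattach_conjuncts` and `build_example` are hypothetical stand-ins written for this illustration. The word positions are local to the fragment, and the head of "Greece" is left unset because the rest of the sentence is elided in the issue. The non-projectivity test assumes the usual definition: some word between the node and its parent is not dominated by that parent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a dependency-tree node (not the Udapi class)."""
    ord: int                      # word-order position within the fragment
    form: str
    upos: str
    deprel: str
    parent: Optional["Node"] = None


def is_descendant(node: Node, ancestor: Node) -> bool:
    """True if `ancestor` is reachable from `node` by following parent links."""
    cur = node.parent
    while cur is not None:
        if cur is ancestor:
            return True
        cur = cur.parent
    return False


def is_nonprojective(node: Node, nodes: List[Node]) -> bool:
    """True if some word between `node` and its parent is not dominated by the parent."""
    if node.parent is None:
        return False
    lo, hi = sorted((node.ord, node.parent.ord))
    return any(
        lo < n.ord < hi and not is_descendant(n, node.parent)
        for n in nodes
    )


def reattach_conjuncts(nodes: List[Node]) -> List[Node]:
    """Move each matching conjunct from the preposition up to the preposition's head."""
    changed = []
    for node in nodes:
        p = node.parent
        if (node.deprel == "conj" and p is not None
                and p.deprel in ("mark", "case")
                and p.parent is not None  # extra guard needed only in this sketch
                and is_nonprojective(node, nodes)):
            node.parent = p.parent
            changed.append(node)
    return changed


def build_example() -> List[Node]:
    """'of Classical Greece, and therefore of Western civilization' as annotated in the issue."""
    greece = Node(3, "Greece", "PROPN", "nmod")  # its own head is elided in the issue
    of1 = Node(1, "of", "ADP", "case", greece)
    classical = Node(2, "Classical", "ADJ", "amod", greece)
    civ = Node(9, "civilization", "NOUN", "conj", of1)  # the wrong attachment
    dependents = [
        Node(4, ",", "PUNCT", "punct", civ),
        Node(5, "and", "CCONJ", "cc", civ),
        Node(6, "therefore", "ADV", "advmod", civ),
        Node(7, "of", "ADP", "case", civ),
        Node(8, "Western", "ADJ", "amod", civ),
    ]
    return sorted([greece, of1, classical, civ, *dependents], key=lambda n: n.ord)


if __name__ == "__main__":
    nodes = build_example()
    for node in reattach_conjuncts(nodes):
        print(f"{node.form} is now attached to {node.parent.form}")
```

After the reattachment, "civilization" depends on "Greece" and the edge is projective, since every word between the two is a dependent of "civilization" itself.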

nschneid commented 2 years ago

Not exactly the same query but I think it is similar: coordination between an ADP and a non-ADP

martinpopel commented 2 years ago

Yes, the Grew-match query with the pattern { X-[conj]->Y; X[upos=ADP]; Y[upos<>ADP] } is similar. It misses one case of "due/ADJ/case to/ADP/fixed" and it includes several projective cases like "at, or close/ADV/conj to", which are parsed correctly.

That said, now I see my query should also include the node.upos!="ADP" condition, so the result does not include "with medicine or without", which is parsed correctly, despite being non-projective.
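As a rough sketch of the refined condition, the predicate below combines the original three tests with the extra `upos != "ADP"` check. It is plain Python rather than a Udapi call: the function name `is_error_candidate` and its flat arguments are hypothetical, and the feature values in the example calls are assumptions read off the descriptions in this thread, not values taken from the treebank.

```python
def is_error_candidate(deprel: str, upos: str, parent_deprel: str,
                       nonprojective: bool) -> bool:
    """Refined query: a non-ADP conjunct hanging non-projectively on a mark/case node."""
    return (
        deprel == "conj"
        and upos != "ADP"                      # the newly added condition
        and parent_deprel in ("mark", "case")
        and nonprojective
    )


# "of Classical Greece, and therefore of Western civilization":
# "civilization" is a NOUN conjunct attached non-projectively to "of".
flagged = is_error_candidate("conj", "NOUN", "case", True)

# "with medicine or without": "without" is itself an ADP, so the
# non-projective attachment is correct and should not be flagged.
kept_adp = is_error_candidate("conj", "ADP", "case", True)

# "before and after": projective, so it was already excluded by the original query.
kept_projective = is_error_candidate("conj", "ADP", "case", False)

print(flagged, kept_adp, kept_projective)
```

In the Udapi expressions above, this would presumably mean adding `and node.upos!="ADP"` to the existing condition.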

amir-zeldes commented 2 years ago

Thanks for catching this - by pure coincidence I ran into the same issue while doing consistency checks for the upcoming GUM8 data, so it will be repaired soon! I'm pretty sure this is an artefact of an uncaught conversion error from when GUM switched from SD to UD about 4 years ago, so it probably only occurs in the older part of the corpus.