clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

DK: usage of unknown dependency obl:loc #737

Closed TomazErjavec closed 1 year ago

TomazErjavec commented 1 year ago

In preparation for 3.1 we are now using the common UD-SYN taxonomy. But using it with the DK corpus gives many errors like:

ERROR ParlaMint-DK_2017-05-18-20161-M99.ana: ERROR: Can't find local id for link/@ana="ud-syn:obl_loc"
...
ERROR ParlaMint-DK_2014-10-07-20141-M1.ana: ERROR: Can't find local id for link/@ana="ud-syn:obl_loc"
...
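
To see how widespread the problem is, messages like the two above can be tallied per @ana value. This is a minimal Python sketch, assuming every relevant log line has exactly the shape quoted above; `tally_unknown_ana` is a hypothetical helper, not part of the ParlaMint scripts:

```python
import re
from collections import Counter

# Pattern modelled on the two log lines quoted in this issue (an assumption);
# any line with a different shape is simply ignored.
UNKNOWN_ANA = re.compile(
    r'^ERROR (?P<file>\S+): ERROR: Can\'t find local id for link/@ana="(?P<ana>[^"]+)"'
)

def tally_unknown_ana(log_lines):
    """Count how often each unresolved link/@ana value is reported."""
    counts = Counter()
    for line in log_lines:
        match = UNKNOWN_ANA.match(line.strip())
        if match:
            counts[match.group("ana")] += 1
    return counts

sample = [
    'ERROR ParlaMint-DK_2017-05-18-20161-M99.ana: ERROR: Can\'t find local id for link/@ana="ud-syn:obl_loc"',
    "...",
    'ERROR ParlaMint-DK_2014-10-07-20141-M1.ana: ERROR: Can\'t find local id for link/@ana="ud-syn:obl_loc"',
]
print(tally_unknown_ana(sample))
```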

obl:loc does not seem to be a legal UD syntactic relation, so what should we do with it now? We probably can't leave it as it is, as it doesn't mean anything. Change it to simple obl? Ideas welcome!

matyaskopp commented 1 year ago

For Danish, an old 2.5 model is used. This relation was used in Universal Dependencies 2.5 – Danish – DDT: http://hdl.handle.net/11346/PMLTQ-0RGX (612 occurrences). But the taxonomy corresponds to the current version of UD (the current documentation, to be precise).

note for @matyaskopp: this generates a query with all ids with obl:loc that can be run in different versions of UD to see changes: http://hdl.handle.net/11346/PMLTQ-COKG

In 2.12 they have been replaced with: http://hdl.handle.net/11346/PMLTQ-8FW4

relation      occurrences
obl:lmod      48
obl           1
case          3
advmod:lmod   560

So, the question is whether we want to support old undocumented language-specific relations. It appears in some old statistics: https://github.com/UniversalDependencies/docs/blob/97694404898cc696842234a1ebabb888c448f09b/_includes/stats/da/dep/obl-loc.md But in fact it has never been documented. Danish has never had any documented language-specific relations in the whole history of the docs repository:

git clone git@github.com:UniversalDependencies/docs.git Scripts/UD-docs
git -C Scripts/UD-docs checkout pages-source
git -C Scripts/UD-docs log --all --full-history -- "_da/dep/*"

Returns an empty result.

TomazErjavec commented 1 year ago

Wow, a very detailed analysis!

So, the question is whether we want to support old undocumented language-specific relations.

I would say not.

Do I understand correctly that the most sensible substitution would be obl:lmod? Currently I have changed it to obl.

matyaskopp commented 1 year ago

The most sensible substitution is advmod:lmod if ADV, obl:lmod otherwise. (http://hdl.handle.net/11346/PMLTQ-ORIW)

relation      pos    occurrences
advmod:lmod   ADV    560
case          ADP    3
obl           NOUN   1
obl:lmod      NOUN   27
obl:lmod      ADP    15
obl:lmod      VERB   4
obl:lmod      ADJ    1
obl:lmod      X      1
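
The rule above can be sketched as a small function. This is an illustration in Python only, assuming the UPOS of the dependent token is available where the substitution happens; `replace_obl_loc` is a hypothetical name and this is not the code used in the ParlaMint XSLT:

```python
def replace_obl_loc(deprel: str, upos: str) -> str:
    """Map the undocumented obl:loc to a documented relation.

    Adverbs get advmod:lmod, everything else gets obl:lmod;
    all other relations are returned unchanged.
    """
    if deprel != "obl:loc":
        return deprel
    return "advmod:lmod" if upos == "ADV" else "obl:lmod"

print(replace_obl_loc("obl:loc", "ADV"))   # advmod:lmod
print(replace_obl_loc("obl:loc", "NOUN"))  # obl:lmod
print(replace_obl_loc("nsubj", "NOUN"))    # nsubj
```

Going by the counts in the table, this rule would not reproduce the 3 case (ADP) and 1 obl (NOUN) instances, so it matches the UD 2.12 annotation for 608 of the 612 occurrences.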

TomazErjavec commented 1 year ago

The most sensible substitution is advmod:lmod if ADV, obl:lmod otherwise

Hm, maybe most correct, not sure about sensible, because the code now does not have access to the PoS of the word: https://github.com/clarin-eric/ParlaMint/blob/d02bd049213da4a3d1e50bea07df01215883fc0b/Scripts/parlamint2release.xsl#L562-L573.

Trying to implement a PoS-dependent substitution would be difficult. I would just set it to advmod:lmod, as this seems to mean only about 10% of errors, which is about on par with the error rate parsers make anyway...
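
The "about 10%" estimate can be checked against the counts posted earlier in this thread. The sketch below is a rough calculation only: it treats the UD 2.12 relabelling of the 612 treebank occurrences as the reference and assumes the ParlaMint-DK distribution is similar, which the thread does not confirm. The names `UD_2_12` and `mismatch_share` are hypothetical.

```python
# (UD 2.12 relation, UPOS) -> occurrences, copied from the table in this thread.
UD_2_12 = {
    ("advmod:lmod", "ADV"): 560,
    ("case", "ADP"): 3,
    ("obl", "NOUN"): 1,
    ("obl:lmod", "NOUN"): 27,
    ("obl:lmod", "ADP"): 15,
    ("obl:lmod", "VERB"): 4,
    ("obl:lmod", "ADJ"): 1,
    ("obl:lmod", "X"): 1,
}

def mismatch_share(choose):
    """Return (mismatches, total, share) for a strategy mapping UPOS to a relation."""
    total = sum(UD_2_12.values())
    wrong = sum(n for (rel, pos), n in UD_2_12.items() if choose(pos) != rel)
    return wrong, total, wrong / total

strategies = {
    "always advmod:lmod": lambda pos: "advmod:lmod",
    "always obl": lambda pos: "obl",
    "PoS-dependent": lambda pos: "advmod:lmod" if pos == "ADV" else "obl:lmod",
}
for name, choose in strategies.items():
    wrong, total, share = mismatch_share(choose)
    print(f"{name}: {wrong}/{total} = {share:.1%}")
```

Under these assumptions a blanket advmod:lmod disagrees with UD 2.12 on 52 of 612 occurrences (8.5%), a blanket obl on 611 (99.8%), and the PoS-dependent rule on 4 (0.7%).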

matyaskopp commented 1 year ago

Trying to implement a PoS-dependent substitution would be difficult. I would just set it to advmod:lmod, as this seems to mean only about 10% of errors, which is about on par with the error rate parsers make anyway...

ok, but it will probably produce an L2 validation error - I think advmod should be related to ADV
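
The concern can be checked before running the full validation. Below is a minimal sketch, assuming tab-separated CoNLL-U input; it only flags advmod (with or without a subtype) on tokens whose UPOS is not ADV and is a simplification, not the official UD validator, which has its own rules and exceptions. The sample tokens are made up for illustration.

```python
def advmod_upos_mismatches(conllu_text):
    """List (id, form, upos, deprel) for advmod relations on non-ADV tokens."""
    hits = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        token_id, form, upos, deprel = cols[0], cols[1], cols[3], cols[7]
        if deprel.split(":")[0] == "advmod" and upos != "ADV":
            hits.append((token_id, form, upos, deprel))
    return hits

# Made-up sample: token 3 is a NOUN that received the blanket advmod:lmod.
sample = "\n".join([
    "# text = (made-up example)",
    "\t".join(["1", "hjemme", "_", "ADV", "_", "_", "2", "advmod:lmod", "_", "_"]),
    "\t".join(["2", "bor", "_", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "byen", "_", "NOUN", "_", "_", "2", "advmod:lmod", "_", "_"]),
])
print(advmod_upos_mismatches(sample))
```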

TomazErjavec commented 1 year ago

ok, but it will probably produce an L2 validation error - I think advmod should be related to ADV

Ah. But given that we are patching things, might as well have some errors...

TomazErjavec commented 1 year ago

Surprisingly, no CoNLL-U errors were produced. So, closing.