UniversalDependencies / UD_Danish-DDT

Creative Commons Attribution Share Alike 4.0 International

Inconsistent (?) lemma for `enkelt` and `den/det/de` #10

Open AngledLuffa opened 10 months ago

AngledLuffa commented 10 months ago

Looking for ambiguous lemmas, I came across enkelt, but I am wondering if the ambiguity is just a typo. There are 14 examples of enkelt in the training data, 8 of which have enkelt as the lemma and 6 of which have enkel.

There's also de, which is lemmatized as de 241 times and den 20 times. That occurs often enough that I wonder if it's an actual feature of the language.

Another couple examples:

stort ADV Counter({'stor': 5, 'stort': 2})
lige ADJ Counter({'lige': 2, 'lig': 1})
steg VERB Counter({'stige': 8, 'stege': 1})

And then det is sometimes tagged DET, sometimes PRON, and it seems to have two different lemmas based on the tag. It is not 100% consistent, though:

det PRON Counter({'det': 678, 'den': 3})
det DET Counter({'den': 333, 'det': 4})
Det PRON Counter({'det': 302, 'den': 2})
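
For readers who want to reproduce counts like these, here is a minimal sketch, assuming the treebank is in standard 10-column CoNLL-U. The function names `lemma_counts` and `ambiguous` are made up for illustration and are not from any UD tool, and this is not necessarily how the numbers above were produced.

```python
from collections import Counter, defaultdict

def lemma_counts(conllu_lines):
    """Map (form, upos) -> Counter of lemmas, given an iterable of CoNLL-U lines."""
    counts = defaultdict(Counter)
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence boundary or comment line
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        if "-" in tok_id or "." in tok_id:
            continue  # skip multiword token ranges and empty nodes
        counts[(form, upos)][lemma] += 1
    return counts

def ambiguous(counts):
    """Keep only the (form, upos) pairs that occur with more than one lemma."""
    return {key: c for key, c in counts.items() if len(c) > 1}
```

Any open CoNLL-U file can be passed directly as `conllu_lines`. Forms are kept case-sensitive here, which matches the separate det and Det rows above.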
KennethEnevoldsen commented 10 months ago

So from my understanding there seem to be (at least) two use cases for enkelt:

As an Adjective Describing a Concept: here "enkelt" functions as an adjective directly modifying a noun (problem, state). It conveys that something is simple, straightforward, or uncomplicated. It is lemmatized as "enkel", reflecting its adjectival role.

As a Nominal Adjective Referring to an Individual: here "enkelt" still functions as an adjective, but it is used nominally to refer to a single individual or entity (like saying "a single one" or "one person"). It is more about identifying a specific individual from a group. In this usage the lemma is "enkelt", which aligns with its role as a singular, indefinite, noun-like adjective.

"de" seems to be lemmatized differently depending on whether it is a determiner, as in "de glade tressere" (the happy sixties), or a pronoun, as in "de må lære" (they will have to learn), where it refers either to a formal singular (rarely used today, though some standard phrases use it) or to a group.

Hope it helps!

AngledLuffa commented 10 months ago

Thanks for the explanation!

Any thoughts on the de differences, or on the few exceptions to the apparent UPOS rule for det?

KennethEnevoldsen commented 10 months ago

The de/den lemma difference seems to follow PRON/DET.

The UPOS for "det" does seem inconsistent. The inconsistencies for DET seem to occur in a standard phrase, "i det hele taget" (in general / all in all). I am not sure whether the annotation is correct, but at least it is consistent. For "det" as PRON, it seems to be differentiated by PronType=Prs. Again, I am not sure what the correct answer would be, but at least it is consistent.

Re steg: the two lemmas have different meanings (stege = fry, stige = climb). The cases with stort/stor also seem to be meaning-bearing, e.g. "stort bagud" (greatly behind) vs. "alarmerende stort" (alarmingly big).

KennethEnevoldsen commented 10 months ago

I will close this issue for now, but do reopen it if there is anything else.

AngledLuffa commented 10 months ago

The only thing that comes to mind is the few cases where det or Det is tagged as DET but lemmatized as det instead of den, or tagged PRON with lemma den instead of det. Presumably there needs to be either a few lemma changes or a few tag changes for those sentences.
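
A minimal sketch of how those few sentences could be located, assuming standard CoNLL-U input with `# sent_id = ...` comment lines. The function name `det_exceptions` and the `expected` mapping are illustrative assumptions that encode only the apparent rule from the counts above (DET with lemma den, PRON with lemma det), not an agreed guideline.

```python
def det_exceptions(conllu_lines):
    """Yield (sent_id, token_id, form, upos, lemma) for det/Det tokens
    whose lemma differs from the apparent majority pattern."""
    # Assumption: the majority pattern observed in the counts, not an official rule.
    expected = {"DET": "den", "PRON": "det"}
    prefix = "# sent_id = "
    sent_id = None
    for line in conllu_lines:
        line = line.rstrip("\n")
        if line.startswith(prefix):
            sent_id = line[len(prefix):]
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        tok_id, form, lemma, upos = cols[:4]
        if form.lower() == "det" and upos in expected and lemma != expected[upos]:
            yield (sent_id, tok_id, form, upos, lemma)
```

The flagged tokens are only candidates for review. Whether the lemma or the tag should change in each case is a linguistic decision the script cannot make.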

KennethEnevoldsen commented 10 months ago

If you have the knowledge to correct these, I would be very happy to review a PR.

dan-zeman commented 10 months ago

This is the current situation with lemma den/det (count – form – upos – lemma – feats – deprel):

    921 det PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs nsubj
    782 den DET den Gender=Com|Number=Sing|PronType=Dem det
    683 de DET den Number=Plur|PronType=Dem det
    492 det DET den Gender=Neut|Number=Sing|PronType=Dem det
    188 det PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs obj
    120 den PRON den Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs nsubj
     75 det PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs obl
     42 den PRON den Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs obj
     17 det PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs nmod
     14 de PRON den Number=Plur|PronType=Dem nsubj
     12 det PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs root
     12 den PRON den Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs obl
      9 de PRON den Number=Plur|PronType=Dem nmod
      8 det PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs conj
      6 de PRON den Number=Plur|PronType=Dem obl
      5 det PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs dep
      5 den PRON den Gender=Com|Number=Sing|PronType=Dem obl
      4 det DET det Gender=Neut|Number=Sing|PronType=Dem fixed
      4 den PRON den Gender=Com|Number=Sing|PronType=Dem obj
      4 den PRON den Gender=Com|Number=Sing|PronType=Dem nmod
      4 den PRON den Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs root
      3 det PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs xcomp
      3 det PRON den Gender=Neut|Number=Sing|PronType=Dem nsubj
      3 de PRON den Number=Plur|PronType=Dem obj
      3 den PRON den Gender=Com|Number=Sing|PronType=Dem nsubj
      3 den PRON den Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs nmod
      3 dén DET den Gender=Com|Number=Sing|PronType=Dem det
      2 dét PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs obl
      2 det PRON den Gender=Neut|Number=Sing|PronType=Dem nmod
      2 den PRON den Gender=Com|Number=Sing|PronType=Dem dep
      2 den PRON den Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs iobj
      1 det PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs vocative
      1 dét PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs obj
      1 dét PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs nsubj
      1 det PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs iobj
      1 det PRON det Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs ccomp
      1 det PRON den Gender=Neut|Number=Sing|PronType=Dem obj
      1 det PRON den Gender=Neut|Number=Sing|PronType=Dem conj
      1 de PRON den Number=Plur|PronType=Dem conj
      1 d. DET den Gender=Com|Number=Sing|PronType=Dem det
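
A table of this shape can be rebuilt with a short script. The following is a minimal sketch, not the command actually used for the table above. The names `den_det_table` and `format_table` are hypothetical, and lowercasing the form is an assumption based on the table showing no capitalized forms.

```python
from collections import Counter

def den_det_table(conllu_lines, lemmas=("den", "det")):
    """Count (form, upos, lemma, feats, deprel) tuples for tokens with the given lemmas."""
    rows = Counter()
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        form, lemma, upos, feats, deprel = cols[1], cols[2], cols[3], cols[5], cols[7]
        if lemma in lemmas:
            rows[(form.lower(), upos, lemma, feats, deprel)] += 1
    return rows.most_common()  # most frequent combination first

def format_table(rows):
    """Render rows as 'count form upos lemma feats deprel', one per line."""
    return "\n".join(f"{n:7d} {' '.join(key)}" for key, n in rows)
```

Calling `print(format_table(den_det_table(f)))` on an open CoNLL-U file `f` prints one line per combination.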

My observations:

dan-zeman commented 10 months ago

BTW in the long run it would be useful to find a common solution for Danish, Swedish and Norwegian. That would be an issue for the main issue tracker in docs and I suspect it would be harder to reach consensus.

KennethEnevoldsen commented 10 months ago

An agreement between Danish, Swedish, and Norwegian would be a great idea.

jnivre commented 10 months ago

I completely agree that a common solution for Swedish, Danish and Norwegian would be highly preferable, so why don't we ping @liljao and @LarsAhrenberg and see if we can achieve this.

However, the situation is a little bit more complicated than @dan-zeman's analysis suggests. In Swedish, "den", "det", "de" have at least three distinct uses. (The object form "dem" has a more limited distribution.) They are used as (i) definite articles (corresponding to English "the" but with gender and number agreement), (ii) personal pronouns (corresponding to English "it" and "they", and contrasting in the singular with "han" (he) and "hon" (she)), and (iii) demonstratives (corresponding to English "this" and "that", although the proximal-distal distinction is expressed by adding a pronominal adverb: "den här" = "this" (lit. "it here") vs. "den där" = "that" (lit. "it there")).

For the demonstrative, we could conceivably use a single tag and view the pronominal use ("den (här)" = "this") and the determiner use ("den (här) bilen" = "this car") as an alternation similar to the one in English (if they use a single tag for this). But it would be really weird to tag the personal pronoun use ("it") as DET, or to tag the definite article use ("the") as PRON.

When it comes to lemmatisation, I agree that "den"/"det" should definitely have the same lemma, and this could be extended to "de"/"dem", although there is the complication that "dem" is only ever used as a personal pronoun (never as an article or demonstrative). A further complication is that the distinction "de/dem" is neutralised in speech and informal writing to "dom".

Swedish-Talbanken currently uses different groupings for the pronominal and the article uses. For pronouns, "den" and "det" are both lemmatised to "den" (3rd person singular, inanimate), while "de" and "dem" are both lemmatised to "de" (3rd person plural). I think this is common in many treebanks, which prefer to keep (at least) the six person-number combinations distinct for pronouns. However, when it comes to the article use, Swedish-Talbanken takes a rather extreme approach and lemmatises all of "den", "det" and "de" not to "den" but to "en" (= "a(n)"). In other words, there is a single lemma for all articles, indefinite and definite, singular and plural. This was not a conscious decision on our part, but rather something that was built into the lemmatiser we used (from Språkbanken in Gothenburg).
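
The grouping just described can be written down as a small lookup table. This is only a sketch of the description above, not code or data from Swedish-Talbanken. The keys "pronoun" and "article" are illustrative labels for the two uses rather than treebank tags, and the demonstrative use is left out because its lemmatisation is not described here.

```python
# Lemma per (use, form), as described for Swedish-Talbanken above.
# "pronoun" and "article" are illustrative labels, not UPOS values.
TALBANKEN_LEMMA = {
    "pronoun": {"den": "den", "det": "den", "de": "de", "dem": "de"},
    "article": {"den": "en", "det": "en", "de": "en"},
}

def talbanken_lemma(form, use):
    """Return the described lemma, or None if the combination is not covered."""
    return TALBANKEN_LEMMA.get(use, {}).get(form.lower())
```

Note that "dem" has no entry under "article", since it is only ever used as a personal pronoun.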

I guess the first thing to do is to compare notes between the three languages and see how different they are (regardless of consistency errors, of which there seem to be a few also in Swedish-Talbanken).

dan-zeman commented 10 months ago

Thanks, Joakim. May I suggest that the cross-language-treebank discussion is continued in https://github.com/UniversalDependencies/docs/issues/992, which I created after contributing here? I added statistics from the other treebanks there, and also stepped back from some of the more radical thoughts I previously expressed here :-)

jnivre commented 10 months ago

Sure. I am just trying to catch up. 😊
