Closed dan-zeman closed 10 months ago
@jnivre wrote: I completely agree that a common solution for Swedish, Danish and Norwegian would be highly preferable, so why don't we ping @liljao and @LarsAhrenberg and see if we can achieve this.
I agree too. I find many of Dan's suggestions too radical, however, at least for Swedish, as there are uses, in particular for the singular determiners den/det that need to be, and can be, distinguished systematically. I look forward to a joint discussion.
@dan-zeman I think all your suggestions make good sense.
Determiner uses should have UPOS DET, PronType=Art and appropriate Gender and Number features (but not Case or Person, which are not relevant there, at least not in Swedish).
Personal pronoun uses should have UPOS PRON, PronType=Prs and appropriate Gender, Number, Case and Person features.
Lemmatization should definitely neutralise gender (bringing together "den" and "det") and case (bringing together "de" and "dem"), possibly also number for determiners (adding "de" to "den" and "det") but probably not for pronouns (keeping 3rd person singular distinct from 3rd person plural).
The tricky part is what to do with demonstratives, which overlap with both determiner and pronoun uses. Starting with the former, there is a contrast between "bilen" (the car) and "den bilen" (that car). However, this distinction is neutralised when there is an adjectival modifier, because that triggers article doubling. Hence "den röda bilen" is ambiguous in written Swedish between an article reading ("the red car") and a demonstrative reading ("that red car"), which would be disambiguated by stress in spoken Swedish (the demonstrative reading having stress on "den").
When it comes to pronominal uses, it is customary to treat "den", "det", "de" and "dem" as demonstrative pronouns (at least) when they are followed by one of the pronominal adverbs "här" (here) and "där" (there), which encode the proximal-distal distinction (corresponding to English "this" vs. "that"). The question, however, is whether we need to follow this tradition, or whether we could say that it is only the phrase "den här/där" that has a demonstrative function and that the constituent "den" is just an ordinary personal pronoun. Finally, there is the question of whether "den/det/de" by themselves can be considered demonstratives when emphasised, or whether that can also be treated as a pragmatic function, rather than as a lexical property. Note also that there is a corresponding series of true demonstratives "denna" (cf. "den"), "detta" (cf. "det"), "dessa" (cf. "de/dem").
I look forward to hearing everyone's view on these thoughts, as well as additional information about Danish and Norwegian.
A proposal for Swedish on the assumption that we continue to distinguish the traditionally demonstrative forms from the other PronTypes. This gives four alternatives for each of the words de/den/det in Swedish, two as DET and two as PRON. The word dem gets one description.
DET de den Definite=Def|Number=Plur|PronType=Art (de mörka nätterna ~ the dark nights)
DET de den Definite=Def|Number=Plur|PronType=Dem (de nätter/na ~ those nights)
DET den den Definite=Def|Gender=Com|Number=Sing|PronType=Art (den mörka natten ~ the dark night)
DET den den Definite=Def|Gender=Com|Number=Sing|PronType=Dem (den natt/en, den här/där natten ~ that night)
DET det den Definite=Def|Gender=Neut|Number=Sing|PronType=Art (det mörka rummet ~ the dark room)
DET det den Definite=Def|Gender=Neut|Number=Sing|PronType=Dem (det här/där rummet ~ this/that room)
PRON de de Definite=Def|Number=Plur|PronType=Dem (de här, de där irrespective of nominal deprel ~ these/those)
PRON de de Case=Nom|Definite=Def|Number=Plur|Person=3|PronType=Prs (de såg oss ~ they saw us)
PRON dem de Case=Acc|Definite=Def|Number=Plur|Person=3|PronType=Prs (vi såg dem ~ we saw them)
PRON den den Definite=Def|Gender=Com|Number=Sing|PronType=Dem (den här/där irrespective of nominal deprel)
PRON den den Definite=Def|Gender=Com|Number=Sing|Person=3|PronType=Prs (jag såg den, den såg mig ~ I saw it, ...)
PRON det den Definite=Def|Gender=Neut|Number=Sing|Person=3|PronType=Prs (jag såg det, det regnar ~ I saw it)
PRON det den Definite=Def|Gender=Neut|Number=Sing|PronType=Dem (det här/där irrespective of nominal deprel)
I keep 'de' as the lemma for the PRON de as it forms a paradigm with the possessive: de, dem, deras, while the singular forms do not separate subject and object forms. With this logic, however, we get different lemmas as DET and as PRON for dom, a form that is becoming more and more common also in written language:
DET dom den Definite=Def|Number=Plur|PronType=Art (or Dem)
PRON dom de Definite=Def|Number=Plur|Person=3|PronType=Prs
Thanks, @LarsAhrenberg. This looks good to me. The fact that we get different lemmas for different uses of "dom" is perhaps a little annoying, but it is perfectly consistent with the fact that we also get different lemmas for different uses of "de". Or am I missing the point here?
The only way to avoid this would be to separate singular and plural forms also for the determiner uses. Did you consider that?
Finally, I observe that this gives up the idea of grouping definite articles together with indefinite articles by using the lemma "en" for all article uses of "den", "det" and "de". Personally, I think this is an improvement.
Shall we wait and see what our Danish and Norwegian colleagues have to say before we make a decision?
@AngledLuffa suggested that we might want to extend the discussion to include Icelandic and Faroese as well.
@jnivre, I didn't actually consider having separate singular and plural forms also for the determiner uses, but it is a definite possibility.
It seems that this is what Danish and Norwegian do (except for a few cases that could be errors).
@jnivre wrote: I completely agree that a common solution for Swedish, Danish and Norwegian would be highly preferable, so why don't we ping @liljao and @LarsAhrenberg and see if we can achieve this.
Pinging also @peresolb who did the most recent changes in both Norwegian treebanks.
@AngledLuffa suggested that we might want to extend the discussion to include Icelandic and Faroese as well.
When creating the issue I considered adding statistics from Icelandic and Faroese too. But then I thought that things might get too complicated because they seem to have preserved more morphological variability (also in Faroese). But if a consensus among Danish, Swedish and Norwegian can be found at all, then it definitely won't hurt to see if some of the ideas can be projected to Faroese and Icelandic.
Agreed. Let’s do it in two steps.
Hi all!
Question: Could we select either Dem or Art and stick to it in all cases where these words are tagged DET? For those that are currently PronType=Prs, could it be decided that they either should be PRON, or their PronType should be changed?
I see no problem with that. We can change all DET-tagged instances to PronType=Art. There seem to be only two DET cases with PronType=Prs in the Norwegian treebanks, and those can be changed to Art too. There is a determiner/article contrast like the one @jnivre mentions in Norwegian too. However, the contrast isn't reflected in the analyses in the Norwegian treebanks and we don't have the bandwidth to try to introduce it, so I think we will stick to PronType=Art consistently.
Lemmatization: I would have expected one lemma (probably den) for all these forms but it is definitely not the case and perhaps it is also not desired. ... Swedish LinES has mostly den as the lemma; the exception is plural de when tagged PRON (and not DET), which has lemma de in 507 cases (352 nominative de, 155 accusative dem) and only in 3 cases it is lemmatized as den. Question: Is there any chance we could get closer at least to the approach taken in LinES?
I am fine with setting "den" as lemma for all DET uses and "de" and "den" for the PRON uses, as @LarsAhrenberg suggests.
Question: Could Norwegian use Gender=Com for den?
We don't seem to use Gender=Com at all in Norwegian for the time being, but we could replace all Gender=Fem,Masc with Gender=Com.
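The two Norwegian changes discussed above (PronType=Art on all DET instances of these words, and Gender=Com in place of Gender=Fem,Masc) could be sketched roughly as follows. This is only an illustration operating on a CoNLL-U FEATS string; the function names and the set of target forms are assumptions made for the example, not code from any treebank repository.

```python
# Hypothetical helper names; only the feature values come from the discussion.
TARGET_FORMS = {"den", "det", "de", "dem"}  # assumed set of affected word forms


def parse_feats(feats):
    """Parse a CoNLL-U FEATS string into a dict ('_' means no features)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def format_feats(feats):
    """Serialize a feature dict, sorted case-insensitively by feature name."""
    if not feats:
        return "_"
    return "|".join(f"{k}={feats[k]}" for k in sorted(feats, key=str.lower))


def normalize_norwegian(form, upos, feats):
    """Apply the two proposed changes to one token's FEATS string."""
    if form.lower() not in TARGET_FORMS:
        return feats
    parsed = parse_feats(feats)
    if upos == "DET":
        # All DET instances get PronType=Art (replacing Dem or Prs).
        parsed["PronType"] = "Art"
    if parsed.get("Gender") == "Fem,Masc":
        # Collapse the combined value into common gender.
        parsed["Gender"] = "Com"
    return format_feats(parsed)
```

For example, `normalize_norwegian("den", "DET", "Gender=Fem,Masc|Number=Sing|PronType=Dem")` would yield `Gender=Com|Number=Sing|PronType=Art`, while PRON instances keep their PronType.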
Thanks, @peresolb. It seems that we are converging on using separate lemmas for singular and plural for pronouns, but a single lemma for the determiner uses. Concerning demonstratives, it seems that Swedish LinES is the only treebank that really makes a distinction between demonstrative and non-demonstrative uses, while the other treebanks (with the possible exception of Danish DDT) use either Dem or Art but not both. If this is the case, then I agree that it would probably require too much manual work to add this distinction to the annotations.
My interpretation for Swedish, then, is as follows, with a single description for each token, modulo the part-of-speech:
DET de de Definite=Def|Number=Plur|PronType=Art (de mörka/här nätterna ~ the dark nights, these nights)
DET dom de Definite=Def|Number=Plur|PronType=Art (dom mörka/här nätterna ~ the dark nights, these nights)
DET den den Definite=Def|Gender=Com|Number=Sing|PronType=Art (den mörka/här natten ~ the dark night, this night)
DET det den Definite=Def|Gender=Neut|Number=Sing|PronType=Art (det mörka/här rummet ~ the dark room, this room)
PRON de de Case=Nom|Definite=Def|Number=Plur|Person=3|PronType=Prs (de såg oss, de här ~ they saw us, these (guys))
PRON dem de Case=Acc|Definite=Def|Number=Plur|Person=3|PronType=Prs (vi såg dem ~ we saw them)
PRON dom de Definite=Def|Number=Plur|Person=3|PronType=Prs (vi såg dom, dom såg oss ~ we saw them, they saw us)
PRON den den Definite=Def|Gender=Com|Number=Sing|Person=3|PronType=Prs (jag såg den, den såg mig ~ I saw it, ...)
PRON det den Definite=Def|Gender=Neut|Number=Sing|Person=3|PronType=Prs (jag såg det, det regnar ~ I saw it)
I assume, though, that PronType=Dem will still be used for the words denna, detta, dessa.
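To make the "single description for each token, modulo the part-of-speech" idea concrete, the entries listed above can be transcribed into a small lookup table. This is a minimal sketch: the table name and the `annotate` function are hypothetical, and only the lemmas and feature strings are taken from the proposal as written in this comment.

```python
# Keys are (lowercased form, UPOS); values are (lemma, FEATS string),
# transcribed from the Swedish proposal above. Names are illustrative only.
PROPOSAL = {
    ("de", "DET"): ("de", "Definite=Def|Number=Plur|PronType=Art"),
    ("dom", "DET"): ("de", "Definite=Def|Number=Plur|PronType=Art"),
    ("den", "DET"): ("den", "Definite=Def|Gender=Com|Number=Sing|PronType=Art"),
    ("det", "DET"): ("den", "Definite=Def|Gender=Neut|Number=Sing|PronType=Art"),
    ("de", "PRON"): ("de", "Case=Nom|Definite=Def|Number=Plur|Person=3|PronType=Prs"),
    ("dem", "PRON"): ("de", "Case=Acc|Definite=Def|Number=Plur|Person=3|PronType=Prs"),
    ("dom", "PRON"): ("de", "Definite=Def|Number=Plur|Person=3|PronType=Prs"),
    ("den", "PRON"): ("den", "Definite=Def|Gender=Com|Number=Sing|Person=3|PronType=Prs"),
    ("det", "PRON"): ("den", "Definite=Def|Gender=Neut|Number=Sing|Person=3|PronType=Prs"),
}


def annotate(form, upos):
    """Return (lemma, feats) for a covered (form, UPOS) pair, else None."""
    return PROPOSAL.get((form.lower(), upos))
```

Because the key includes the UPOS tag, deciding between DET and PRON remains a separate (syntactic) decision; forms such as denna, detta and dessa are deliberately absent from the table and return None.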
Do you think that Definite=Def is useful/needed with the PRON tag? Or is it because you regularly have it also on all nouns? (And does it mean that you would have it on other pronouns, too?)
I believe it has been there in both Swedish_LinES and Talbanken ever since they were created. And it is used on nouns as well as pronouns. For instance, indefinite pronouns such as 'man' (one) and 'någonting' (something) are both Definite=Ind and PronType=Ind. This may be regarded as an unnecessary duplication, but on the other hand I don't see what harm it may cause.
Thanks @LarsAhrenberg. I assume this means that you are okay with losing the distinction between PronType=Art and PronType=Dem for DET and between PronType=Prs and PronType=Dem for PRON. I completely agree that PronType=Dem should be retained for "denna", etc.
I am still not sure what I think about having different lemmas for the singular and plural articles, but it does simplify things since "de", "dem" and "dom" will always be lemmatised "de", regardless of part-of-speech tag. What do others think?
I seem to have missed out on this conversation, and it seems to have died out despite the agreement. I don't mind doing the work for Danish. It seems like what needs to be done is:
PronType=Prs (Pos: PRON) and PronType=Art (Pos: DET; corrected from PronType=Dem)
Anything I missed out on?
I don't think we have converged completely yet, but the current proposal is to mainly use PronType=Prs with PRON and PronType=Art (not PronType=Dem) with DET, since the distinction between PronType=Art and PronType=Dem is hard to make in written text. I assume Swedish and Danish are similar enough to use the same analysis here.
Thanks for the clarification, @jnivre. I have corrected the above comment to PronType=Art. The hope with the comment was to take the last steps toward reaching a consensus.
Thanks, @KennethEnevoldsen. I do think we have a coherent proposal now. So, unless anyone has additional thoughts, I think we should just go ahead and implement it in our various treebanks.
Here is a summary of the consensus as I understand it for "den", "det", "de", "dem" (and variants):
@LarsAhrenberg @peresolb @KennethEnevoldsen @dan-zeman If everyone agrees, we can close this issue and start implementing this in all our treebanks.
I agree.
Fine with me. Thanks for sorting this out!
Closed after reaching consensus. Everyone will do their best to implement this in their respective treebanks before the next release.
I am wondering whether annotation of den/dén/det/dét/de/dem can be unified across the Scandinavian languages. There are differences (and inconsistencies) in lemmatization, UPOS tags (DET vs. PRON) and features. This is a generalization of https://github.com/UniversalDependencies/UD_Danish-DDT/issues/10.
Here is the current situation (attested annotations, for now without counts):
Danish DDT:
Swedish Talbanken:
Swedish PUD:
Swedish LinES:
Norwegian Bokmaal:
Norwegian Nynorsk:
There is a consensus that some occurrences should be tagged DET and others PRON, so I am not going to challenge that for now. I will also ignore the occasional occurrences of other tags (ADP, ADJ, ADV, PROPN, X). I have not examined them in context.
As for PronType, the determiners are mostly Dem in Danish and Norwegian, and mostly Art (with Definite=Def) in Swedish. But LinES uses both Art and Dem, and there are also occurrences of PronType=Prs in Talbanken and the Norwegian treebanks. Question: Could we select either Dem or Art and stick to it in all cases where these words are tagged DET? For those that are currently PronType=Prs, could it be decided that they either should be PRON, or their PronType should be changed?
The PronType of the PRON instances is either Prs or Dem in Danish and Norwegian; Prs, Ind, Rel, Tot, Art (!) in Talbanken; Prs or empty in Swedish PUD; Prs, Dem, Art in LinES. Question: Could it be always Prs in Swedish, too (as in Danish and Norwegian)?
Lemmatization: I would have expected one lemma (probably den) for all these forms but it is definitely not the case and perhaps it is also not desired. What is always normalized is the accented version (dén vs. den, dét vs. det). Case is also normalized for the 3PL pronoun de "they" (nominative) vs. dem "them" (accusative); other forms do not seem to distinguish case. Gender and number sometimes are and sometimes are not normalized. So most Danish plural pronouns de are lemmatized to the plural form, but some of them have the singular lemma den. The neuter singular det is usually lemmatized as det but sometimes as the common gender form den (while it is never normalized from den to det). Danish also keeps a separate lemma De for the polite 2nd person address (taken from third person plural but capitalized). The two Norwegian treebanks mostly keep separate lemmas for the two singular genders and for the plural, but there are occasional outliers that break this rule (14 instances of den lemmatized as det in Bokmaal). Swedish LinES has mostly den as the lemma; the exception is plural de when tagged PRON (and not DET), which has lemma de in 507 cases (352 nominative de, 155 accusative dem) and only in 3 cases it is lemmatized as den. Talbanken has a mixture of approaches; normalizing gender (det to den) seems to be the norm, although not kept 100%, plural stays separate, and in addition some of the words (both DET and PRON!) are lemmatized to the indefinite article en. PUD has only 2 occurrences of en as lemma, otherwise determiners (both singular and plural) are lemmatized mostly to den, pronouns to their own gender/number (den to den, det to det, de and dem to de). Question: Is there any chance we could get closer at least to the approach taken in LinES?
Features other than PronType and Definite: The Number feature seems to be used everywhere (den, det are Sing, de, dem are Plur) except for Nynorsk, which does not annotate singular. Gender is distinguished for the singular forms (den is Com, det is Neut), but the Norwegian treebanks ignore the Gender=Com feature and use Masc, Fem, or Fem,Masc. Question: Could Norwegian use Gender=Com for den? In Danish and Norwegian, Person=3 accompanies most of the personal pronoun instances, with occasional Person=2 for the polite addresses in Danish and Nynorsk. Swedish mostly lacks the feature, except a few instances in LinES. Question: Could Person=3 be added also in Swedish for pronouns (not for determiners)? Case is mostly used for plural pronouns to distinguish de (Nom) from dem (Acc) but Danish also has case for singular pronouns (probably incorrectly anyway; it should be removed) and LinES has it with plural determiners (probably to be removed too?).
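A survey of attested annotations like the one described here can be reproduced approximately with a short script over CoNLL-U data. The following is a rough sketch, not the script actually used for this issue: the function name and the set of surveyed forms are assumptions, and it relies only on the standard 10-column CoNLL-U layout (FORM, LEMMA, UPOS and FEATS in columns 2, 3, 4 and 6).

```python
from collections import Counter

# Assumed set of surveyed forms (compared after lowercasing).
FORMS = {"den", "dén", "det", "dét", "de", "dem"}


def attested_annotations(conllu_text):
    """Count (form, UPOS, lemma, FEATS) combinations for the surveyed forms."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence boundaries and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        form, lemma, upos, feats = cols[1], cols[2], cols[3], cols[5]
        if form.lower() in FORMS:
            counts[(form.lower(), upos, lemma, feats)] += 1
    return counts
```

Running this over each treebank's files and printing `counts.most_common()` would give one row per attested combination, which makes outliers (such as an unexpected lemma for a given form) easy to spot.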