UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Treatment of split "what a ((ADJ) NOUN)" construction in Low Saxon and Dutch #1021

Closed jasiewert closed 6 months ago

jasiewert commented 8 months ago

In Low Saxon, as well as in Dutch, the construction corresponding to English "what a (ADJ) NOUN" in expressions of surprise can be interrupted by other elements.

In the English GUM treebank, they call this a det:predet, a relation that is not documented in the universal relation index (https://universaldependencies.org/u/dep/index.html) and only listed in the language-specific guidelines.

# sent_id = GUM_vlog_lipstick-41
# addressee = Hershey
# s_prominence = 3
# s_type = other
# speaker = AlyssaMarie
# transition = establishment
# text = What a sweet baby.
# newpar
# newpar_block = sp who:::"#AlyssaMarie" whom:::"#Hershey" (1 s)
1   What    what    DET WDT PronType=Int    4   det:predet  4:det:predet    Discourse=evaluation-comment:56->57:0:lex-indwd-387
2   a   a   DET DT  Definite=Ind|PronType=Art   4   det 4:det   Entity=(33-animal-giv:act-cf1*-3-coref
3   sweet   sweet   ADJ JJ  Degree=Pos  4   amod    4:amod  _
4   baby    baby    NOUN    NN  Number=Sing 0   root    0:root  Cxn=Exclamative-What|Entity=33)|MSeg=bab-y|SpaceAfter=No
5   .   .   PUNCT   .   _   4   punct   4:punct _

In the Dutch Alpino treebank, however, the interrogative(?) pronoun wat and the indefinite article een are treated as fixed even when they are separated by other elements:

# source = Treebank/cgn_exs/257.xml
# sent_id = cgn_exs\257
# text = wat een boeken heeft die man gelezen!
# auto = ALUD2.8.5
1   wat wat PRON    VNW|excl|pron|stan|vol|3|getal  Person=3    3   det 3:det   _
2   een een DET LID|onbep|stan|agr  Definite=Ind    1   fixed   1:fixed _
3   boeken  boek    NOUN    N|soort|mv|basis    Number=Plur 7   obj 7:obj   _
4   heeft   hebben  AUX WW|pv|tgw|met-t Number=Sing|Tense=Pres|VerbForm=Fin 7   aux 7:aux   _
5   die die DET VNW|aanw|det|stan|prenom|zonder|rest    _   6   det 6:det   _
6   man man NOUN    N|soort|ev|basis|zijd|stan  Gender=Com|Number=Sing  7   nsubj   7:nsubj _
7   gelezen lezen   VERB    WW|vd|vrij|zonder   VerbForm=Part   0   root    0:root  SpaceAfter=No
8   !   !   PUNCT   LET _   7   punct   7:punct _
# source = Treebank/cgn_exs/258.xml
# sent_id = cgn_exs\258
# text = wat heeft die man een boeken gelezen!
# auto = ALUD2.8.5
1   wat wat PRON    VNW|excl|pron|stan|vol|3|getal  Person=3    6   det 6:det   _
2   heeft   hebben  AUX WW|pv|tgw|met-t Number=Sing|Tense=Pres|VerbForm=Fin 7   aux 7:aux   _
3   die die DET VNW|aanw|det|stan|prenom|zonder|rest    _   4   det 4:det   _
4   man man NOUN    N|soort|ev|basis|zijd|stan  Gender=Com|Number=Sing  7   nsubj   7:nsubj _
5   een een DET LID|onbep|stan|agr  Definite=Ind    1   fixed   1:fixed _
6   boeken  boek    NOUN    N|soort|mv|basis    Number=Plur 7   obj 7:obj   _
7   gelezen lezen   VERB    WW|vd|vrij|zonder   VerbForm=Part   0   root    0:root  SpaceAfter=No
8   !   !   PUNCT   LET _   7   punct   7:punct _

A construction that can be separated by several words belonging to different syntactic phrases does not sound very fixed to me. Neither does a det(:predet) relation seem intuitive if the supposed predeterminer can be placed several syntactic phrases apart from the noun it determines. Should I nevertheless follow the English or the Dutch analysis in my Low Saxon treebank, or would a different analysis be more appropriate?

dan-zeman commented 8 months ago

I would slightly lean towards det (the :predet subtype is not needed but it could be defined and documented for the language, too). A "fixed" expression with gaps will not render the treebank invalid but it will yield warnings (as it does for Dutch).
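To make the point about gaps concrete, here is a minimal Python sketch that flags `fixed` dependents that do not form a contiguous span with their head. It is not the UD validator's actual logic, only an illustration; the function name `fixed_gaps` and the reduced `(id, form, head, deprel)` token tuples are assumptions made for this example. The two token lists are cut down from the Alpino sentences quoted in the opening post.

```python
# Each token is (id, form, head, deprel), reduced from the CoNLL-U rows of
# the two Alpino sentences quoted in the opening post.
CONTIGUOUS = [  # wat een boeken heeft die man gelezen!
    (1, "wat", 3, "det"), (2, "een", 1, "fixed"), (3, "boeken", 7, "obj"),
    (4, "heeft", 7, "aux"), (5, "die", 6, "det"), (6, "man", 7, "nsubj"),
    (7, "gelezen", 0, "root"), (8, "!", 7, "punct"),
]
SPLIT = [  # wat heeft die man een boeken gelezen!
    (1, "wat", 6, "det"), (2, "heeft", 7, "aux"), (3, "die", 4, "det"),
    (4, "man", 7, "nsubj"), (5, "een", 1, "fixed"), (6, "boeken", 7, "obj"),
    (7, "gelezen", 0, "root"), (8, "!", 7, "punct"),
]


def fixed_gaps(tokens):
    """Return (head_id, dependent_id) pairs where a `fixed` dependent is
    separated from the rest of its expression by other tokens."""
    spans = {}
    for tok_id, _form, head, deprel in tokens:
        if deprel == "fixed":
            # Collect the head together with all of its fixed dependents.
            spans.setdefault(head, [head]).append(tok_id)
    gaps = []
    for head, ids in spans.items():
        ids.sort()
        for prev, cur in zip(ids, ids[1:]):
            if cur != prev + 1:  # something intervenes between the parts
                gaps.append((head, cur))
    return gaps


print(fixed_gaps(CONTIGUOUS))  # []
print(fixed_gaps(SPLIT))       # [(1, 5)]
```

Only the split sentence is reported, which matches the observation that the discontinuous cases are the ones that draw warnings.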

gossebouma commented 8 months ago

Perhaps there is inspiration to be found in the German treebank? I am thinking of the German equivalent (assuming those are similar) to the Dutch split "wat voor" cases.

Wat heb je voor (een) boeken gelezen?

GJ

On Mon, Mar 25, 2024 at 10:45 AM @. @.> wrote:

The UD validator complains about this type of annotation as well, so we may need to reconsider it ;-) It is actually a relic from the underlying treebank (the syntactically annotated part of the Corpus of Spoken Dutch), where this was annotated as a discontinuous multi-word expression (stretching the notion of MWE a bit). It is true that the construction is exceptional, and there are cases like the first Dutch example that do suggest that there is a determiner 'wat een' that should be analyzed as an MWE. As soon as you do that, however, the discontinuous cases become a problem. Suggestions for a better solution are welcome!

dan-zeman commented 8 months ago

I suppose the German equivalent is Was für ein X! I found three instances in German GSD (among the 8 hits returned by this query), so there are two other approaches as inspiration :-) One of them seems wrong to me; the other is

nmod(was, X)
case(X, für)
det(X, ein)

which I sort of like.
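As an illustration only, the three relations above can be written out for a concrete sentence. The sentence "Was für ein Buch!" is made up for this sketch and is not one of the German GSD hits; treating "Was" as the root and attaching the punctuation to it are likewise assumptions. The helper `relations` is a hypothetical name.

```python
# Hypothetical example sentence, not taken from German GSD.
# Each entry is (id, form, head, deprel); head 0 marks the root.
TOKENS = [
    (1, "Was", 0, "root"),    # assumed root of the exclamation
    (2, "für", 4, "case"),
    (3, "ein", 4, "det"),
    (4, "Buch", 1, "nmod"),
    (5, "!", 1, "punct"),     # assumed attachment
]


def relations(tokens):
    """Render every non-root token as deprel(head_form, dependent_form)."""
    forms = {tok_id: form for tok_id, form, _head, _deprel in tokens}
    return [
        f"{deprel}({forms[head]}, {form})"
        for _tok_id, form, head, deprel in tokens
        if head != 0
    ]


for rel in relations(TOKENS):
    print(rel)
```

With X = Buch, the output contains `nmod(Was, Buch)`, `case(Buch, für)` and `det(Buch, ein)`, i.e. the pattern listed above.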

jasiewert commented 8 months ago

Another reason why I am hesitant to use the det relation is the fact that wat can be used in the same way in other expressions of surprise as well:

# sent_id = LSDC_011_DNS_1904_HAM_bahnmeester_dood
# text_orig = Wat mien Süsterdochter fix is: de kann Gedanken läsen!
# text = Wat myn süsterdochter fiks is: dee kan gedanken leasen!
1   Wat wat PRON    _   _   4   ? _   lemma_gml=watte
2   myn myn DET _   Case=Nom|Number=Sing|Number[psor]=Sing|Person[psor]=1|Poss=Yes|PronType=Prs 3   det:poss    _   lemma_gml=mîn
3   süsterdochter   süsterdochter   NOUN    _   Case=Nom|Gender=Fem|Number=Sing 4   nsubj   _   lemma_gml=süsterdochter
4   fiks    fiks    ADJ _   Degree=Pos  0   root    _   _
5   is  weasen  AUX _   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|VerbType=Cop  4   cop _   lemma_gml=wēsen|SpaceAfter=No
6   :   :   PUNCT   _   _   10  punct   _   _
7   dee dee PRON    _   Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Dem   10  nsubj   _   lemma_gml=dê
8   kan künnen  AUX _   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|VerbType=Aux  10  aux _   lemma_gml=künnen
9   gedanken    gedanke NOUN    _   Case=Acc|Gender=Masc|Number=Plur    10  obj _   lemma_gml=gedanke
10  leasen  leasen  VERB    _   VerbForm=Inf    4   parataxis   _   lemma_gml=lēsen|SpaceAfter=No
11  !   !   PUNCT   _   _   10  punct   _   _

I think a comparable usage is possible in Dutch as well, but I do not have native speaker intuition there. In the example above, there is nothing that wat could be a determiner of, but it intuitively feels like the same usage as in:

# sent_id = LSDC_0297_NWF_1882_OVY_deventer_t.w._van_marie_-_de_bruud_en_de_wedevrouwe
# text_orig = Jonges, jonges, wat zi'j en mooi knap bruudjen 'eworden!
# text = Junges, junges, wat sin y en mooi knap brüüdjen eworden!
1   Junges  junge   NOUN    _   Case=Nom|Gender=Masc|Number=Plur    12  vocative    _   lemma_gml=junge|SpaceAfter=No
2   ,   ,   PUNCT   _   _   3   punct   _   _
3   junges  junge   NOUN    _   Case=Nom|Gender=Masc|Number=Plur    1   conj    _   lemma_gml=junge|SpaceAfter=No
4   ,   ,   PUNCT   _   _   12  punct   _   _
5   wat wat PRON    _   Case=Nom|Gender=Neut|Number=Sing|PronType=Int   11  ?   _   lemma_gml=watte
6   sin weasen  AUX _   Mood=Ind|Number=Plur,Sing|Person=2|Tense=Pres   12  aux _   lemma_gml=wēsen
7   y   jy  PRON    _   Case=Nom|Number=Plur,Sing|Person=2|PronType=Prs 12  nsubj   _   lemma_gml=gî
8   en  en  DET _   Case=Nom|Definite=Ind|Gender=Neut|Number=Sing|PronType=Art  11  det _   lemma_gml=êin,êine,êin
9   mooi    mooi    ADJ _   Case=Nom|Degree=Pos|Gender=Neut|Number=Sing 11  amod    _   lemma_gml=mö̂ye
10  knap    knap    ADJ _   Case=Nom|Degree=Pos|Gender=Neut|Number=Sing 11  amod    _   lemma_gml=knap
11  brüüdjen    brüüdken    NOUN    _   Case=Nom|Gender=Neut|Number=Sing    12  xcomp   _   lemma_gml=brü̂deken
12  eworden werden  VERB    _   Tense=Past|VerbForm=Part    0   root    _   lemma_gml=wērden|SpaceAfter=No
13  !   !   PUNCT   _   _   12  punct   _   _
jasiewert commented 8 months ago

In the second example, an interpretation as nmod might work, but this is not possible for the kind of usage in the first example.

jasiewert commented 8 months ago

There is indeed a Dutch example of such a usage in the Alpino treebank. Here, it is analysed as obl:

# source = Treebank/eans/02_02_05_i.xml
# sent_id = eans\02_02_05_i
# text = Wat typt deze machine zwaar!
# auto = ALUD2.8.5
1   Wat wat PRON    VNW|excl|pron|stan|vol|3|getal  Person=3    2   obl 2:obl   _
2   typt    typen   VERB    WW|pv|tgw|met-t Number=Sing|Tense=Pres|VerbForm=Fin 0   root    0:root  _
3   deze    deze    DET VNW|aanw|det|stan|prenom|met-e|rest _   4   det 4:det   _
4   machine machine NOUN    N|soort|ev|basis|zijd|stan  Gender=Com|Number=Sing  2   nsubj   2:nsubj _
5   zwaar   zwaar   ADJ ADJ|vrij|basis|zonder   Degree=Pos  2   advmod  2:advmod    SpaceAfter=No
6   !   !   PUNCT   LET _   2   punct   2:punct _
gossebouma commented 8 months ago

That is an example sentence from a Dutch reference grammar, Algemene Nederlandse Spraakkunst. See here and here for discussion (in Dutch) of the 'wat voor (een)' cases. The authors do consider 'wat voor (een)' to be a (complex) determiner. About the discontinuous cases, they say that the interrogative pronoun, if it is part of the NP, can be separated from the rest of the NP. The rest of the NP can be moved rightward, but the WH-pronoun has to remain in sentence-initial position. This would suggest constituenthood at some level, so maybe the predet solution is a way out.

jasiewert commented 8 months ago

Using the predet solution would, however, mean that the wat in "Wat myn süsterdochter en mooi brüüdjen is" needs to be analysed differently from comparable cases such as "Wat myn süsterdochter fiks is" or "Wat myn süsterdochter mooi singen kan". I would prefer to analyse wat in the same way in all of them.

nschneid commented 8 months ago

In English, exclamative "what" can mark indefinite nominals whether singular ("what a great book") or plural ("what great books"), so we consider it a predeterminer.

The combination "many a" is arguably a fixed expression (a complex determiner), but in UD we are treating "many" as a predeterminer as well. (Though the traditions for tagging/parsing predeterminers in English are questionable: UniversalDependencies/UD_English-EWT#412)

jasiewert commented 8 months ago

Here are English translations and glosses of the last three Low Saxon examples. I hope this makes it clear that the construction does not directly correspond to the English one:

A) Wat myn süsterdochter en mooi brüüdjen is! - What a beautiful bride my niece is!

Wat myn süsterdochter   en  mooi    brüüdjen    is
what    my  niece   a   beautiful   bride   is

B) Wat myn süsterdochter fiks is! - How smart my niece is!

Wat myn süsterdochter   fiks    is
what    my  niece   smart   is

C) Wat myn süsterdochter mooi singen kan! - How beautifully my niece can sing!

Wat myn süsterdochter   mooi    singen  kan
what    my  niece   beautifully sing    can

In all three examples, it is also possible to put the verb in the second position, e.g. Wat is myn süsterdochter fiks!

Example A might be analysed in the same way as English, but the analysis as a predeterminer does not work for B and C. As a Low Saxon speaker, it would feel very unintuitive to me to treat example A differently from example B. Also example C can probably be considered to represent the same type of construction.

nschneid commented 8 months ago

Is "myn süsterdochter" the subject? Maybe "wat" is just an adverb in all 3 cases, more like English "how" than "what"?

jasiewert commented 8 months ago

Yes, "myn süsterdochter" is the subject in all three examples. "Wat" is generally just an ordinary interrogative pronoun like its English cognate "what", but I agree that it does not seem to behave like an interrogative pronoun here.

gossebouma commented 8 months ago

I am not sure that a single analysis for sentence-initial 'wat' can work.

There seems to be a consensus in the literature that 'wat een X' is a phrase, with a complex determiner of some kind. Arguments for this analysis, I think, are:

wat een knappe kinderen zijn jullie
what a smart kids you are

As Dutch is a pretty strict V2 language, the fact that 'wat een N' precedes the verb 'zijn' is a strong argument for constituenthood, thus arguing against an analysis where 'wat' would be an adverb that is not part of the NP. Second, in split cases like

wat zijn jullie een knappe kinderen
what are you a smart kids

there is the peculiarity that the determiner 'een', which normally can only occur with singular count nouns, precedes a plural noun. One explanation is to assume that we are dealing with the complex determiner 'wat een', despite the fact that it is discontinuous.

A det:predet analysis for 'wat' therefore seems best to me.

Cases where 'wat' occurs sentence-initially but not in combination with 'een + N', as in the example from Alpino you mention, must be analyzed differently, i.e. with 'wat' being an obl.
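The proposal in this comment can be sketched as a small conversion over the current Alpino annotation. This is only an illustration under stated assumptions: the helper names `make` and `fixed_to_predet` and the reduced token dicts are hypothetical, and `det:predet` is not currently a documented subtype for Dutch. The two inputs are cut down from the Alpino sentences quoted earlier in the thread.

```python
def make(rows):
    """Turn (id, form, head, deprel) tuples into token dicts."""
    keys = ("id", "form", "head", "deprel")
    return [dict(zip(keys, row)) for row in rows]


# Reduced from the split Alpino example: wat heeft die man een boeken gelezen!
SPLIT = make([
    (1, "wat", 6, "det"), (2, "heeft", 7, "aux"), (3, "die", 4, "det"),
    (4, "man", 7, "nsubj"), (5, "een", 1, "fixed"), (6, "boeken", 7, "obj"),
    (7, "gelezen", 0, "root"), (8, "!", 7, "punct"),
])
# Reduced from the Alpino example without 'een + N': Wat typt deze machine zwaar!
NO_EEN = make([
    (1, "Wat", 2, "obl"), (2, "typt", 0, "root"), (3, "deze", 4, "det"),
    (4, "machine", 2, "nsubj"), (5, "zwaar", 2, "advmod"), (6, "!", 2, "punct"),
])


def fixed_to_predet(tokens):
    """Hypothetical helper: reattach 'een' to the noun and relabel 'wat'."""
    by_id = {tok["id"]: dict(tok) for tok in tokens}  # work on copies
    for tok in by_id.values():
        head = by_id.get(tok["head"])
        if (tok["deprel"] == "fixed" and tok["form"].lower() == "een"
                and head is not None and head["form"].lower() == "wat"
                and head["deprel"] == "det"):
            tok["head"] = head["head"]      # 'een' now depends on the noun
            tok["deprel"] = "det"
            head["deprel"] = "det:predet"   # 'wat' keeps the noun as its head
    return list(by_id.values())


for tok in fixed_to_predet(SPLIT):
    print(tok["id"], tok["form"], tok["head"], tok["deprel"])
```

In the converted split sentence, 'wat' and 'een' both attach to 'boeken' (as `det:predet` and `det`), so no discontinuous `fixed` remains. The sentence without 'een + N' comes back unchanged, with 'wat' still an `obl`.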

rueter commented 8 months ago

It would be interesting to know to what extent 'traditional sentence' types (statement, question, command and exclamation) are represented in UD and how they are annotated.

wat zijn jullie een knappe kinderen
what are you a smart kids

It would seem that this is an exclamation, and I would think that perhaps such a sentence-initial marker wat might work as an ADV or PART with an advmod connection to the predicate. This, of course, may not be intuitive if there seems to be a direct affinity to the phrase adjacent to the sentence-initial wat.

As an exclamation marker, it would be analogous to the question marker kas in Estonian, which is tagged ADV and attached to the clause head with advmod.

gossebouma commented 8 months ago

The fact that sentence initial 'wat' triggers an exclamative interpretation is what the examples have in common, and indeed it is this effect that could make you think there is a single phenomenon at work here.

Sentence types are of course not annotated in UD, but there has been a bit of discussion about this recently, i.e. adding them might make the corpora more useful for typology. A recent Dagstuhl seminar discussed this issue in some detail (see WG 9, p. 57, for an overview of proposed types).

jnivre commented 7 months ago

The "fixed" analysis doesn't seem quite right to me, simply because it doesn't seem to be a fixed expression, but rather a semi-productive construction (in the construction grammar sense) with very restricted applicability. This is assuming that all the different examples discussed are instances of the same use of "wat". UD doesn't have a good mechanism for representing these "almost fixed" expressions, and I think many treebanks over-extend the "fixed" relation as the closest approximation. I know this is the case for our Swedish treebank (although not for this specific type of example, as far as I can remember).