UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

ADJs with Number #525

Closed nschneid closed 1 month ago

nschneid commented 2 months ago

Browsing the comparison of English treebanks, I noticed something odd: a handful of ADJs in EWT, and lots in GUM and GENTLE, have Number=Sing. LinES, PUD, ParTUT as well. I suppose this arose from a bug at some point in a pipeline.

amir-zeldes commented 2 months ago

This is due to pos=NNP and the like, see amir-zeldes/gum#186 for discussion

nschneid commented 2 months ago

Also some ADVs in GUM and one in EWT.
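For anyone who wants to list these tokens, something like the following could work. It is a minimal sketch in plain Python, not the script behind the treebank comparison: the function name `find_number_on` is made up here, it assumes standard tab-separated 10-column CoNLL-U, and the sample rows are copied from examples quoted elsewhere in this thread.

```python
def find_number_on(conllu_text, upos_tags=("ADJ", "ADV")):
    """Yield (sent_id, token_id, form, upos, feats) for tokens whose UPOS is
    in upos_tags and whose FEATS column contains a Number feature."""
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
            continue
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip anything that is not a well-formed token line
        tok_id, form, _lemma, upos, _xpos, feats = cols[:6]
        if upos not in upos_tags or feats == "_":
            continue
        feat_names = {f.split("=", 1)[0] for f in feats.split("|")}
        if "Number" in feat_names:
            yield sent_id, tok_id, form, upos, feats


# Sample rows rebuilt with tabs from tokens quoted in this thread.
sample = "\n".join([
    "# sent_id = w04006023",
    "\t".join(["1", "Historian", "historian", "ADJ", "NN",
               "Number=Sing", "2", "amod", "2:amod", "_"]),
    "",
    "# sent_id = weblog-blogspot.com_dakbangla_20050311135387_ENG_20050311_135387-0169",
    "\t".join(["21", "author", "author", "NOUN", "NN",
               "Number=Sing", "22", "compound", "22:compound", "_"]),
])
hits = list(find_number_on(sample))
print(hits)
```

Only the ADJ row is reported; the NOUN row carries Number=Sing legitimately and is skipped.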

AngledLuffa commented 1 week ago

Looked into this regarding PUD. There are 4 cases.

Weirdest is

# newdoc id = w04006
# sent_id = w04006023
# text = Historian David Crouch suggests that Stephen abandoned from the challenge around this time to focus on other issues.
1       Historian       historian       ADJ     NN      Number=Sing     2       amod    2:amod  _

so in this one, the job has become an ADJ! I don't like that. I found multiple examples of author in a similar role in PTB:

    (NP-SBJ (NN baseball) (NN author) (NNP Lawrence) (NNP Ritter) )
      (NP (NN Author) (NNP Dashiell) (NNP Hammett) )
          (NP (NN author) (NNP William) (NNP Buckley) )))

There is however a similar example in EWT with ADJ:

# sent_id = newsgroup-groups.google.com_humanities.lit.authors.shakespeare_0018a7697318f71f_ENG_20031006_163200-0023
# text = Historian John Stow dies: April 6, 1605 Sat/Wed.
1       Historian       historian       ADJ     JJ      Degree=Pos      2       amod    2:amod  _

then again, also from EWT

# sent_id = weblog-blogspot.com_dakbangla_20050311135387_ENG_20050311_135387-0169
# text = A report by the Center for Disease Control of interviews with AMI employees (as well as detailed interviews by author Leonard Cole) supports the
21      author  author  NOUN    NN      Number=Sing     22      compound        22:compound     _
22      Leonard Leonard PROPN   NNP     Number=Sing     19      nmod    19:nmod:by      _
23      Cole    Cole    PROPN   NNP     Number=Sing     22      flat    22:flat SpaceAfter=No

# sent_id = reviews-127252-0002
# newpar id = reviews-127252-p0002
# text = I've had writer friends describe horror stories with their printers.
4       writer  writer  NOUN    NN      Number=Sing     5       compound        5:compound      _
5       friends friend  NOUN    NNS     Number=Plur     3       obj     3:obj|6:nsubj:xsubj     _

so my interpretation is that someone's profession as a title should be a NOUN, not an ADJ

Others are

# newdoc id = n01031
# sent_id = n01031005
# text = Researchers have been investigating potential for male hormonal contraceptives for around 20 years.
7       male    male    ADJ     NN      Number=Sing     9       amod    9:amod  _

this follows male cats from EWT which is tagged ADJ with Degree=Pos

# sent_id = n01050014
# text = It's possible to have normal hemoglobin levels, but to have low iron stores overall, says Canadian Blood Services (CBS).
19      Canadian        Canadian        ADJ     NNP     Number=Sing     21      amod    21:amod _
20      Blood   Blood   PROPN   NNP     Number=Sing     21      compound        21:compound     _
21      Services        Services        PROPN   NNPS    Number=Plur     18      nsubj   18:nsubj        _

similar to Canadian Immigration Lawyers, also Degree=Pos

and then

# sent_id = w01045003
# text = After the discovery of America by Christopher Columbus in 1492, the Spanish term Antillas applied to the lands
13      Spanish Spanish ADJ     NNP     Number=Sing     14      amod    14:amod _
14      term    term    NOUN    NN      Number=Sing     16      nsubj   16:nsubj        _
15      Antillas        Antillas        PROPN   NNP     Number=Sing     14      appos   14:appos        _

This one I'm a little unclear on. Is this not a case of Spanish being used as a noun? I think this should also be tagged NOUN as opposed to ADJ. Compare to this other example from PUD

# sent_id = w05006058
# text = On the other hand, external history contains references to the history of Spanish speakers
14      Spanish Spanish PROPN   NNP     Number=Sing     15      compound        15:compound     _
15      speakers        speaker NOUN    NNS     Number=Plur     12      nmod    12:nmod:of      SpaceAfter=No

but maybe Spanish term becomes an ADJ usage?

AngledLuffa commented 1 week ago

Incidentally, what is the genesis of the tags in PUD? Is it kosher to change the XPOS when they are wrong? (male_NN contraceptives)
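Cases like male_NN could be surfaced by checking each token's UPOS against the XPOS tags usually paired with it. A rough sketch, assuming tab-separated CoNLL-U token lines; the `EXPECTED_XPOS` table is a hand-written, deliberately incomplete illustration and not an official UD, EWT, or PUD conversion table:

```python
# Hypothetical, partial mapping from UPOS to the PTB tags usually paired with it.
EXPECTED_XPOS = {
    "ADJ": {"JJ", "JJR", "JJS"},
    "NOUN": {"NN", "NNS"},
    "PROPN": {"NNP", "NNPS"},
}


def upos_xpos_mismatches(token_lines, expected=EXPECTED_XPOS):
    """Return (form, upos, xpos) for tokens whose XPOS is outside the
    expected set for their UPOS. UPOS values not in the table are ignored."""
    mismatches = []
    for line in token_lines:
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a token line
        form, upos, xpos = cols[1], cols[3], cols[4]
        if upos in expected and xpos not in expected[upos]:
            mismatches.append((form, upos, xpos))
    return mismatches


# Rows rebuilt with tabs from tokens quoted in this thread.
rows = [
    "\t".join(["7", "male", "male", "ADJ", "NN",
               "Number=Sing", "9", "amod", "9:amod", "_"]),
    "\t".join(["1", "Historian", "historian", "ADJ", "JJ",
               "Degree=Pos", "2", "amod", "2:amod", "_"]),
]
print(upos_xpos_mismatches(rows))
```

This only tests internal consistency: the EWT "Historian" row (ADJ/JJ) passes, so a check like this would not catch a token where both tags are wrong in the same direction.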

nschneid commented 1 week ago

"historian" as ADJ is an error. I suspect a tagger assigned it based on -ian ending, which can appear on adjectives.

"Spanish" seems correct as PROPN when naming the language and as ADJ when used as a property ('pertaining to Spain'). Geopolitical, ethnic, and religious identities often give rise to proper adjectives.

I would go with:

Spain: PROPN
Spaniard: PROPN
Spanish: PROPN if denoting the language, ADJ otherwise

Canada: PROPN
Canadian: PROPN if denoting a person from Canada, ADJ otherwise

French: PROPN for the language and "the French", ADJ otherwise
Frenchman, Francophone: PROPN

(There are frameworks where an NP can be derived from an adjective head in the syntax, so even "the French" would be an adjective, but that seems like a stretch for UD.)
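The proposal above can be written down as a small lookup. This is only a sketch of the rule as stated in this comment: the function name `proposed_upos` and the sense labels ("language", "person", "people", "property") are invented for illustration, and deciding which sense a given token has still takes an annotator.

```python
def proposed_upos(lemma, sense):
    """UPOS under the proposal in this comment.

    sense is one of the hypothetical labels 'language', 'person',
    'people' (as in "the French"), or 'property'.
    """
    always_propn = {"Spain", "Spaniard", "Canada", "Frenchman", "Francophone"}
    if lemma in always_propn:
        return "PROPN"
    # Senses for which the word is PROPN; every other sense is ADJ.
    propn_senses = {
        "Spanish": {"language"},
        "Canadian": {"person"},
        "French": {"language", "people"},
    }
    if lemma in propn_senses:
        return "PROPN" if sense in propn_senses[lemma] else "ADJ"
    raise KeyError(f"no rule stated for {lemma!r}")


print(proposed_upos("Spanish", "language"))   # PROPN
print(proposed_upos("Canadian", "property"))  # ADJ
print(proposed_upos("French", "people"))      # PROPN
```

Words not covered by the comment raise an error instead of guessing a tag.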

nschneid commented 1 week ago

"Are you male or female?" (no article) suggests "male" and "female" can be adjectives.

No idea about PUD but in EWT we do fix xpos errors.

AngledLuffa commented 1 week ago

SGTM. I'll merge that change then, since that PR and your recommendations match. I'll also submit a PR for EWT's "historian" token

AngledLuffa commented 1 week ago

Hmm, suddenly I'm less convinced about Spanish term Antillas in PUD upon trying to rearrange the dependencies to match retagging. It really feels like term or maybe Antillas wants to be the head. Here's the current parsing

12      the     the     DET     DT      Definite=Def|PronType=Art       14      det     14:det  _
13      Spanish Spanish ADJ     NNP     Number=Sing     14      amod    14:amod _
14      term    term    NOUN    NN      Number=Sing     16      nsubj   16:nsubj        _
15      Antillas        Antillas        PROPN   NNP     Number=Sing     14      appos   14:appos        _
16      applied apply   VERB    VBD     Mood=Ind|Tense=Past|VerbForm=Fin        0       root    0:root  _
17      to      to      ADP     IN      _       19      case    19:case _
18      the     the     DET     DT      Definite=Def|PronType=Art       19      det     19:det  _
19      lands   land    NOUN    NNS     Number=Plur     16      obl     16:obl:to       SpaceAfter=No

So term is the head. I suppose we could make the dependency an nmod from Spanish to term and keep term the head of that phrase

nschneid commented 1 week ago

appos is correct: "the Spanish term" and "Antillas" are two full noun phrases that have the same referent and can be swapped.

Within "the Spanish term", "term" is correct as the head. If "Spanish" is tagged as PROPN then it should attach as compound. I don't know if GUM or EWT has a precedent for a language name as attributive modifier ("the French language", "a German word" etc.). Usually language names are nominal heads.
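Mechanically, the PROPN option would touch three columns of the token line: UPOS, DEPREL, and the matching entry in DEPS. The sketch below assumes a tab-separated 10-column CoNLL-U line; `retag_as_propn_compound` is a hypothetical helper and not part of any UD tooling. It shows the edit only and does not settle whether PROPN is the right tag.

```python
def retag_as_propn_compound(token_line):
    """Return the token line with UPOS=PROPN and amod replaced by compound
    in both the basic and the enhanced dependency columns."""
    cols = token_line.split("\t")
    if len(cols) != 10:
        raise ValueError("expected a 10-column CoNLL-U token line")
    head = cols[6]
    cols[3] = "PROPN"      # UPOS
    cols[7] = "compound"   # DEPREL
    # Rewrite only the enhanced dependency that mirrors the basic one.
    deps = cols[8].split("|") if cols[8] != "_" else []
    cols[8] = "|".join(
        f"{head}:compound" if d == f"{head}:amod" else d for d in deps
    ) or "_"
    return "\t".join(cols)


# Token 13 from the PUD sentence quoted above, rebuilt with tabs.
before = "\t".join(["13", "Spanish", "Spanish", "ADJ", "NNP",
                    "Number=Sing", "14", "amod", "14:amod", "_"])
after = retag_as_propn_compound(before)
print(after.split("\t"))
```

The head (14, "term") and the appos on "Antillas" are left untouched, in line with the analysis above.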

AngledLuffa commented 1 week ago

I didn't do an exhaustive search over languages, but I didn't find any other examples. I can make it a compound

Edit: but I suppose that means we need to be happy with Spanish_PROPN. Does that sound right?

nschneid commented 1 week ago

This example from the PTB guidelines suggests that languages are always PROPN even if attributive:

[image: example from the PTB guidelines]