UniversalDependencies / UD_Coptic-Scriptorium

Separation between named and non-named entities ? #8

Open thangld201 opened 3 months ago

thangld201 commented 3 months ago

Hi @dan-zeman, do the entity labels distinguish between named and non-named entities?

dan-zeman commented 3 months ago

I think this corpus has named entities. But this question is really for @amir-zeldes.

amir-zeldes commented 2 months ago

This corpus has both named and non-named entities, as well as entity linking (Wikification) for some of the named entities (the ones that have corresponding Wikipedia articles). Basically, any nominal mention that is not a personal pronoun has a typed span with one of the 10 entity types used in the corpus, including nested spans. So to use an English example, we would have spans for the following (notice pronouns are ignored):

He told me, that is [Apa Paphnoutius]PERSON_Paphnutius_of_Thebes, that he lived on [the holy mountain near [the city of Rakote]PLACE_Alexandria]PLACE

You can tell if an entity is named based on whether the head token (the one with an incoming dependency from outside the span) is tagged PROPN.

BTW there is a lot more NER tagged data for Coptic without gold UD trees here:

https://github.com/CopticScriptorium/corpora

Some of it is manually corrected and some of it is automatic (see metadata).

thangld201 commented 2 months ago

Thanks @amir-zeldes. So by definition, did you treat all entities without PROPN tag as non-named entities? I think this is true for PERSON, PLACE and ORG, but it is very likely different for TIME (I pasted a sample below), etc.

Also, I am not sure how to determine the head token of an entity span. Is it fine to just take the first token in an entity span? Or do you have other suggestions? I am not familiar with the data, so I'm not sure how to detect the token with an incoming dependency from outside the span.

1-2 ϯϣⲡϩⲙⲟⲧ _   _   _   _   _   _   _   _
1   ϯ   ⲁⲛⲟⲕ    PRON    PPERS   Definite=Def|Number=Sing|Person=1|PronType=Prs  2   nsubj   _   _
2   ϣⲡϩⲙⲟⲧ  ϣⲡϩⲙⲟⲧ  VERB    V   VerbForm=Fin    0   root    _   Morphs=ϣⲡ-ϩⲙⲟⲧ
3-5 ⲛⲧⲙⲡⲁⲛⲟⲩⲧⲉ  _   _   _   _   _   _   _   _
3   ⲛⲧⲙ ⲛⲧⲛ ADP PREP    _   5   case    _   _
4   ⲡⲁ  ⲡⲁ  DET PPOS    Definite=Def|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs   5   det _   Entity=(person
5   ⲛⲟⲩⲧⲉ   ⲛⲟⲩⲧⲉ   NOUN    N   _   2   obl _   Entity=person)
6-7 ⲛⲟⲩⲟⲉⲓϣ _   _   _   _   _   _   _   _
6   ⲛ   ⲛ   ADP PREP    _   7   case    _   _
7   ⲟⲩⲟⲉⲓϣ  ⲟⲩⲟⲉⲓϣ  NOUN    N   _   2   obl _   Entity=(time
8   ⲛⲓⲙ ⲛⲓⲙ PRON    PINT    PronType=Ind    7   det _   Entity=time)
9-10    ϩⲁⲣⲱⲧⲛ  _   _   _   _   _   _   _   _
9   ϩⲁⲣⲱ    ϩⲁ  ADP PREP    _   10  case    _   _
10  ⲧⲛ  ⲛⲧⲱⲧⲛ   PRON    PPERO   Definite=Def|Number=Plur|Person=2|PronType=Prs  2   obl _   _
11  ⲉϩⲣⲁⲓ   ⲉϩⲣⲁⲓ   ADV ADV _   2   advmod  _   _
12-14   ⲉϫⲛⲧⲉⲭⲁⲣⲓⲥ  _   _   _   _   _   _   _   _
12  ⲉϫⲛ ⲉϫⲛ ADP PREP    _   14  case    _   _
13  ⲧⲉ  ⲡ   DET ART Definite=Def|Gender=Fem|Number=Sing|PronType=Art    14  det _   Entity=(abstract
14  ⲭⲁⲣⲓⲥ   ⲭⲁⲣⲓⲥ   NOUN    N   Foreign=Yes 2   obl _   OrigLang=grc
15-17   ⲙⲡⲛⲟⲩⲧⲉ _   _   _   _   _   _   _   _
15  ⲙ   ⲛ   ADP PREP    _   17  case    _   _
16  ⲡ   ⲡ   DET ART Definite=Def|Gender=Masc|Number=Sing|PronType=Art   17  det _   Entity=(person
17  ⲛⲟⲩⲧⲉ   ⲛⲟⲩⲧⲉ   NOUN    N   _   14  nmod    _   Entity=person)
18  ⲧⲁⲓ ⲡⲁⲓ DET PDEM    Definite=Def|Gender=Fem|Number=Sing|PronType=Dem    14  appos   _   _
19-23   ⲉⲛⲧⲁⲩⲧⲁⲁⲥ   _   _   _   _   _   _   _   _
19  ⲉⲛⲧ ⲉⲧⲉⲣⲉ   SCONJ   CREL    _   22  mark    _   _
20  ⲁ   ⲁ   AUX APST    _   22  aux _   _
21  ⲩ   ⲛⲧⲟⲟⲩ   PRON    PPERS   Definite=Def|Number=Plur|Person=3|PronType=Prs  22  nsubj   _   _
22  ⲧⲁⲁ ϯ   VERB    V   VerbForm=Fin    18  acl:relcl   _   _
23  ⲥ   ⲛⲧⲟⲥ    PRON    PPERO   Definite=Def|Gender=Fem|Number=Sing|Person=3|PronType=Prs   22  obj _   _
24-25   ⲛⲏⲧⲛ    _   _   _   _   _   _   _   _
24  ⲛⲏ  ⲛⲁ  ADP PREP    _   25  case    _   _
25  ⲧⲛ  ⲛⲧⲱⲧⲛ   PRON    PPERO   Definite=Def|Number=Plur|Person=2|PronType=Prs  22  obl _   _
26-28   ϩⲙⲡⲉⲭⲣⲓⲥⲧⲟⲥ _   _   _   _   _   _   _   _
26  ϩⲙ  ϩⲛ  ADP PREP    _   28  case    _   _
27  ⲡⲉ  ⲡ   DET ART Definite=Def|Gender=Masc|Number=Sing|PronType=Art   28  det _   Entity=(person-Jesus
28  ⲭⲣⲓⲥⲧⲟⲥ ⲭⲣⲓⲥⲧⲟⲥ NOUN    N   Foreign=Yes 22  obl _   OrigLang=grc
29  ⲓⲏⲥⲟⲩⲥ  ⲓⲏⲥⲟⲩⲥ  PROPN   NPROP   Foreign=Yes 28  appos   _   Entity=person-Jesus)abstract)|OrigLang=he
30  .   .   PUNCT   PUNCT   _   2   punct   _   _
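
The `Entity=` values in the last (MISC) column of the sample above appear to use an opening `(label` on the first token of a span and a closing `label)` on the last token, with several closings allowed on one token (as on token 29). A minimal sketch of turning those values into spans follows. The function name `entity_spans`, the `(label)` form for single-token spans, and the input shape (a list of `(token_id, misc)` pairs with multiword-token range lines such as `3-5` already skipped) are assumptions for illustration, not part of the corpus tooling.

```python
import re

# One Entity item is "(label)" (single-token span, assumed form),
# "(label" (span opens here) or "label)" (span closes here).
ITEM = re.compile(r"\(([^()]+)\)|\(([^()]+)|([^()]+)\)")

def entity_spans(misc_by_id):
    """Return (label, first_id, last_id) tuples for one sentence.

    misc_by_id: list of (token_id, misc_string) in sentence order.
    """
    spans, open_stack = [], []
    for tok_id, misc in misc_by_id:
        for field in misc.split("|"):
            if not field.startswith("Entity="):
                continue
            for m in ITEM.finditer(field[len("Entity="):]):
                single, opened, closed = m.groups()
                if single:
                    spans.append((single, tok_id, tok_id))
                elif opened:
                    open_stack.append((opened, tok_id))
                else:
                    # Close the most recently opened span with this label.
                    for i in range(len(open_stack) - 1, -1, -1):
                        if open_stack[i][0] == closed:
                            label, start = open_stack.pop(i)
                            spans.append((label, start, tok_id))
                            break
    return spans

# MISC values copied from the sample sentence (tokens without Entity omitted).
sample = [
    (4, "Entity=(person"), (5, "Entity=person)"),
    (7, "Entity=(time"), (8, "Entity=time)"),
    (13, "Entity=(abstract"),
    (16, "Entity=(person"), (17, "Entity=person)"),
    (27, "Entity=(person-Jesus"),
    (29, "Entity=person-Jesus)abstract)|OrigLang=he"),
]
print(entity_spans(sample))
# [('person', 4, 5), ('time', 7, 8), ('person', 16, 17),
#  ('person-Jesus', 27, 29), ('abstract', 13, 29)]
```

The label is kept as written (for example `person-Jesus`); splitting it into an entity type and a linked identifier is left out because the exact label convention is not spelled out in this thread.
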
amir-zeldes commented 2 months ago

> did you treat all entities without PROPN tag as non-named entities?

Yes, that's correct - all mentions of any referring expressions are annotated, except for personal pronouns.

> very likely different for TIME (I pasted a sample below)

No, times are no exception, and indeed the example you have above has a non-named time entity, marked with brackets in the translation here:

Named times are rarer, but you can find some, mostly month names or holiday names. You can find examples using our ANNIS search interface and the Coptic xpos tag N/NPROP:

> I am not sure how to determine the head token of an entity span. Is it fine to just take the first token in an entity span?

No, the first token is usually an article like "the" or "a", so it's not a good way to find named entities. Finding the head is pretty deterministic though, just do this:

  1. For each entity annotation, find its first and last tokens in the sentence
  2. For each token in that span from left to right, check the HEAD column (the seventh CoNLL-U column, i.e. index 6 when counting from zero). If it's 0, or a number smaller than the first token ID in the span, or a number larger than the last token ID in the span, that token is the head. If not, keep traversing the tokens.
  3. If that token is tagged PROPN, the entity is named, otherwise not.
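
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `span_head` and `is_named` are hypothetical names, token IDs are plain integers (multiword-token range lines such as `3-5` are assumed to be skipped when reading the file), and `heads` and `upos` are dictionaries built from the ID, HEAD and UPOS columns.

```python
def span_head(heads, first, last):
    """Return the ID of the first token in [first, last] whose HEAD is 0
    or lies outside the span, or None if no such token is found."""
    for tok_id in range(first, last + 1):
        head = heads[tok_id]
        if head == 0 or head < first or head > last:
            return tok_id
    return None

def is_named(heads, upos, first, last):
    """An entity span counts as named if its head token is tagged PROPN."""
    head_id = span_head(heads, first, last)
    return head_id is not None and upos[head_id] == "PROPN"

# HEAD and UPOS values for tokens 6-8 of the sample sentence in this thread.
heads = {6: 7, 7: 2, 8: 7}
upos = {6: "ADP", 7: "NOUN", 8: "PRON"}

# The time span covers tokens 7-8; token 7 attaches to token 2, outside the span.
print(span_head(heads, 7, 8))       # 7
print(is_named(heads, upos, 7, 8))  # False
```

On the sample sentence this reports the time entity as non-named, in line with the answer above.
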

This algorithm always terminates due to the properties of dependency trees (a span is either dominated from outside or it contains the root, with head 0). It's also nearly always correct, with very rare exceptions in cases of interrupted phrases. This illustration may explain the intuition better:

image

Since "ran" is outside the entity span, the token which has it as head, "fox", is the head of "the big fox".

Another algorithm which could work for your use case is to find the smallest entity span surrounding each PROPN token and assume that only these spans are named. This will have a nearly identical result.
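
A rough sketch of this second approach, assuming the entity spans have already been extracted as `(label, first_id, last_id)` tuples and that `upos` maps integer token IDs to UPOS tags. The function name `named_spans` and both input shapes are hypothetical choices for illustration.

```python
def named_spans(spans, upos):
    """Return the set of spans that are the smallest span around some PROPN token.

    spans: iterable of (label, first_id, last_id) tuples for one sentence.
    upos:  dict mapping token ID -> UPOS tag.
    """
    spans = list(spans)
    named = set()
    for tok_id, tag in upos.items():
        if tag != "PROPN":
            continue
        # All spans that contain this PROPN token.
        covering = [s for s in spans if s[1] <= tok_id <= s[2]]
        if covering:
            # Keep only the shortest one.
            named.add(min(covering, key=lambda s: s[2] - s[1]))
    return named

# Spans and tags for tokens 27-29 of the sample sentence in this thread,
# plus the enclosing abstract span that starts at token 13.
spans = [("person-Jesus", 27, 29), ("abstract", 13, 29)]
upos = {27: "DET", 28: "NOUN", 29: "PROPN"}
print(named_spans(spans, upos))  # {('person-Jesus', 27, 29)}
```

Only the innermost span around the PROPN token is kept, so the enclosing abstract span is not marked as named.
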