Open thangld201 opened 3 months ago
I think this corpus has named entities. But this question is really for @amir-zeldes.
This corpus has both named and non-named entities, as well as entity linking (Wikification) for some of the named entities (the ones that have corresponding Wikipedia articles). Basically, any nominal mention that is not a personal pronoun has a typed span with one of the 10 entity types used in the corpus, including nested spans. So to use an English example, we would have spans for the following (notice pronouns are ignored):
He told me, that is [Apa Paphnoutius]PERSON_Paphnutius_of_Thebes, that he lived on [the holy mountain near [the city of Rakote]PLACE_Alexandria]PLACE
You can tell if an entity is named based on whether the head token (the one with an incoming dependency from outside the span) is tagged PROPN.
BTW there is a lot more NER tagged data for Coptic without gold UD trees here:
https://github.com/CopticScriptorium/corpora
Some of it is manually corrected and some of it is automatic (see metadata).
Thanks @amir-zeldes, so by definition, did you treat all entities without PROPN tag as non-named entities? I think this is true for PERSON, PLACE and ORG but very likely different for TIME (I pasted a sample below), etc.
Also, I am not sure how to determine the head token of an entity span .. Is it fine to just take the first token in an entity span ? Or do you have other suggestions ? I am not familiar with the data so I'm not sure how to detect the one with incoming dependency from outside the span
....
1-2 ϯϣⲡϩⲙⲟⲧ _ _ _ _ _ _ _ _
1 ϯ ⲁⲛⲟⲕ PRON PPERS Definite=Def|Number=Sing|Person=1|PronType=Prs 2 nsubj _ _
2 ϣⲡϩⲙⲟⲧ ϣⲡϩⲙⲟⲧ VERB V VerbForm=Fin 0 root _ Morphs=ϣⲡ-ϩⲙⲟⲧ
3-5 ⲛⲧⲙⲡⲁⲛⲟⲩⲧⲉ _ _ _ _ _ _ _ _
3 ⲛⲧⲙ ⲛⲧⲛ ADP PREP _ 5 case _ _
4 ⲡⲁ ⲡⲁ DET PPOS Definite=Def|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs 5 det _ Entity=(person
5 ⲛⲟⲩⲧⲉ ⲛⲟⲩⲧⲉ NOUN N _ 2 obl _ Entity=person)
6-7 ⲛⲟⲩⲟⲉⲓϣ _ _ _ _ _ _ _ _
6 ⲛ ⲛ ADP PREP _ 7 case _ _
7 ⲟⲩⲟⲉⲓϣ ⲟⲩⲟⲉⲓϣ NOUN N _ 2 obl _ Entity=(time
8 ⲛⲓⲙ ⲛⲓⲙ PRON PINT PronType=Ind 7 det _ Entity=time)
9-10 ϩⲁⲣⲱⲧⲛ _ _ _ _ _ _ _ _
9 ϩⲁⲣⲱ ϩⲁ ADP PREP _ 10 case _ _
10 ⲧⲛ ⲛⲧⲱⲧⲛ PRON PPERO Definite=Def|Number=Plur|Person=2|PronType=Prs 2 obl _ _
11 ⲉϩⲣⲁⲓ ⲉϩⲣⲁⲓ ADV ADV _ 2 advmod _ _
12-14 ⲉϫⲛⲧⲉⲭⲁⲣⲓⲥ _ _ _ _ _ _ _ _
12 ⲉϫⲛ ⲉϫⲛ ADP PREP _ 14 case _ _
13 ⲧⲉ ⲡ DET ART Definite=Def|Gender=Fem|Number=Sing|PronType=Art 14 det _ Entity=(abstract
14 ⲭⲁⲣⲓⲥ ⲭⲁⲣⲓⲥ NOUN N Foreign=Yes 2 obl _ OrigLang=grc
15-17 ⲙⲡⲛⲟⲩⲧⲉ _ _ _ _ _ _ _ _
15 ⲙ ⲛ ADP PREP _ 17 case _ _
16 ⲡ ⲡ DET ART Definite=Def|Gender=Masc|Number=Sing|PronType=Art 17 det _ Entity=(person
17 ⲛⲟⲩⲧⲉ ⲛⲟⲩⲧⲉ NOUN N _ 14 nmod _ Entity=person)
18 ⲧⲁⲓ ⲡⲁⲓ DET PDEM Definite=Def|Gender=Fem|Number=Sing|PronType=Dem 14 appos _ _
19-23 ⲉⲛⲧⲁⲩⲧⲁⲁⲥ _ _ _ _ _ _ _ _
19 ⲉⲛⲧ ⲉⲧⲉⲣⲉ SCONJ CREL _ 22 mark _ _
20 ⲁ ⲁ AUX APST _ 22 aux _ _
21 ⲩ ⲛⲧⲟⲟⲩ PRON PPERS Definite=Def|Number=Plur|Person=3|PronType=Prs 22 nsubj _ _
22 ⲧⲁⲁ ϯ VERB V VerbForm=Fin 18 acl:relcl _ _
23 ⲥ ⲛⲧⲟⲥ PRON PPERO Definite=Def|Gender=Fem|Number=Sing|Person=3|PronType=Prs 22 obj _ _
24-25 ⲛⲏⲧⲛ _ _ _ _ _ _ _ _
24 ⲛⲏ ⲛⲁ ADP PREP _ 25 case _ _
25 ⲧⲛ ⲛⲧⲱⲧⲛ PRON PPERO Definite=Def|Number=Plur|Person=2|PronType=Prs 22 obl _ _
26-28 ϩⲙⲡⲉⲭⲣⲓⲥⲧⲟⲥ _ _ _ _ _ _ _ _
26 ϩⲙ ϩⲛ ADP PREP _ 28 case _ _
27 ⲡⲉ ⲡ DET ART Definite=Def|Gender=Masc|Number=Sing|PronType=Art 28 det _ Entity=(person-Jesus
28 ⲭⲣⲓⲥⲧⲟⲥ ⲭⲣⲓⲥⲧⲟⲥ NOUN N Foreign=Yes 22 obl _ OrigLang=grc
29 ⲓⲏⲥⲟⲩⲥ ⲓⲏⲥⲟⲩⲥ PROPN NPROP Foreign=Yes 28 appos _ Entity=person-Jesus)abstract)|OrigLang=he
30 . . PUNCT PUNCT _ 2 punct _ _
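A minimal sketch of how the Entity= annotations in the last column of a sample like the one above could be turned into spans. It assumes tab-separated CoNLL-U lines (the paste above shows spaces) and that a single-token entity would be written as (type), which the sample does not show. entity_spans is a hypothetical helper name, not part of any existing tool.

```python
import re

# One alternative per bracket shape: "(type)", "(type", "type)".
_PIECE = re.compile(r"\(([^()]+)\)|\(([^()]+)|([^()]+)\)")

def entity_spans(conllu_lines):
    """Return (label, first_id, last_id) for each Entity span in one sentence."""
    spans, open_stack = [], []
    for line in conllu_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip comments, blank lines, multiword ranges (e.g. 3-5) and empty nodes.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        tok_id = int(cols[0])
        # MISC is a "|"-separated list of key=value pairs, or "_" when empty.
        misc = dict(kv.split("=", 1) for kv in cols[9].split("|") if "=" in kv)
        for single, opened, closed in _PIECE.findall(misc.get("Entity", "")):
            if single:
                spans.append((single, tok_id, tok_id))
            elif opened:
                open_stack.append((opened, tok_id))
            else:
                # Close the most recently opened span with the same label.
                for i in range(len(open_stack) - 1, -1, -1):
                    if open_stack[i][0] == closed:
                        label, start = open_stack.pop(i)
                        spans.append((label, start, tok_id))
                        break
    return spans
```

Spans are returned in the order they are closed, so a nested span comes out before the span that contains it.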
did you treat all entities without PROPN tag as non-named entities ?
Yes, that's correct - all mentions of any referring expressions are annotated, except for personal pronouns.
very likely different for TIME (I pasted a sample below)
No, times are no exception, and indeed the example you have above has a non-named time entity, marked with brackets in the translation here:
Named times are rarer, but you can find some, mostly month names or holiday names. You can find examples using our ANNIS search interface and the Coptic xpos tag N/NPROP:
I am not sure how to determine the head token of an entity span .. Is it fine to just take the first token in an entity span ?
No, the first token is usually an article like "the" or "a", so it's not a good way to find named entities. Finding the head is pretty deterministic though, just do this:
Go through the tokens in the span and check the head ID of each one. If it is 0, or a number smaller than the first token ID in the span, or a number larger than the last token ID in the span: that is the head. If not, keep traversing the tokens.
If the head token is tagged PROPN, the entity is named, otherwise not.
This algorithm always terminates due to the properties of dependency trees (a span is either dominated from outside or it contains the root, with head 0). It's also nearly always correct, with very rare exceptions in cases of interrupted phrases. This illustration maybe explains the intuition better:
Since "ran" is outside the entity span, the token which has it as head, "fox", is the head of "the big fox".
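The head-finding steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions: tokens are given as a dict from token ID to (head ID, UPOS), span bounds are inclusive token IDs, and span_head / is_named are hypothetical names.

```python
def span_head(tokens, start, end):
    """tokens: dict token ID -> (head ID, UPOS); start/end: inclusive span bounds."""
    for tok_id in range(start, end + 1):
        head, _ = tokens[tok_id]
        # Head is the root (0) or lies outside the span: this token heads the span.
        if head == 0 or head < start or head > end:
            return tok_id
    return None  # not expected for a well-formed dependency tree

def is_named(tokens, start, end):
    """An entity counts as named when its head token is tagged PROPN."""
    head = span_head(tokens, start, end)
    return head is not None and tokens[head][1] == "PROPN"

# The time span (tokens 7-8) from the sample pasted earlier in the thread.
sample = {7: (2, "NOUN"), 8: (7, "PRON")}
print(span_head(sample, 7, 8), is_named(sample, 7, 8))  # 7 False
```

For an interrupted phrase, more than one token may have its head outside the span; this sketch simply returns the first such token.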
Another algorithm which could work for your use case is to find the smallest entity span surrounding each PROPN token and assume that only these spans are named - this will have a nearly identical result.
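A rough sketch of this second approach, assuming the entity spans are already available as (label, first_id, last_id) tuples and the tokens as a dict from token ID to UPOS. named_spans_by_propn is a hypothetical name, and "smallest" is taken here to mean the fewest tokens covered.

```python
def named_spans_by_propn(tokens, spans):
    """tokens: dict token ID -> UPOS; spans: iterable of (label, start, end)."""
    spans = list(spans)
    named = set()
    for tok_id, upos in tokens.items():
        if upos != "PROPN":
            continue
        # All entity spans that contain this PROPN token.
        covering = [s for s in spans if s[1] <= tok_id <= s[2]]
        if covering:
            # Keep only the shortest one; outer spans are not treated as named.
            named.add(min(covering, key=lambda s: s[2] - s[1]))
    return named
```

A PROPN token that is not inside any entity span is skipped, and spans containing no PROPN token are never returned.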
Hi @dan-zeman, are the entity labels separated between named and non-named ones ?