amir-zeldes / xrenner

eXternally configurable REference and Non Named Entity Recognizer

documentation unclear regarding form/lemma for entities #56

Open ftyers opened 6 years ago

ftyers commented 6 years ago

I'm trying to write a rule that corefers surnames with full instances of those names. The question is: in the entities file, is it possible to indicate the lemma of the entity, or just the form?

 Однажды Пушкин написал письмо [Person Рабиндранату Тагору] . 
" Дорогой далекий друг , — писал он , — я Вас не знаю , и Вы меня не 
знаете . Очень хотелось бы познакомиться . Всего хорошего . Саша " . Когда 
письмо принесли , [Person Тагор] предавался самосозерцанию .

Relevant parts of the conllu file:

# text = Однажды Пушкин написал письмо Рабиндранату Тагору.
1       Однажды однажды ADV     _       Degree=Pos      3       advmod  _       _
2       Пушкин  Пушкин  PROPN   _       Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing   3       nsubj   _       _
3       написал писать  VERB    _       Aspect=Perf|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act  0       root    _       _
4       письмо  письмо  NOUN    _       Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing   3       obj     _       _
5       Рабиндранату    Рабиндранат     PROPN   _       Animacy=Anim|Case=Dat|Gender=Masc|Number=Sing   3       obl     _       _
6       Тагору  Тагор   PROPN   _       Animacy=Anim|Case=Dat|Gender=Masc|Number=Sing   5       flat:name       _       SpaceAfter=No
7       .       .       PUNCT   _       _       6       punct   _       SpacesAfter=\s\n

...

# sent_id = 6
# text = Когда письмо принесли, Тагор предавался самосозерцанию.
1       Когда   когда   ADV     _       Degree=Pos      3       mark    _       _
2       письмо  письмо  NOUN    _       Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing   3       nsubj   _       _
3       принесли        приносить       VERB    _       Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act      6       advcl   _       SpaceAfter=No
4       ,       ,       PUNCT   _       _       3       punct   _       _
5       Тагор   Тагор   PROPN   _       Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing   6       nsubj   _       _
6       предавался      предаваться     VERB    _       Aspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Mid   0       root    _       _
7       самосозерцанию  самосозерцание  NOUN    _       Animacy=Inan|Case=Dat|Gender=Neut|Number=Sing   6       iobj    _       SpaceAfter=No
8       .       .       PUNCT   _       _       7       punct   _       SpacesAfter=\s\n

In the entities.tab file I have:

Пушкин  person  person/male
Рабиндранат Тагор       person  person/male
Рабиндранат     person  person/male
Тагор   person  person/male
Саша    person  person/male
Саша    person  person/female

And in the entity_heads.tab file I have:

Пушкин  person  person/male
Рабиндранат     person  person/male
Тагор   person  person/male
Саша    person  person/male
Саша    person  person/female
письмо  object  object
друг    друг    person

And then in the coref_rules.tab file I have:

form="proper";form="proper"&lemma=$1;100;nopropagate

Which I think says "if the antecedent is a proper noun and the anaphor is a proper noun and the lemma is the same, reading backwards in the document, corefer them up to 100 sentences back and don't propagate any agreement features".
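To make that reading concrete, here is a minimal sketch that splits the rule into its four semicolon-separated fields. It is only an illustration of the line format as described above, not xrenner's own parser; `CorefRule` and `parse_rule` are hypothetical names.

```python
# Illustrative only: splits a coref_rules.tab line into its four fields.
# CorefRule and parse_rule are hypothetical helpers, not part of xrenner.
from typing import NamedTuple


class CorefRule(NamedTuple):
    first_spec: list    # constraints on one mention, split on "&"
    second_spec: list   # constraints on the other mention, split on "&"
    max_distance: int   # how many sentences back to search
    propagation: str    # e.g. "nopropagate"


def parse_rule(line: str) -> CorefRule:
    first, second, distance, propagation = line.strip().split(";")
    return CorefRule(first.split("&"), second.split("&"), int(distance), propagation)


rule = parse_rule('form="proper";form="proper"&lemma=$1;100;nopropagate')
print(rule.first_spec)    # constraints of the first field
print(rule.second_spec)   # constraints of the second field, including lemma=$1
print(rule.max_distance, rule.propagation)
```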

In the config.ini I have:

# Parts of speech for proper nouns
proper_pos=/PROPN/

amir-zeldes commented 6 years ago

In coref_rules.tab, lemma means the lemma of the entity, which is the lemma of its head. The rules look fine; I think the problem is the parse: in UD, the first token of a name is the head, so the lemma of the full name is Рабиндранат, not Тагор.
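A small sketch of that point, using tokens 5 and 6 from the parse quoted above and reducing them to the columns that matter. The `span_head` helper is hypothetical and only shows which token counts as the head of the name span; it is not how xrenner computes heads internally.

```python
# Tokens 5 and 6 of the first sentence, reduced to
# (id, form, lemma, head, deprel). Values are copied from the CoNLL-U above.
tokens = [
    (5, "Рабиндранату", "Рабиндранат", 3, "obl"),
    (6, "Тагору", "Тагор", 5, "flat:name"),
]


def span_head(tokens):
    """Return the token whose syntactic head lies outside the span."""
    ids = {tok[0] for tok in tokens}
    outside = [tok for tok in tokens if tok[3] not in ids]
    assert len(outside) == 1, "expected exactly one head for the span"
    return outside[0]


head = span_head(tokens)
print(head[2])  # lemma of the span head: Рабиндранат
```

With this tree, a rule that compares lemmas would compare Рабиндранат (full name) against Тагор (bare surname), so it cannot match.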

One solution is to use DepEdit to change the tree (I do this for English to make IBM the head of IBM Corp.). But for people's names, the better way is to just list them in names.tab (maybe not in entities, and definitely not in entity_heads). If something is identified as PERSON_DEF_ENTITY and is also in names.tab, xrenner will generally consider first-name-to-full-name and last-name-to-full-name matches. I'm not sure whether the lemma is checked or only the form, though, so it's possible Russian inflection will break this.

Does that solve the problem?

ftyers commented 6 years ago

If only the form is checked then yes the morphology will cause a problem. For languages with nominal morphology, it is likely to be problematic if the entities.tab/entity_heads.tab files only work on the surface form.
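To illustrate the concern with the example from this issue: the sketch below compares a bare surname against the parts of a full name, once by surface form and once by lemma. `matches_part` is a hypothetical helper written for this comment; it does not reproduce xrenner's actual names.tab logic, which may behave differently.

```python
# (form, lemma) pairs taken from the CoNLL-U in this issue.
full_name = [("Рабиндранату", "Рабиндранат"), ("Тагору", "Тагор")]
surname = ("Тагор", "Тагор")


def matches_part(full, short, use_lemma):
    """True if the short mention equals one part of the full name."""
    idx = 1 if use_lemma else 0  # compare lemmas or surface forms
    return short[idx] in {tok[idx] for tok in full}


# Dative "Тагору" differs from nominative "Тагор", so form matching fails.
print(matches_part(full_name, surname, use_lemma=False))  # False
# The lemmas are identical, so lemma matching succeeds.
print(matches_part(full_name, surname, use_lemma=True))   # True
```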

I tried changing the tree manually in the .conllu file:

# sent_id = 1
# text = Однажды Пушкин написал письмо Рабиндранату Тагору.
1       Однажды однажды ADV     _       Degree=Pos      3       advmod  _       _
2       Пушкин  Пушкин  PROPN   _       Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing   3       nsubj   _       _
3       написал писать  VERB    _       Aspect=Perf|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act  0       root    _       _
4       письмо  письмо  NOUN    _       Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing   3       obj     _       _
5       Рабиндранату    Рабиндранат     PROPN   _       Animacy=Anim|Case=Dat|Gender=Masc|Number=Sing   6       flat:name       _       _
6       Тагору  Тагор   PROPN   _       Animacy=Anim|Case=Dat|Gender=Masc|Number=Sing   3       obl     _       SpaceAfter=No
7       .       .       PUNCT   _       _       6       punct   _       SpacesAfter=\s\n

But I get the same result.
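For reference, the manual re-heading above could be scripted roughly as follows. This is a plain-Python sketch, not DepEdit, and `rehead_flat_names` is a hypothetical helper. It only touches the name parts themselves; other dependents of the old head keep their attachment.

```python
def rehead_flat_names(rows):
    """Make the last token of each flat:name group the head of the name.

    rows: list of dicts with integer 'id' and 'head' and string 'deprel'.
    Returns a modified copy; the input is left untouched.
    """
    rows = [dict(r) for r in rows]
    by_id = {r["id"]: r for r in rows}

    # Collect flat:name dependents under their current head.
    groups = {}
    for r in rows:
        if r["deprel"] == "flat:name":
            groups.setdefault(r["head"], []).append(r["id"])

    for old_id, deps in groups.items():
        new_id = max(deps)  # the last name part becomes the head
        old, new = by_id[old_id], by_id[new_id]
        # The new head inherits the old head's attachment to the sentence.
        new["head"], new["deprel"] = old["head"], old["deprel"]
        # Every other name part, including the old head, hangs off the new head.
        old["head"], old["deprel"] = new_id, "flat:name"
        for d in deps:
            if d != new_id:
                by_id[d]["head"] = new_id
    return rows


# Tokens 5 and 6 of sentence 1 as originally parsed.
original = [
    {"id": 5, "form": "Рабиндранату", "head": 3, "deprel": "obl"},
    {"id": 6, "form": "Тагору", "head": 5, "deprel": "flat:name"},
]
for row in rehead_flat_names(original):
    print(row["id"], row["form"], row["head"], row["deprel"])
```

On these two tokens the output has the same heads and relations as the hand-edited tree: token 5 attaches to 6 as flat:name, and token 6 attaches to 3 as obl.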