Open ftyers opened 6 years ago
In coref_tab.tab, lemma means the lemma of the entity, which is the lemma of the head. The rules look fine; I think the problem is the parse: in UD, the first token is the 'head', so the lemma of the full name is Рабиндранат, not Тагор.
One solution is to use DepEdit to change the tree (I do this in English to make IBM Corp. have IBM as the head). But for people's names, the better way is to just list them in names.tab (maybe not in entities, and definitely not in entity_heads). If something is identified as PERSON_DEF_ENTITY and is also in names.tab, xrenner will generally consider first-name-to-full-name and last-name-to-full-name matches. I'm not sure whether the lemma is checked or only the form, though, so it's possible Russian inflection will break this.
Does that solve the problem?
If only the form is checked, then yes, the morphology will cause a problem. For languages with nominal morphology, it is likely to be problematic if the entities.tab/entity_heads.tab files only work on the surface form.
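To make the concern concrete, here is a minimal Python sketch of form-based versus lemma-based comparison. It is not xrenner's matching code; the `Token` class and both helper functions are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str   # surface form as it appears in the text
    lemma: str  # dictionary form supplied by the parser


def same_form(a: Token, b: Token) -> bool:
    """Compare surface forms only (hypothetical helper)."""
    return a.form == b.form


def same_lemma(a: Token, b: Token) -> bool:
    """Compare lemmas only (hypothetical helper)."""
    return a.lemma == b.lemma


# The surname in the nominative and in the dative, as in the example sentence.
nominative = Token(form="Тагор", lemma="Тагор")
dative = Token(form="Тагору", lemma="Тагор")

form_match = same_form(nominative, dative)    # inflection breaks this comparison
lemma_match = same_lemma(nominative, dative)  # the lemma is case-independent
```

With these two tokens the form comparison fails and the lemma comparison succeeds, which is why a form-only lookup would miss inflected mentions.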
I tried changing the tree manually in the .conllu file:
# sent_id = 1
# text = Однажды Пушкин написал письмо Рабиндранату Тагору.
1 Однажды однажды ADV _ Degree=Pos 3 advmod _ _
2 Пушкин Пушкин PROPN _ Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing 3 nsubj _ _
3 написал писать VERB _ Aspect=Perf|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act 0 root _ _
4 письмо письмо NOUN _ Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing 3 obj _ _
5 Рабиндранату Рабиндранат PROPN _ Animacy=Anim|Case=Dat|Gender=Masc|Number=Sing 6 flat:name _ _
6 Тагору Тагор PROPN _ Animacy=Anim|Case=Dat|Gender=Masc|Number=Sing 3 obl _ SpaceAfter=No
7 . . PUNCT _ _ 6 punct _ SpacesAfter=\s\n
But I get the same result.
I'm trying to write a rule which corefers surnames with full instances of those names. The question is: in the entities file, is it possible to indicate the lemma of the entity, or just the form?
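For reference, the manual edit above could be scripted roughly as follows. This is a plain Python sketch, not DepEdit and not part of xrenner. It assumes each token is a list of the ten CoNLL-U columns and that names are attached UD-style, with the first token as head and the rest as flat:name dependents. The function name `rehead_flat_names` is made up for this example.

```python
def rehead_flat_names(rows):
    """Make the last token of each flat:name group the head.

    rows: list of 10-column CoNLL-U token rows (lists of str) for one sentence.
    Returns a modified copy; other dependents of the old head are left alone.
    """
    rows = [list(r) for r in rows]
    by_id = {r[0]: r for r in rows}

    # Group flat:name dependents by their current head ID (column 7, index 6).
    groups = {}
    for r in rows:
        if r[7] == "flat:name":
            groups.setdefault(r[6], []).append(r)

    for head_id, deps in groups.items():
        old_head = by_id[head_id]
        new_head = max(deps, key=lambda r: int(r[0]))  # last token of the name
        # The new head inherits the old head's attachment.
        new_head[6], new_head[7] = old_head[6], old_head[7]
        # The old head and any remaining name parts now depend on the new head.
        old_head[6], old_head[7] = new_head[0], "flat:name"
        for d in deps:
            if d is not new_head:
                d[6] = new_head[0]
    return rows


# UD-style input: the first name is the head, the surname is flat:name.
ud_rows = [
    ["5", "Рабиндранату", "Рабиндранат", "PROPN", "_", "_", "3", "obl", "_", "_"],
    ["6", "Тагору", "Тагор", "PROPN", "_", "_", "5", "flat:name", "_", "_"],
]
reheaded = rehead_flat_names(ud_rows)
```

After the call, token 6 (Тагору) is attached to token 3 as obl and token 5 hangs off it as flat:name, matching the hand-edited tree.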
Relevant parts of the conllu file:
In the entities.tab file I have:
And in the entity_heads.tab file I have:
And then in the coref_rules.tab file I have:
Which I think says "if the antecedent is a proper noun and the anaphor is a proper noun and the lemma is the same, reading backwards in the document, corefer them up to 100 sentences back and don't propagate any agreement features".
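Spelled out as code, the intended reading of that rule would be roughly the following. This is only a Python paraphrase of the description, not xrenner's rule engine or its rule syntax. `Mention` and `find_antecedent` are hypothetical names, and the "don't propagate agreement features" part is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    sent: int   # index of the sentence containing the mention
    pos: str    # POS tag of the mention's head
    lemma: str  # lemma of the mention's head


def find_antecedent(mentions: List[Mention], i: int,
                    max_dist: int = 100) -> Optional[int]:
    """Return the index of the closest preceding proper-noun mention
    with the same head lemma, at most max_dist sentences back."""
    anaphor = mentions[i]
    if anaphor.pos != "PROPN":
        return None
    # Read backwards through the document from the anaphor.
    for j in range(i - 1, -1, -1):
        candidate = mentions[j]
        if anaphor.sent - candidate.sent > max_dist:
            break  # earlier mentions are even further away
        if candidate.pos == "PROPN" and candidate.lemma == anaphor.lemma:
            return j
    return None


mentions = [
    Mention(sent=0, pos="PROPN", lemma="Тагор"),
    Mention(sent=1, pos="NOUN", lemma="письмо"),
    Mention(sent=2, pos="PROPN", lemma="Тагор"),
]
antecedent_index = find_antecedent(mentions, 2)
```

Here the second Тагор mention is linked back to the first one. The match depends entirely on which lemma ends up on the head of the full name, which is where the parse matters.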
In the config.ini I have: