Character Identification is an entity linking task that identifies the global entity of each personal mention in multiparty dialogue. Let a mention be a nominal referring to a person (e.g., she, mom, Judy), and an entity be a character in a dialogue. The goal is to assign each mention to its entity, who may or may not participate in the dialogue. In the following example, the mention "mom" does not refer to any of the speakers; nonetheless, it clearly refers to a specific person, Judy Geller, who could appear in some other dialogue. Identifying such mentions as real characters requires cross-document entity resolution, which makes this task challenging.
This task is a part of the Character Mining project led by the Emory NLP research group.
All personal mentions are annotated with their global entities. For the above example, the first mention "I" is annotated with its global entity, Ross Geller, the second mention "mom" is annotated with Judy Geller, and so on. Mention detection is first performed automatically and then corrected manually. The entity annotation is mostly crowdsourced, although many annotations are corrected manually by experts.
For each season, episodes 1 ~ 19 are used for training (TRN), episodes 20 ~ 21 for development (DEV), and episode 22 through the end of the season for evaluation (TST).
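The split rule above can be sketched as a small helper. This is only an illustration: the function name `split_for_episode` is hypothetical and not part of the released data or tooling, and it assumes episodes are numbered from 1 within each season.

```python
def split_for_episode(episode: int) -> str:
    """Map a 1-based episode number within a season to its dataset split.

    Hypothetical helper: the name and the error handling are assumptions;
    only the episode ranges come from the description above.
    """
    if episode < 1:
        raise ValueError("episode numbers are assumed to start at 1")
    if episode <= 19:
        return "TRN"  # episodes 1 ~ 19: training
    if episode <= 21:
        return "DEV"  # episodes 20 ~ 21: development
    return "TST"      # episode 22 onward: evaluation


if __name__ == "__main__":
    for ep in (1, 19, 20, 21, 22, 24):
        print(ep, split_for_episode(ep))
```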
Dataset | Episodes | Scenes | Utterances | Tokens | Speakers | Mentions | Entities |
---|---|---|---|---|---|---|---|
TRN | 76 | 987 | 18,789 | 262,650 | 265 | 36,385 | 628 |
DEV | 8 | 122 | 2,142 | 28,523 | 48 | 3,932 | 102 |
TST | 13 | 192 | 3,597 | 50,232 | 91 | 7,050 | 165 |
Total | 97 | 1,301 | 24,528 | 341,405 | 331 | 47,367 | 781 |
Each utterance is split into sentences, and the personal mentions in every sentence are annotated with their entities. In the example below, the utterance consists of one sentence containing four mentions. The first three mentions, "I", "mom", and "dad", are singular and refer to Ross Geller, Judy Geller, and Jack Geller, respectively. The last mention, "they", is plural and refers to both Judy Geller and Jack Geller.
{
  "utterance_id": "s01_e01_c01_u039",
  "speakers": ["Ross Geller"],
  "transcript": "I told mom and dad last night, they seemed to take it pretty well.",
  "tokens": [
    ["I", "told", "mom", "and", "dad", "last", "night", ",", "they", "seemed", "to", "take", "it", "pretty", "well", "."]
  ],
  "character_entities": [
    [[0, 1, "Ross Geller"], [2, 3, "Judy Geller"], [4, 5, "Jack Geller"], [8, 9, "Jack Geller", "Judy Geller"]]
  ]
}
Each mention is annotated by the following scheme:
[begin_index, end_index, entity(, entity)*]
- begin_index (int): the beginning token index of the mention (inclusive).
- end_index (int): the ending token index of the mention (exclusive).
- entity (str): the label of the entity.
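As a minimal sketch of how this scheme might be read, the snippet below recovers each mention's text and entity labels from the sample utterance shown earlier. The function name `resolve_mentions` is hypothetical, and the snippet assumes that `tokens` and `character_entities` are parallel lists with one entry per sentence, as in the sample.

```python
import json

# The sample utterance from above, embedded so the sketch is self-contained.
utterance = json.loads("""
{
  "utterance_id": "s01_e01_c01_u039",
  "speakers": ["Ross Geller"],
  "transcript": "I told mom and dad last night, they seemed to take it pretty well.",
  "tokens": [
    ["I", "told", "mom", "and", "dad", "last", "night", ",",
     "they", "seemed", "to", "take", "it", "pretty", "well", "."]
  ],
  "character_entities": [
    [[0, 1, "Ross Geller"], [2, 3, "Judy Geller"], [4, 5, "Jack Geller"],
     [8, 9, "Jack Geller", "Judy Geller"]]
  ]
}
""")


def resolve_mentions(utt):
    """Return a list of (mention_text, entity_labels) pairs.

    Hypothetical helper. Assumes tokens[i] and character_entities[i]
    describe the same sentence.
    """
    results = []
    for tokens, mentions in zip(utt["tokens"], utt["character_entities"]):
        # Each mention is [begin_index, end_index, entity(, entity)*].
        for begin, end, *entities in mentions:
            # begin is inclusive and end is exclusive, matching Python slicing.
            text = " ".join(tokens[begin:end])
            results.append((text, entities))
    return results


if __name__ == "__main__":
    for text, entities in resolve_mentions(utterance):
        kind = "plural" if len(entities) > 1 else "singular"
        print(f"{text!r} ({kind}): {', '.join(entities)}")
```

Because `end_index` is exclusive, the indices can be passed directly to a slice, and a mention with more than one entity label can be treated as plural.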