Open Pozdniakov opened 4 years ago
Will the existing API function spoken-text-by-character not do the trick? It returns the text spoken by each character, but not separated by acts or scenes: it is always all the text uttered by each character throughout the whole play (in JSON or CSV).
So, do we additionally need a chronological account of what characters say as your example suggests?
As for stage directions, it is a tricky question which character a stage direction can be assigned to. So I would leave them out of this API function for now. It would be nice to have this information encoded in the TEI at some point, but for the time being we don't have this information.
It is somewhat encoded in TEI for now: one can say that if
I think that it can be assigned based on whether the <stage> node is inside or outside of the <sp> node (I am not sure about the terminology; it is called a node, right?).
For example:
<stage>Входит дворник Яков.</stage>
<sp who="#yakov">
<speaker>Яков.</speaker>
<p>Что задумались, Василиса Петровна? Я пришел.</p>
</sp>
Play | Act | Character | Line | Stage_Directions | Line_with_stage_directions | Type |
---|---|---|---|---|---|---|
andreev-ne-ubiy | ДЕЙСТВИЕ ПЕРВОЕ | | | Входит дворник Яков. | | stage_direction |
<sp who="#yakov">
<speaker>Яков.</speaker>
<p>Согреешься!</p>
<stage>Вынимает из кармана полбутылки водки и небольшой кабацкий и исщербленный по краям
стаканчик, наливает и подносит Василисе Петровне.</stage>
<p>Выкушайте на здоровье, Василиса Петровна.</p>
</sp>
Play | Act | Character | Line | Stage_Directions | Line_with_stage_directions | Type |
---|---|---|---|---|---|---|
andreev-ne-ubiy | ДЕЙСТВИЕ ПЕРВОЕ | Василиса Петровна | Да вот думаю все. | (не поднимая головы и не меняя позы) | (не поднимая головы и не меняя позы) Да вот думаю все. | line |
This way seems rational to me.
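The inside/outside rule can be sketched in a few lines of Python. This is only an illustration under some assumptions: the fragment is a hypothetical, abbreviated version of the examples in this thread, the function name `classify_stage_directions` is made up, and real TEI files declare the TEI namespace, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the examples in this thread
# (the second stage direction is abbreviated). No TEI namespace here.
FRAGMENT = """
<div>
  <stage>Входит дворник Яков.</stage>
  <sp who="#yakov">
    <speaker>Яков.</speaker>
    <p>Согреешься!</p>
    <stage>Вынимает из кармана полбутылки водки.</stage>
    <p>Выкушайте на здоровье, Василиса Петровна.</p>
  </sp>
</div>
"""


def classify_stage_directions(xml_text):
    """Return (position, who, text) for every <stage>, in document order."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    results = []
    for stage in root.iter("stage"):
        # Walk up the ancestors looking for an enclosing <sp>.
        node = parent_of.get(stage)
        while node is not None and node.tag != "sp":
            node = parent_of.get(node)
        # Collapse line breaks and indentation inside the stage direction.
        text = " ".join("".join(stage.itertext()).split())
        if node is None:
            results.append(("outside", None, text))
        else:
            results.append(("inside", node.get("who"), text))
    return results


classified = classify_stage_directions(FRAGMENT)
for item in classified:
    print(item)
```

A stage direction nested deeper inside <sp> (for example inside a <p>) is still reported as "inside", because the lookup walks up all ancestors rather than checking only the direct parent.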
I think that it would be very helpful to be able to download the text of a play through the API in a table format. Something like this:
It can be either JSON or CSV, that is not a problem. In addition, there could be author, title and corpus name, but this is not necessary: the table can simply be merged with the play metadata in further analysis.
This API command would allow easy text analysis: the text can be tokenized (by words) and attributed to characters and scenes, while preserving the original sequence of lines.
What can be a problem is stage directions. Indeed, I don't have a straightforward solution for that. What makes it even more complicated is that there are different kinds of stage directions:
I have a few solutions, but all of them seem suboptimal to me. One idea is to just omit stage directions. Another one is to do it this way:
So, two types of stage directions are considered: those inside an <sp> node and those outside of <sp>. Inside stage directions come along with the character's speech; outside stage directions get their own row, so the order of the lines is preserved.
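A minimal sketch of how such rows could be produced, assuming the column names from the tables in this thread. The helper `rows_for_act`, the abbreviated act fragment, and the decision to take the character name from <speaker> (with the trailing period removed) are all assumptions for illustration, not an existing API function.

```python
import csv
import io
import xml.etree.ElementTree as ET

COLUMNS = ["Play", "Act", "Character", "Line", "Stage_Directions",
           "Line_with_stage_directions", "Type"]

# Hypothetical, abbreviated act fragment without the TEI namespace.
ACT_XML = """
<div type="act">
  <stage>Входит дворник Яков.</stage>
  <sp who="#yakov">
    <speaker>Яков.</speaker>
    <p>Согреешься!</p>
    <stage>Вынимает из кармана полбутылки водки.</stage>
    <p>Выкушайте на здоровье, Василиса Петровна.</p>
  </sp>
</div>
"""


def _text(element):
    """Text content of an element with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def rows_for_act(play, act, act_xml):
    """One row per outside <stage> and one row per <sp>, in document order."""
    root = ET.fromstring(act_xml.strip())
    rows = []
    for child in root:
        row = dict.fromkeys(COLUMNS, "")
        row.update({"Play": play, "Act": act})
        if child.tag == "stage":
            # Outside stage direction: gets its own row.
            row.update({"Stage_Directions": _text(child),
                        "Type": "stage_direction"})
            rows.append(row)
        elif child.tag == "sp":
            character, spoken, directions, combined = "", [], [], []
            # Simplification: only direct children of <sp> are inspected,
            # so a <stage> nested inside a <p> is counted as spoken text.
            for part in child:
                if part.tag == "speaker":
                    character = _text(part).rstrip(".")
                elif part.tag == "stage":
                    directions.append(_text(part))
                    combined.append(_text(part))
                else:
                    spoken.append(_text(part))
                    combined.append(_text(part))
            row.update({"Character": character,
                        "Line": " ".join(spoken),
                        "Stage_Directions": " ".join(directions),
                        "Line_with_stage_directions": " ".join(combined),
                        "Type": "line"})
            rows.append(row)
    return rows


rows = rows_for_act("andreev-ne-ubiy", "ДЕЙСТВИЕ ПЕРВОЕ", ACT_XML)

# Serialize to CSV; JSON would work just as well with json.dumps(rows).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Leaving Line_with_stage_directions empty for outside stage directions is a choice made here; it could equally repeat the stage direction text.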
This table can be easily processed to solve many basic tasks: extracting all stage directions, extracting stage directions for a specific character, extracting a character's speech with or without stage directions, calculating summary statistics for each scene, etc.
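To illustrate, here is a rough sketch of a few of these tasks on rows shaped like the tables in this thread. The two sample rows are copied from those tables (only the columns used here are included), and the function names are hypothetical.

```python
import re
from collections import Counter

# Sample rows taken from the tables in this thread (reduced column set).
ROWS = [
    {"Character": "", "Line": "",
     "Stage_Directions": "Входит дворник Яков.",
     "Type": "stage_direction"},
    {"Character": "Василиса Петровна", "Line": "Да вот думаю все.",
     "Stage_Directions": "(не поднимая головы и не меняя позы)",
     "Type": "line"},
]


def all_stage_directions(rows):
    """Every non-empty stage direction, inside or outside of a speech."""
    return [r["Stage_Directions"] for r in rows if r["Stage_Directions"]]


def stage_directions_for(rows, character):
    """Stage directions that accompany the speech of one character."""
    return [r["Stage_Directions"] for r in rows
            if r["Character"] == character and r["Stage_Directions"]]


def words_per_character(rows):
    """Number of word tokens spoken by each character (no stage directions)."""
    counts = Counter()
    for r in rows:
        if r["Type"] == "line":
            counts[r["Character"]] += len(re.findall(r"\w+", r["Line"]))
    return counts


print(all_stage_directions(ROWS))
print(stage_directions_for(ROWS, "Василиса Петровна"))
print(words_per_character(ROWS))
```

The same pattern extends to per-act or per-scene statistics by grouping on the Act column (or a Scene column, if one is added).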