Add API function to extract text for a play in table format

Pozdniakov commented 4 years ago

I think it would be very helpful to be able to download the text of a play through the API in table format, something like this:

Play	Act	Character	Line
andreev-ne-ubiy	ДЕЙСТВИЕ ПЕРВОЕ	Яков	Что задумались, Василиса Петровна? Я пришел.
andreev-ne-ubiy	ДЕЙСТВИЕ ПЕРВОЕ	Василиса Петровна	Да вот думаю все.

It can be either JSON or CSV; either is fine. The author, title and corpus name could be included as well, but that is not necessary, since they can be merged in from the play metadata during later analysis.

Such an API function would make text analysis easy: the text can be tokenized (by words) and attributed to characters and scenes while preserving the original sequence of lines.
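To illustrate the kind of analysis meant here, below is a minimal sketch that tokenizes such a table by words while keeping the character attribution and the line order. It assumes the table arrives as tab-separated text with the four columns shown above; the function name `tokenize_table` and the `line_no` field are made up for this example and are not part of any existing API.

```python
import csv
import io
import re

# Sample rows copied from the table above (tab-separated).
TABLE = (
    "Play\tAct\tCharacter\tLine\n"
    "andreev-ne-ubiy\tДЕЙСТВИЕ ПЕРВОЕ\tЯков\t"
    "Что задумались, Василиса Петровна? Я пришел.\n"
    "andreev-ne-ubiy\tДЕЙСТВИЕ ПЕРВОЕ\tВасилиса Петровна\t"
    "Да вот думаю все.\n"
)


def tokenize_table(tsv_text):
    """Return one record per word, keeping the original line order."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    tokens = []
    for line_no, row in enumerate(reader, start=1):
        # \w+ matches Unicode word characters, so Cyrillic works as is.
        for word in re.findall(r"\w+", row["Line"]):
            tokens.append(
                {
                    "line_no": line_no,  # hypothetical helper field
                    "act": row["Act"],
                    "character": row["Character"],
                    "word": word.lower(),
                }
            )
    return tokens


tokens = tokenize_table(TABLE)
for token in tokens:
    print(token["line_no"], token["character"], token["word"])
```

Because every token carries its line number, act and character, the result can be grouped by any of them later.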

What can be a problem is stage directions. I don't have a straightforward solution for them. What makes it even more complicated is that there are different kinds of stage directions:

between characters' lines (at the beginning, for example):

<stage>Входит дворник Яков.</stage>
      <sp who="#yakov">
        <speaker>Яков.</speaker>
        <p>Что задумались, Василиса Петровна? Я пришел.</p>
      </sp>

next to the character's name:

<sp who="#vasilisa_petrovna">
            <speaker>Василиса Петровна</speaker>
            <stage>(вздыхает).</stage>
            <p>Ну, давай. Думала ли я когда-нибудь, Яшенька, что буду вот так сидеть... с
            Яшей-дворником и водку пить. Много мне гадалки гадали, а такого случая ни одна угадать
            не могла. Ух, как холодно, руки, ноги болят.</p>
          </sp>

inside a character's line:

<sp who="#yakov">
        <speaker>Яков.</speaker>
        <p>Согреешься!</p>
        <stage>Вынимает из кармана полбутылки водки и небольшой кабацкий и исщербленный по краям
        стаканчик, наливает и подносит Василисе Петровне.</stage>
        <p>Выкушайте на здоровье, Василиса Петровна.</p>
      </sp>

I have a few solutions, but all of them seem suboptimal to me. One idea is to simply omit stage directions. Another is to do it this way:

Play	Act	Character	Line	Stage_Directions	Line_with_stage_directions	Type
andreev-ne-ubiy	ДЕЙСТВИЕ ПЕРВОЕ			Входит дворник Яков.		stage_direction
andreev-ne-ubiy	ДЕЙСТВИЕ ПЕРВОЕ	Яков	Что задумались, Василиса Петровна? Я пришел.			line
andreev-ne-ubiy	ДЕЙСТВИЕ ПЕРВОЕ	Василиса Петровна	Да вот думаю все.	(не поднимая головы и не меняя позы)	(не поднимая головы и не меняя позы) Да вот думаю все.	line

So, two types of stage directions are considered: inside the <sp> node and outside of it. Inside stage directions are placed in the same row as the character's speech; outside stage directions get their own row, so the order of the lines is preserved.

This table can be easily processed to solve many basic tasks: extracting all stage directions, extracting stage directions for a specific character, extracting a character's speech with or without stage directions, calculating summary statistics for each scene, etc.
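As a rough illustration of those tasks, here is a small sketch that works on the three sample rows from the extended table. The column names come from the table above; the helper functions (`all_stage_directions`, `stage_directions_for`, `speech_of`, `lines_per_act`) are hypothetical names chosen for this example, and the statistics are counted per act because the sample has no scene column.

```python
from collections import Counter

COLUMNS = [
    "Play", "Act", "Character", "Line",
    "Stage_Directions", "Line_with_stage_directions", "Type",
]

# The three sample rows from the table above.
ROWS = [dict(zip(COLUMNS, values)) for values in [
    ("andreev-ne-ubiy", "ДЕЙСТВИЕ ПЕРВОЕ", "", "",
     "Входит дворник Яков.", "", "stage_direction"),
    ("andreev-ne-ubiy", "ДЕЙСТВИЕ ПЕРВОЕ", "Яков",
     "Что задумались, Василиса Петровна? Я пришел.", "", "", "line"),
    ("andreev-ne-ubiy", "ДЕЙСТВИЕ ПЕРВОЕ", "Василиса Петровна",
     "Да вот думаю все.", "(не поднимая головы и не меняя позы)",
     "(не поднимая головы и не меняя позы) Да вот думаю все.", "line"),
]]


def all_stage_directions(rows):
    """Every stage direction, whether it has its own row or not."""
    return [r["Stage_Directions"] for r in rows if r["Stage_Directions"]]


def stage_directions_for(rows, character):
    """Stage directions that sit inside one character's speech."""
    return [r["Stage_Directions"] for r in rows
            if r["Character"] == character and r["Stage_Directions"]]


def speech_of(rows, character, with_stage=False):
    """A character's lines, optionally with inline stage directions."""
    column = "Line_with_stage_directions" if with_stage else "Line"
    # Rows without inline stage directions leave the combined column
    # empty in the sample, so fall back to the plain line.
    return [r[column] or r["Line"] for r in rows
            if r["Type"] == "line" and r["Character"] == character]


def lines_per_act(rows):
    """Number of spoken lines in each act."""
    return Counter(r["Act"] for r in rows if r["Type"] == "line")


print(all_stage_directions(ROWS))
print(speech_of(ROWS, "Василиса Петровна", with_stage=True))
print(lines_per_act(ROWS))
```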

lehkost commented 4 years ago

Won't the existing API function spoken-text-by-character do the trick? It returns the text spoken by each character, but not separated by acts or scenes: it is always all the text uttered by a character throughout the whole play (in JSON or CSV).

So, do we additionally need a chronological account of what characters say, as your example suggests?

As for stage directions, it is a tricky question which character a stage direction can be assigned to. So I would leave them out of this API function for now. It would be nice to have this information encoded in the TEI at some point, but for the time being we don't have this information.

Pozdniakov commented 4 years ago

It is somewhat encoded in TEI already: I think a stage direction can be assigned based on whether the <stage> node is inside or outside of an <sp> node (I am not sure about the terminology; it is called a node, right?).

For example:

Outside, so it should be read as a separate line without a character:

<stage>Входит дворник Яков.</stage>
      <sp who="#yakov">
        <speaker>Яков.</speaker>
        <p>Что задумались, Василиса Петровна? Я пришел.</p>
      </sp>

Play	Act	Character	Line	Stage_Directions	Line_with_stage_directions	Type
andreev-ne-ubiy	ДЕЙСТВИЕ ПЕРВОЕ			Входит дворник Яков.		stage_direction

Inside the <sp> node, so it should be read as part of a character's line:

<sp who="#yakov">
            <speaker>Яков.</speaker>
            <p>Согреешься!</p>
            <stage>Вынимает из кармана полбутылки водки и небольшой кабацкий и исщербленный по краям
            стаканчик, наливает и подносит Василисе Петровне.</stage>
            <p>Выкушайте на здоровье, Василиса Петровна.</p>
          </sp>

Play	Act	Character	Line	Stage_Directions	Line_with_stage_directions	Type
andreev-ne-ubiy	ДЕЙСТВИЕ ПЕРВОЕ	Василиса Петровна	Да вот думаю все.	(не поднимая головы и не меняя позы)	(не поднимая головы и не меняя позы) Да вот думаю все.	line

This approach seems reasonable to me.
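Here is a minimal sketch of the inside/outside rule, using Python's standard `xml.etree.ElementTree`. It is not how dracor-api does or would implement this. Several things are assumptions made for the example: the wrapping `<div>` with a `<head>` for the act, the shortened stage direction, stripping the trailing period from `<speaker>`, and the function name `rows_for_act`. Real TEI files use the TEI namespace, so the sketch compares local tag names only. A `<stage>` nested inside a `<p>` is not handled separately here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <div>/<head> wrapper is assumed, and the
# inner stage direction is shortened compared to the play.
SAMPLE = """<div type="act">
  <head>ДЕЙСТВИЕ ПЕРВОЕ</head>
  <stage>Входит дворник Яков.</stage>
  <sp who="#yakov">
    <speaker>Яков.</speaker>
    <p>Согреешься!</p>
    <stage>Вынимает из кармана полбутылки водки.</stage>
    <p>Выкушайте на здоровье, Василиса Петровна.</p>
  </sp>
</div>"""


def local(tag):
    """Drop a '{namespace}' prefix, if any, from a tag name."""
    return tag.rsplit("}", 1)[-1]


def text_of(element):
    """All text inside an element, with whitespace normalized."""
    return " ".join("".join(element.itertext()).split())


def rows_for_act(div, play):
    """Turn the direct children of one act <div> into table rows."""
    act, rows = "", []
    for child in div:
        name = local(child.tag)
        if name == "head":
            act = text_of(child)
        elif name == "stage":
            # Outside <sp>: a row of its own, without a character.
            rows.append({"Play": play, "Act": act, "Character": "",
                         "Line": "", "Stage_Directions": text_of(child),
                         "Line_with_stage_directions": "",
                         "Type": "stage_direction"})
        elif name == "sp":
            speaker, spoken, stage, mixed = "", [], [], []
            for part in child:
                part_name = local(part.tag)
                if part_name == "speaker":
                    speaker = text_of(part).rstrip(".")
                elif part_name == "stage":
                    # Inside <sp>: belongs to this character's row.
                    stage.append(text_of(part))
                    mixed.append(text_of(part))
                elif part_name in ("p", "l", "lg"):
                    spoken.append(text_of(part))
                    mixed.append(text_of(part))
            rows.append({"Play": play, "Act": act, "Character": speaker,
                         "Line": " ".join(spoken),
                         "Stage_Directions": " ".join(stage),
                         "Line_with_stage_directions":
                             " ".join(mixed) if stage else "",
                         "Type": "line"})
    return rows


rows = rows_for_act(ET.fromstring(SAMPLE), "andreev-ne-ubiy")
for row in rows:
    print("\t".join(row.values()))
```

Rows are emitted in document order, so the sequence of lines and stand-alone stage directions is preserved.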

dracor-org / dracor-api

Add API function to extract text for a play in table format #92