dracor-org / dracor-api

eXistdb application for dracor.org
MIT License

Add API function to extract text for a play in table format #92

Open Pozdniakov opened 4 years ago

Pozdniakov commented 4 years ago

I think it would be very helpful to be able to download the text of a play through the API in a table format. Something like this:

| Play | Act | Character | Line |
| --- | --- | --- | --- |
| andreev-ne-ubiy | ДЕЙСТВИЕ ПЕРВОЕ | Яков | Что задумались, Василиса Петровна? Я пришел. |
| andreev-ne-ubiy | ДЕЙСТВИЕ ПЕРВОЕ | Василиса Петровна | Да вот думаю все. |

It can be either JSON or CSV; that is not a problem. In addition, there could be author, title and corpus name columns, but they are not necessary: the table can simply be merged with the play metadata in further analysis.
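To illustrate that the choice of format is not critical, here is a minimal sketch that serializes the same rows both ways with the Python standard library. The column names follow the table above; the function names `to_csv` and `to_json` are hypothetical and not part of dracor-api.

```python
import csv
import io
import json

# Column names taken from the proposed table; everything else is illustrative.
COLUMNS = ["Play", "Act", "Character", "Line"]

ROWS = [
    {"Play": "andreev-ne-ubiy", "Act": "ДЕЙСТВИЕ ПЕРВОЕ", "Character": "Яков",
     "Line": "Что задумались, Василиса Петровна? Я пришел."},
    {"Play": "andreev-ne-ubiy", "Act": "ДЕЙСТВИЕ ПЕРВОЕ", "Character": "Василиса Петровна",
     "Line": "Да вот думаю все."},
]


def to_csv(rows):
    """Serialize rows as CSV text; fields containing commas are quoted automatically."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Serialize rows as JSON, keeping Cyrillic readable instead of \\u escapes."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    print(to_csv(ROWS))
    print(to_json(ROWS))
```

Both outputs round-trip to the same list of rows, so the metadata merge mentioned above works the same way in either case.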

Such an API command would make text analysis easy: the output can be tokenized by word and attributed to characters and scenes while preserving the original sequence of lines.

Stage directions could be a problem, and I don't have a straightforward solution for them. What makes it even more complicated is that there are different kinds of stage directions:

<sp who="#vasilisa_petrovna">
            <speaker>Василиса Петровна</speaker>
            <stage>(вздыхает).</stage>
            <p>Ну, давай. Думала ли я когда-нибудь, Яшенька, что буду вот так сидеть... с
            Яшей-дворником и водку пить. Много мне гадалки гадали, а такого случая ни одна угадать
            не могла. Ух, как холодно, руки, ноги болят.</p>
          </sp> 

I have a few solutions, but all of them seem suboptimal to me. One idea is to simply omit stage directions. Another is to do it this way:

| Play | Act | Character | Line | Stage_Directions | Line_with_stage_directions | Type |
| --- | --- | --- | --- | --- | --- | --- |
| andreev-ne-ubiy | ДЕЙСТВИЕ ПЕРВОЕ | | | Входит дворник Яков. | | stage_direction |
| andreev-ne-ubiy | ДЕЙСТВИЕ ПЕРВОЕ | Яков | Что задумались, Василиса Петровна? Я пришел. | | | line |
| andreev-ne-ubiy | ДЕЙСТВИЕ ПЕРВОЕ | Василиса Петровна | Да вот думаю все. | (не поднимая головы и не меняя позы) | (не поднимая головы и не меняя позы) Да вот думаю все. | line |

So, two types of stage directions are considered: inside an <sp> node and outside of <sp>. Inside stage directions come along with the character's speech; outside stage directions get their own row, so the order of the lines is preserved.
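A minimal sketch of how such rows could be produced from TEI, using only `xml.etree.ElementTree`. It is not how dracor-api works (that is an eXistdb application); the function names, the shortened sample fragment, and the assumption that acts are marked up as `<div type="act">` with a `<head>` are all mine.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical, heavily shortened TEI fragment modelled on the examples in this issue.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="act">
  <head>ДЕЙСТВИЕ ПЕРВОЕ</head>
  <stage>Входит дворник Яков.</stage>
  <sp who="#yakov">
    <speaker>Яков</speaker>
    <p>Что задумались, Василиса Петровна? Я пришел.</p>
  </sp>
  <sp who="#vasilisa_petrovna">
    <speaker>Василиса Петровна</speaker>
    <stage>(не поднимая головы и не меняя позы)</stage>
    <p>Да вот думаю все.</p>
  </sp>
</div>
</body></text></TEI>"""


def text_of(elem):
    """Return all text inside an element with whitespace collapsed."""
    return " ".join("".join(elem.itertext()).split())


def make_row(play, act, character, line, stage, combined, kind):
    return {"Play": play, "Act": act, "Character": character, "Line": line,
            "Stage_Directions": stage, "Line_with_stage_directions": combined,
            "Type": kind}


def speech_row(sp, play, act):
    """Build one row for an <sp>; <stage> children count as 'inside' directions."""
    speaker = sp.find(TEI_NS + "speaker")
    spoken, stage, combined = [], [], []
    for part in sp:
        if part is speaker:
            continue
        txt = text_of(part)
        # Limitation: a <stage> nested inside <p> or <l> is treated as spoken text.
        (stage if part.tag == TEI_NS + "stage" else spoken).append(txt)
        combined.append(txt)
    name = text_of(speaker) if speaker is not None else ""
    return make_row(play, act, name, " ".join(spoken), " ".join(stage),
                    " ".join(combined), "line")


def walk(elem, play, act, rows):
    """Visit children in document order so the original sequence is preserved."""
    for child in elem:
        if child.tag == TEI_NS + "div":
            head = child.find(TEI_NS + "head")
            # Assumption: only <div type="act"> changes the Act column.
            is_act = child.get("type") == "act" and head is not None
            walk(child, play, text_of(head) if is_act else act, rows)
        elif child.tag == TEI_NS + "stage":
            # 'Outside' stage direction: gets its own row.
            rows.append(make_row(play, act, "", "", text_of(child), "", "stage_direction"))
        elif child.tag == TEI_NS + "sp":
            rows.append(speech_row(child, play, act))


def play_to_rows(tei_xml, play):
    body = ET.fromstring(tei_xml).find(f".//{TEI_NS}body")
    rows = []
    walk(body, play, "", rows)
    return rows


if __name__ == "__main__":
    for row in play_to_rows(SAMPLE, "andreev-ne-ubiy"):
        print(row)
```

For the sample fragment this yields three rows in document order: one `stage_direction` row followed by two `line` rows.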

This table can be easily processed to solve many basic tasks: extracting all stage directions, extracting stage directions for a specific character, extracting a character's speech with or without stage directions, calculating summary statistics for each scene, etc.
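As a rough illustration of two of these tasks, here is a sketch that works on rows in the proposed layout. The helper names are hypothetical, the sample rows are the ones from this issue, and word counting is naive whitespace splitting.

```python
from collections import Counter

# Sample rows in the proposed seven-column layout (values taken from this issue).
ROWS = [
    {"Play": "andreev-ne-ubiy", "Act": "ДЕЙСТВИЕ ПЕРВОЕ", "Character": "",
     "Line": "", "Stage_Directions": "Входит дворник Яков.",
     "Line_with_stage_directions": "", "Type": "stage_direction"},
    {"Play": "andreev-ne-ubiy", "Act": "ДЕЙСТВИЕ ПЕРВОЕ", "Character": "Яков",
     "Line": "Что задумались, Василиса Петровна? Я пришел.", "Stage_Directions": "",
     "Line_with_stage_directions": "", "Type": "line"},
    {"Play": "andreev-ne-ubiy", "Act": "ДЕЙСТВИЕ ПЕРВОЕ", "Character": "Василиса Петровна",
     "Line": "Да вот думаю все.",
     "Stage_Directions": "(не поднимая головы и не меняя позы)",
     "Line_with_stage_directions": "(не поднимая головы и не меняя позы) Да вот думаю все.",
     "Type": "line"},
]


def stage_directions(rows, character=None):
    """All stage directions, or only those attached to one character's speeches."""
    return [r["Stage_Directions"] for r in rows
            if r["Stage_Directions"]
            and (character is None or r["Character"] == character)]


def words_per_character(rows):
    """Count whitespace-separated tokens of spoken text per character."""
    counts = Counter()
    for r in rows:
        if r["Type"] == "line":
            counts[r["Character"]] += len(r["Line"].split())
    return counts


if __name__ == "__main__":
    print(stage_directions(ROWS))
    print(stage_directions(ROWS, "Василиса Петровна"))
    print(words_per_character(ROWS))
```

Grouping by `Act` instead of `Character` would give the per-act statistics in the same way.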

lehkost commented 4 years ago

Would the existing API function spoken-text-by-character not do the trick? It returns the text spoken by each character (in JSON or CSV), but not separated by acts or scenes: it is always all the text uttered by a character throughout the whole play.

So, do we additionally need a chronological account of what characters say as your example suggests?

As for stage directions, it is a tricky question which character a stage direction can be assigned to. So I would leave them out of this API function for now. It would be nice to have this information encoded in the TEI at some point, but for the time being we don't have this information.

Pozdniakov commented 4 years ago

It is somewhat encoded in the TEI already: I think a stage direction can be assigned based on whether the <stage> node is inside or outside of an <sp> node (I am not sure about the terminology; it is called a node, right?).

For example:

| Play | Act | Character | Line | Stage_Directions | Line_with_stage_directions | Type |
| --- | --- | --- | --- | --- | --- | --- |
| andreev-ne-ubiy | ДЕЙСТВИЕ ПЕРВОЕ | | | Входит дворник Яков. | | stage_direction |

<sp who="#yakov">
            <speaker>Яков.</speaker>
            <p>Согреешься!</p>
            <stage>Вынимает из кармана полбутылки водки и небольшой кабацкий и исщербленный по краям
            стаканчик, наливает и подносит Василисе Петровне.</stage>
            <p>Выкушайте на здоровье, Василиса Петровна.</p>
          </sp> 

| Play | Act | Character | Line | Stage_Directions | Line_with_stage_directions | Type |
| --- | --- | --- | --- | --- | --- | --- |
| andreev-ne-ubiy | ДЕЙСТВИЕ ПЕРВОЕ | Василиса Петровна | Да вот думаю все. | (не поднимая головы и не меняя позы) | (не поднимая головы и не меняя позы) Да вот думаю все. | line |

This way seems rational to me.