dracor-org / engdracor

English Drama Corpus

Adding <div type="configuration"> #46

Open lucagiovannini7 opened 1 year ago

lucagiovannini7 commented 1 year ago

Many Early Print plays are not split into scenes or acts, and all text is included inside a single <div type="play"> (example). Can I add some <div type="configuration"> to the files in the TEI folder, at least for the plays which I am using in my dissertation? This improves dramatically both the computation of all segment-based metrics and the network visualisation.

peertrilcke commented 1 year ago

Hi @lucagiovannini7,

Good point!

(1) My opinion on this in a nutshell: For your PhD thesis, working with configurations is definitely an option worth considering. It may even be the only option. For the DraCor infrastructure, however, it would be a major change that would be extremely time-consuming (we actually talked about it last week in Potsdam; the consequences for large parts of the infrastructure are significant).

(2) In principle, configurations are a useful addition, maybe even a convincing alternative way of constructing networks. A systematic evaluation of how the two construction methods (scene/act-based vs. configuration-based) differ would be a separate study, one I have long been waiting for. It would be conceivable to consider other construction methods as well (e.g. window-based). Maybe you could write this study? Or it becomes part of your PhD thesis? But overall this is important: in any case, you should not combine the different construction methods; they are simply different.

(3) Just as an aside: I don't understand what you mean when you say that the calculation and visualization are "improved" by using configurations for network construction. I think that is wrong. The networks are correct as they are; they just follow a particular construction method. What you want is a different construction method. To say that this is "better" (in the sense of "improved"), you would need a "real network" or some other kind of reference against which you can compare the constructed network. However, plays are not networks. There are only different ways to construct them as networks.

(4) But back to your point: to include configurations altogether in DraCor would mean that a) we would have to add configuration information for all plays in DraCor; and that b) we would have to implement a second option (layer, modules, etc.) for visualizing and computing networks. So we would have to duplicate many parts of DraCor, so to speak. This is what I mean by Major Changes.

(5) An alternative: (a) we at least allow configuration information in the DraCor XMLs, but do nothing with it on DraCor (frontend, metrics service, API, etc.). This would be something to discuss from my point of view (what do you think, @lehkost, @ingoboerner, @cmil?). Another option would be: (b) You work on your corpus in a separate Luca branch, where you include the configuration info for all plays. But please keep in mind: if you do, then imho you have to do it for all plays.

(6) I am skeptical whether the configuration information should be mapped using another div structure. Configurations can also be stable across e.g. scene boundaries. In this respect, I would rather be in favor of a kind of marker for configuration changes, not a new div structure. This should be discussed.

(7) So my suggestion to you, Luca, would be:

Maybe we should discuss this further in person?

Ciao!
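
The point in (2) and (3), that different construction methods yield different networks from the same text, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy speaker sequence, the function names `cooccurrence_edges` and `window_segments`, and the rule that two characters are linked when they speak in the same segment. It is not DraCor's actual implementation.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(segments):
    """Count, for each pair of speakers, how many segments they share."""
    edges = Counter()
    for speakers in segments:
        # Each unordered pair of distinct speakers in a segment gets one count.
        for pair in combinations(sorted(set(speakers)), 2):
            edges[pair] += 1
    return edges


def window_segments(speech_order, size):
    """Cut a flat sequence of speaker IDs into consecutive windows of `size` speeches."""
    return [speech_order[i:i + size] for i in range(0, len(speech_order), size)]


# Hypothetical toy play: five speeches by three characters.
speech_order = ["a", "b", "a", "c", "b"]

# One undivided segment (a single div for the whole play): everyone is linked.
whole_play = cooccurrence_edges([speech_order])

# The same speeches split by hand into two segments.
segmented = cooccurrence_edges([["a", "b", "a"], ["c", "b"]])

# Window-based construction with non-overlapping windows of two speeches.
windowed = cooccurrence_edges(window_segments(speech_order, 2))

for name, edges in [("whole play", whole_play), ("segmented", segmented), ("windowed", windowed)]:
    print(name, sorted(edges.items()))
```

The three edge lists differ although the text is the same, which is why results from different construction methods should not be mixed within one analysis.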

lehkost commented 1 year ago

Hi Luca, hi Peer – I agree we should not include configuration divs in any of the general upstream corpora, but they are fine in personal corpora like yours, Luca.

Another way of coming closer to configurations rather than act/scene breaks for extracting networks is to encode stage directions with "exit" and "enter" attributes, assigning them to character IDs. There is a bachelor thesis on this topic by Lena Ehlers: "Extraktion von Figurenauf- und -abtritten aus XML-codierten Dramatexten". It was also presented as a poster at DHd2023, see the book of abstracts. The same problem applies here: if we start it, we should do this for all plays in all corpora, which is obviously too much of a task. But having this additional standard TEI markup in some texts wouldn't hurt, as we don't extract any of this information (yet).
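
As a rough illustration of how such markup could be used, the sketch below walks through a small, made-up TEI fragment and tracks who is on stage. The encoding is an assumption (stage directions carrying `type="entrance"` or `type="exit"` plus a `who` attribute pointing to character IDs), as are the sample text and the function name `configurations`. It is not the method of the thesis mentioned above, and DraCor does not currently extract this information.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample; the type/who encoding of <stage> is an assumption.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0" type="play">
  <stage type="entrance" who="#a #b">Enter A and B.</stage>
  <sp who="#a"><p>First speech.</p></sp>
  <sp who="#b"><p>Second speech.</p></sp>
  <stage type="exit" who="#b">Exit B.</stage>
  <stage type="entrance" who="#c">Enter C.</stage>
  <sp who="#c"><p>Third speech.</p></sp>
</div>"""


def configurations(div):
    """Return the successive sets of on-stage characters during which someone speaks."""
    on_stage = set()
    result = []
    # iter() visits elements in document order, so stage directions
    # take effect before the speeches that follow them.
    for el in div.iter():
        if el.tag == TEI + "stage":
            ids = {ref.lstrip("#") for ref in el.get("who", "").split()}
            if el.get("type") == "entrance":
                on_stage |= ids
            elif el.get("type") == "exit":
                on_stage -= ids
        elif el.tag == TEI + "sp":
            current = frozenset(on_stage)
            # Record a new configuration only when the on-stage set has changed.
            if not result or result[-1] != current:
                result.append(current)
    return result


configs = configurations(ET.fromstring(SAMPLE))
for number, config in enumerate(configs, start=1):
    print(number, sorted(config))
```

Note that a configuration derived this way can persist across a scene boundary or change within a scene, which is the reason given in (6) above for preferring markers over a new div structure.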

lucagiovannini7 commented 1 year ago

Hi @peertrilcke, thanks for this detailed answer! We should definitely discuss that in person at some point. My question here was actually restricted to EPDraCor, but now I see it all comes back to some misunderstanding on my part about the concept of configuration in DraCor.

While I was aware of its theoretical meaning as defined by S. Marcus, I always thought that, in our DraCor encodings, we inserted configuration divs when no explicit scene divs were provided, with the aim of providing some segmentation and making all algorithms work.

In most early modern theatre, segmentation can often be inferred from the play's content and stage directions (e.g. from expressions like "Exeunt", "Curtains", etc.) but is not always encoded in the XML files (when provided). To put it bluntly: scenes are there, but they are not marked as such, as in the play by Johnson I linked above.

Since many of our algorithms rely on segments to perform their calculations, I felt segmentation, when clearly visible in the text but not encoded yet, needed to be added. I knew I shouldn't add artificial scene divs and I found no other viable element in the ODD, so I added configurations for all plays in my corpus which did not have other segmentation.

What I see now, however, is that one should call these divs something else, to avoid confusion with the concept of configuration as you mean it. Perhaps one could just call them scenes, but record in the header that they have been added as an editorial intervention (that's what I did when I added configurations, btw). If this is your preferred syntax, let me know and I'll update the markup in all my plays.

To (2): the topic of configuration- vs. scene-based segmentations is extremely interesting and I would definitely like to explore it in the future!

I would be interested also to hear @DanilSko on this topic, since I believe he might have had similar issues.

EDIT: thanks @lehkost, I also spoke with the poster's authors at the DHd23 about that, I don't think it suits my needs but I will definitely check it out further!

peertrilcke commented 1 year ago

So, if "scenes are there" but not represented in the (historical) print, maybe just add them. I think @lehkost did it that way.

lucagiovannini7 commented 1 year ago

As it often happens, it seems the easiest solution was actually the right one 😃 I will do that, documenting what I did in the revisionDesc. So, going back to the original question, can I do the same for (some) Early Print plays here? I think @cmil confirmed that editorial interventions on TEIs can and will be preserved even after updates from the source corpus.

cmil commented 1 year ago

> I think @cmil confirmed that editorial interventions on TEIs can and will be preserved even after updates from the source corpus.

@lucagiovannini7 Maybe there was some misunderstanding, but editing the TEI files in this repo would break the XSLT workflow. The changes would not survive an update from the sources. For that to work the changes would have to be made on the dracor branch of the epdracor-sources repo.

DanilSko commented 1 year ago

Maybe I missed some nuances, but this discussion sounds like Luca's offering something that's never been done on dracor. But we do have quite a few plays with div type configuration in some 'general' corpora, don't we? E.g.:

https://dracor.org/api/corpora/ger/play/gryphius-catharina-von-georgien/tei
https://dracor.org/api/corpora/rus/play/chekhov-chaika/tei
https://dracor.org/api/corpora/ger/play/nestroy-das-haus-der-temperamente/tei
https://dracor.org/api/corpora/ger/play/gryphius-carolus-stuardus/tei

...and more (I'd say there are at least 8 in GerDraCor alone, plus a number of Russian ones).

So I am not sure I understand you, Frank @lehkost, when you're saying "I agree we should not include configuration divs in any of the general upstream corpora". We've been doing that for a long time. My impression is that the only difference in Luca's case is that it is harder to implement, since Luca can't work on the TEI/XMLs directly, as they are constantly re-created from source. But technical issues aside, I do not really understand the reasons not to add these divs.

Same goes for Peer's (@peertrilcke) suggestion to "Add configuration info to one DraCor XML as a kind of test case". Either I do not understand some major difference here or... this was already done years ago. We have at least a dozen such DraCor XMLs on DraCor. See the four linked in this message above.

lucagiovannini7 commented 1 year ago

@cmil you are right, I forgot that. Technical question: suppose that one wants to add a div type="scene" after the div type="play" in this chunk (example taken from here):

<body xml:id="A00456-e100670">
   <div type="play" xml:id="A00456-e100680">
    <pb facs="tcp:4660:3" xml:id="A00456-003-a"/>
    <sp who="A00456-virginius" xml:id="A00456-e100690">
     <stage xml:id="A00456-e100700">
      <w lemma="enter" pos="vvb" xml:id="A00456-003-a-0010">Enter</w>
      <w lemma="Virginius" pos="nn1" xml:id="A00456-003-a-0020">Virginius</w>
      <pc unit="sentence" xml:id="A00456-003-a-0030">.</pc>
     </stage>

Should one also add +10 to all xml:ids to keep the numbering of sp, stage and div progressive? Possible output:

<body xml:id="A00456-e100670">
   <div type="play" xml:id="A00456-e100680">
    <pb facs="tcp:4660:3" xml:id="A00456-003-a"/>
     <div type="scene" xml:id="A00456-e100690">
      <sp who="A00456-virginius" xml:id="A00456-e100700">
       <stage xml:id="A00456-e100710">
        <w lemma="enter" pos="vvb" xml:id="A00456-003-a-0010">Enter</w>
        <w lemma="Virginius" pos="nn1" xml:id="A00456-003-a-0020">Virginius</w>
        <pc unit="sentence" xml:id="A00456-003-a-0030">.</pc>
       </stage>

cmil commented 1 year ago

@lucagiovannini7 Don't change existing IDs. The numbers in the existing IDs have intervals of 10 which leaves some space to insert new IDs without having to adjust the sequence. Just increment the number of the ID of the parent or previous sibling when you add an element. In your example the new div should get the ID "A00456-e100681", or "A00456-e100685" if you want to leave some space for later insertions.
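
The numbering rule described above can be sketched as a small helper. The function name `new_sibling_id`, its parameters, and the collision check are assumptions made for illustration; they are not part of the EarlyPrint or DraCor tooling.

```python
import re


def new_sibling_id(reference_id, existing_ids, offset=1):
    """Derive an ID for an inserted element from its parent's or previous sibling's ID.

    The trailing number of `reference_id` is incremented by `offset`;
    existing IDs are never modified.
    """
    match = re.fullmatch(r"(.*?)(\d+)", reference_id)
    if match is None:
        raise ValueError(f"no trailing number in {reference_id!r}")
    prefix, digits = match.groups()
    # Keep the original zero-padded width of the numeric part.
    candidate = f"{prefix}{int(digits) + offset:0{len(digits)}d}"
    if candidate in existing_ids:
        raise ValueError(f"{candidate!r} is already in use")
    return candidate


# IDs from the excerpt above; they stay untouched.
existing = {"A00456-e100670", "A00456-e100680", "A00456-e100690", "A00456-e100700"}

# New <div type="scene"> inserted as first child of the div with ID ...e100680.
print(new_sibling_id("A00456-e100680", existing))
print(new_sibling_id("A00456-e100680", existing, offset=5))
```

With the default offset the new div gets "A00456-e100681"; with `offset=5` it gets "A00456-e100685", leaving room for later insertions. An offset of 10 would collide with the existing `sp` ID and raises an error instead of renumbering anything.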

peertrilcke commented 1 year ago

> Maybe I missed some nuances, but this discussion sounds like Luca's offering something that's never been done on dracor. But we do have quite a few plays with div type configuration in some 'general' corpora, don't we? E.g. : https://dracor.org/api/corpora/ger/play/gryphius-catharina-von-georgien/tei https://dracor.org/api/corpora/rus/play/chekhov-chaika/tei https://dracor.org/api/corpora/ger/play/nestroy-das-haus-der-temperamente/tei https://dracor.org/api/corpora/ger/play/gryphius-carolus-stuardus/tei ...and more (I'd say there's at least 8 in GerDraCor alone, plus a number of Russian ones). So I am not sure I understand you, Frank @lehkost , when you're saying I agree we should not include configuration divs in any of the general upstream corpora We've been doing that for a long time. My impression is that the only difference in Luca's case is just that it is harder to implement, since Luca cant work on TEI/XML-s directly, since they are constantly re-created from source. But technical issues aside, I do not really understand the reasons not to add these divs. Same goes for @peertrilcke Peer's suggestion to Add configuration info to one DraCor XML as a kind of test case Either I do not understand some major difference here or.. this has been done already years ago. We have at least a dozen of such DraCor XMLs on DraCor. See four of them linked in this message above.

Yes you are right. Unfortunately, the examples are imho based on a false notion of configuration. In my view, these are not examples of good practice or even acceptable practice, but would have to be corrected. I would not use div type "configuration" in these cases, but rather something neutral like "segment" (or simply "scene").

cmil commented 1 year ago

Two things: First, maybe we should distinguish between in-house corpora and imported ones. While for in-house corpora it's ok to mark up the plays the way we want, this may not be practical for imported corpora because we do not have the resources to do it consistently for the entire corpus, or the upstream sources may even have a different idea of how to do things.

Second, I'm not a literary scholar, but looking at both the TEI and the EarlyPrint text of the mentioned Dekker play, I don't find it obvious where scenes change. So adding respective markup as an adjustment to the original sources might be controversial. (The general idea of editing the TEI in epdracor-sources is to correct obvious mistakes or inconsistencies rather than making conceptual additions.)

Maybe, @lucagiovannini7, this may be a point where you want to consider setting up a repo of your own containing only the pieces you are working on for your thesis. Then you would be free to augment the TEI in any way that fits your research. And it would probably make doing this much easier than working on the original EarlyPrint TEIs.

lucagiovannini7 commented 1 year ago

I agree with your last point, and this was actually long overdue. At first, my idea was contributing to the DraCor plays I use in my dissertation with edits and fixes, but this is not time- and resource-efficient, and I don't want to take up other people's time to review my edits or discuss editorial decisions. Two thirds of my corpus were already made of own-encoded TEIs running on a local dockerized DraCor, so now I'll fork the remaining 1/3 and modify them as I find suitable. At the end of the project, I will commit (as agreed with @peertrilcke) the plays which were not previously in DraCor, and perhaps we will assess if the modifications I did to existing files were sound.

lehkost commented 1 year ago

A quick note on the DraCor plays that already have configuration divs. The 4 late Chekhov plays in RusDraCor were a test bed and I wouldn't continue work along those lines.

The GerDraCor examples didn't come out of nowhere; they were "copied" from other editions of the play that, unlike our editions, had the scene breaks. Another way to mark scene breaks is <div type="location"> if there is such information (example: https://dracor.org/id/ger000500).

Nestroy's "Haus der Temperamente" is a very different case as there are 4 parallel plot threads/stages that reach into each other and we used configuration divs in two cases when all four plots/stages were speaking at once.

So, with the exception of the 4 Chekhov plays, we didn't "invent"/introduce configuration divs, and I would recommend thinking twice before starting such an endeavour. Judging from my experience, it's a pitfall and the data is not very usable beyond your specific use case (which would be network visualisations). However, I think it's okay for single-purpose corpora, but not for upstream corpora.

P.S. In the long run, I'm of course open to changing the "configuration" value of the div type attribute in existing cases, as it's been used as a placeholder for different purposes, as I detailed above.