XML Structure Ideas and Regex

ebeshero commented 3 years ago

<xml>
  <header><!--metadata in here -->
    <title>.....</title>
    <screenWriter>....</screenWriter>
    <drafts>
      <revision color="..." date="yyyy-mm-dd"/>
      <revision color="..." date="yyyy-mm-dd"/>
      <revision color="..." date="yyyy-mm-dd"/>
    </drafts>
    <productionStart date="yyyy-mm-dd"/> <!--Optional: if available. -->
    <release date="yyyy-mm-dd"/><!--Look this up and verify -->
    <digitalProject>
      <editor who="bca">Bianka Alexander</editor>
      <span from="yyyy-mm-dd" to="yyyy-mm-dd"/>
      <desc>.......</desc>
      <sourceText>
        <siteName>.....</siteName>
        <link><!--Website URL --></link>
        <dateRetrieved>yyyy-mm-dd</dateRetrieved>
      </sourceText>
    </digitalProject>
  </header>
  <script>
    <scene n="5">
      <stage type="setting">EXT. SPINNER’S END - LATE AFTERNOON (MOMENTS LATER)</stage>
      <stage type="business">Like a rat in a maze, Narcissa makes her way through a labyrinth of dilapidated brick houses. Bellatrix trails.</stage>
      <sp who="BELLATRIX">Cissy! You mustn’t do this. He can’t be trusted.</sp>
      <sp who="NARCISSA">The Dark Lord trusts him.</sp>
    </scene>
  </script>
</xml>
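
As a quick check that this structure is easy to query, here is a minimal Python sketch that reads a trimmed copy of the model and lists the speeches in each scene. The function name `speeches_by_scene` is made up for illustration, and the header is left out of the sample to keep it short.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the model above; the header is omitted for brevity.
MODEL = """<xml>
  <script>
    <scene n="5">
      <stage type="setting">EXT. SPINNER’S END - LATE AFTERNOON (MOMENTS LATER)</stage>
      <sp who="BELLATRIX">Cissy! You mustn’t do this. He can’t be trusted.</sp>
      <sp who="NARCISSA">The Dark Lord trusts him.</sp>
    </scene>
  </script>
</xml>"""


def speeches_by_scene(xml_text):
    """Return {scene number: [(speaker, speech text), ...]} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    result = {}
    for scene in root.iter("scene"):
        result[scene.get("n")] = [(sp.get("who"), sp.text) for sp in scene.iter("sp")]
    return result


for n, speeches in speeches_by_scene(MODEL).items():
    for who, text in speeches:
        print(n, who, text)
```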

ebeshero commented 3 years ago

@beealex After our meeting, I experimented with regex to convert Order of the Phoenix to XML following your new schema, and it turned out well! I was worried this would be very difficult and wasn't ready to talk you through it during the meeting before I had a good idea of how to approach it. I think I do now!

Here's the new XML, pretty-printed! You can finish off the header element for this, and see if the structure makes sense: https://github.com/beealex/harryPotter/blob/master/xml/HP-OrderofthePhoenix.xml

I did not actually record my regex steps (sorry!), but if most of the screenplays follow the same layout as Order of the Phoenix, this is roughly how I approached it:

1) Find the lines that begin with a run of block capital letters. Use the "close-open" strategy to insert </sp>\n<sp> in front of each one, so every line that starts with block caps opens a new speech. (This leaves an extra </sp> at the top of the file; move it to the end.)

2) Then look for the <sp> tags that are followed by EXT. or INT. using <sp>([EXINT]+\.) and turn those into scene boundaries (use the "close-open" strategy again): </scene>\n<scene>\1 (Again, move the extra </scene> from the top of the file to the end.)

3) Then use regex to tag the next portion as a stage direction with @cat and @where

4) After that we have reliable scene boundaries. That leaves two problems: stage directions are mixed in with speeches, and the speeches need @who attributes. Since nearly all speeches put the speaker's name alone on a line, I used that pattern to isolate the real speeches and add @who attributes to them. These now look like:

<sp who="#DUMBLEDORE">....</sp>

When a name contains spaces or punctuation, I used regex to remove them. I was able to search for patterns of these, like MRS. WEASLEY, and edit them down like this:

<sp who="#MRSWEASLEY">....</sp>

5) Then we look at what's left: it's all <sp>....</sp> without @who attributes. There are various issues. Some of these really are speeches (like NEARLY-HEADLESS NICK, which I missed because the name has a hyphen inside). You can fix those with regex. Eventually you get to a point where everything left inside <sp>....</sp> is a stage direction, so you can use regex to retag those as well.

6) Finally, there are a number of places where we see a short speech followed by a couple of returns, and then stage directions. These are all tucked inside speeches, but the stage directions start on new lines. I found lots of these by searching for two newlines followed by text that contains no angle brackets and runs up to the close tag </sp>:

(\n\n[^<]+?$[^<]+?)(</sp>)

When I surveyed those, they were all long stage directions basically inside speeches, so I replaced with:

<stage>\1</stage>\2
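
For anyone repeating step 6 outside the editor, here is a minimal Python sketch of the same find and replace. It assumes `re.MULTILINE` is needed so that `$` matches at the end of each line, which may or may not mirror the settings of the original search tool. The sample speech is made up for illustration, not taken from the screenplay, and `wrap_trailing_stage` is a hypothetical helper name.

```python
import re

# Pattern from step 6. re.MULTILINE makes "$" match at the end of every line
# (an assumption about how the original search was configured).
STAGE_IN_SPEECH = re.compile(r"(\n\n[^<]+?$[^<]+?)(</sp>)", re.MULTILINE)


def wrap_trailing_stage(text):
    """Wrap the text after a blank line, up to </sp>, in <stage>...</stage>."""
    return STAGE_IN_SPEECH.sub(r"<stage>\1</stage>\2", text)


# Invented sample: a short speech, a blank line, then a two-line stage direction.
sample = (
    '<sp who="#HARRY">I\'m fine.\n'
    "\n"
    "Harry turns away and walks\n"
    "toward the door.</sp>"
)

print(wrap_trailing_stage(sample))
```

Note that the two leading newlines end up inside the new <stage> element, since they are part of the first capture group. Also, because of the `$` in the middle, the pattern only fires when a line break falls somewhere between the blank line and the </sp>.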

I did some manual clean-up, added the <script> element around the stuff I'd regexed, pasted in the header from your XML model, associated the schema, and posted it here.
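
Steps 1 through 4 can be sketched in Python roughly as follows. Only the step 2 pattern comes from the notes above. The patterns for steps 1, 3 and 4 were not recorded, so the ones here are guesses at something equivalent, and step 3 is reduced to a plain <stage> with no attributes. The helper names (`move_leading_close`, `speaker_id`, `convert`) are hypothetical, and the raw layout (scene heading, then speaker alone on a line, then the speech) is an assumption based on the Spinner's End sample in the model.

```python
import re

# Assumed raw layout, reusing the lines from the model at the top of the thread.
RAW = """EXT. SPINNER’S END - LATE AFTERNOON (MOMENTS LATER)
BELLATRIX
Cissy! You mustn’t do this. He can’t be trusted.
NARCISSA
The Dark Lord trusts him.
"""


def move_leading_close(text, close_tag):
    """Move the stray close tag that "close-open" leaves at the top to the end."""
    prefix = close_tag + "\n"
    if text.startswith(prefix):
        text = text[len(prefix):] + "\n" + close_tag
    return text


def speaker_id(name):
    """MRS. WEASLEY -> #MRSWEASLEY: keep only the capital letters."""
    return "#" + re.sub(r"[^A-Z]", "", name)


def convert(raw):
    text = raw.strip()
    # Step 1 (guessed pattern): close-open before every line starting with block caps.
    text = re.sub(r"^([A-Z]{2,})", r"</sp>\n<sp>\1", text, flags=re.MULTILINE)
    text = move_leading_close(text, "</sp>")
    # Step 2 (pattern from the notes): speeches starting with EXT. or INT. become scenes.
    text = re.sub(r"<sp>([EXINT]+\.)", r"</scene>\n<scene>\1", text)
    text = move_leading_close(text, "</scene>")
    # Step 3 (simplified guess): the heading line becomes a stage direction,
    # replacing the </sp> left dangling after it.
    text = re.sub(r"<scene>([^\n<]+)\n</sp>", r"<scene>\n<stage>\1</stage>", text)
    # Step 4 (guessed pattern): a speaker alone on a line becomes @who.
    # Allowing "-" in the name also catches NEARLY-HEADLESS NICK.
    text = re.sub(
        r"<sp>([A-Z][A-Z .\-]+)\n",
        lambda m: f'<sp who="{speaker_id(m.group(1))}">',
        text,
    )
    return text


print(convert(RAW))
```

This sketch drops the speaker's name from the speech text once it is in @who, which is a design choice rather than something the notes confirm. A tighter alternative to `[EXINT]+` would be `(?:EXT|INT)`, since the character class also matches other combinations of those letters.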

ebeshero commented 3 years ago

@beealex Go through that file and see if you find things that need to be corrected...And if it looks good, I think we can get the other files regexed to this point, too!

XML Structure Ideas and Regex #12