ebeshero opened 3 years ago
@beealex After our meeting, I experimented with regex to convert Order of the Phoenix to XML following your new schema, and it turned out well! I was worried this would be very difficult and wasn't ready to talk you through it during the meeting before I had a good idea of how to approach it. I think I do now!
Here's the new XML, pretty-printed! You can finish off the header element for this, and see if the structure makes sense: https://github.com/beealex/harryPotter/blob/master/xml/HP-OrderofthePhoenix.xml
I did not actually record my regex steps (sorry!), but if most of the plays follow the same format as Order of the Phoenix, this is generally how I approached it:
1) Look for where lines begin with a series of block capital letters. Use the "close-open" strategy to apply </sp>\n<sp> so every new start of a line with block caps gets tagged as a speech. (Move the extra </sp> from the top of the file to the end.)
2) Then look for the <sp> elements that begin with EXT. or INT. using <sp>([EXINT]+\.) and turn those into scene boundaries (use the "close-open" strategy again):
</scene>\n<scene>\1
(Move the extra </scene> from the top of the file to the end.)
3) Then use regex to tag the next portion as a stage direction with @cat and @where attributes.
4) After that we have reliable scene boundaries. That leaves two problems: stage directions are mixed up in speeches, and we need to add the @who attributes to the speeches. Since speeches nearly all put the speaker alone on a line, I used that pattern to isolate those and add @who attributes to the real speeches, so these now look like:
<sp who="#DUMBLEDORE">....</sp>
When the name has spaces or punctuation inside, I used regex to remove them. I was able to search for patterns of these, like MRS. WEASLEY, and edit those down like this:
<sp who="#MRSWEASLEY">....</sp>
5) Then we look at what's left: it's all <sp>....</sp> without @who attributes. There are various issues: some of these really are speeches (like NEARLY-HEADLESS NICK, which I missed because it had a hyphen inside). You can fix those with regex. Eventually you get to a point where everything left inside <sp>....</sp> without @who is a stage direction, so you can regex those.
6) Finally, there are a number of places where we see a short speech followed by a couple of returns, and then stage directions. These are all tucked inside speeches, but the stage directions start on new lines. I found lots of these by searching for text with no angle brackets that starts after a blank line and runs up to the close tag </sp>:
(\n\n[^<]+?$[^<]+?)(</sp>)
When I surveyed those, they were all long stage directions sitting inside speeches, so I replaced them with:
<stage>\1</stage>\2
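For anyone who wants to try step 6 outside the editor, here is a minimal Python sketch of that same find-and-replace. It assumes the search ran with `$` matching at line ends (hence `re.MULTILINE`); the function name `wrap_trailing_stage` and the sample speech are made up for illustration and are not from the actual script.

```python
import re

# Pattern from step 6. re.MULTILINE is an assumption: it makes `$` match at
# the end of each line, which the pattern needs in order to find anything.
STAGE_IN_SPEECH = re.compile(r"(\n\n[^<]+?$[^<]+?)(</sp>)", re.MULTILINE)


def wrap_trailing_stage(xml_text: str) -> str:
    """Wrap tag-free text that follows a blank line inside a speech in <stage>."""
    return STAGE_IN_SPEECH.sub(r"<stage>\1</stage>\2", xml_text)


# Hypothetical sample: a short speech, a blank line, then a stage direction.
sample = (
    '<sp who="#HARRY">HARRY\n'
    "A short line of dialogue.\n"
    "\n"
    "A longer stage direction that follows the speech.\n"
    "</sp>"
)

result = wrap_trailing_stage(sample)
print(result)

# A speech with no blank line inside is left alone.
untouched = wrap_trailing_stage("<sp>HARRY\nJust dialogue.\n</sp>")
```

The blank line and the trailing newline end up inside the new `<stage>` element, so pretty-printing afterwards is still worthwhile, and it is worth surveying the matches before replacing, as described above.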
I did some manual clean-up, added the <script> element around the stuff I'd regexed, pasted in the header from your XML model, associated the schema, and posted it here.
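Since I didn't record the exact steps, here is a rough Python sketch of the "close-open" strategy from step 1 and the @who tagging from step 4, in case it helps with the other files. The function names (`close_open`, `speaker_id`, `add_who`), the exact patterns, and the sample text are all assumptions for illustration, not what I actually typed; I also chose to keep the speaker's name inside the speech, which you may not want.

```python
import re


def close_open(text: str, line_start_pattern: str, tag: str) -> str:
    """Insert </tag>\\n<tag> before each line that starts with the pattern,
    then move the stray close tag from the top of the text to the end."""
    boundary = re.compile(rf"^(?={line_start_pattern})", re.MULTILINE)
    tagged = boundary.sub(lambda m: f"</{tag}>\n<{tag}>", text)
    stray = f"</{tag}>\n"
    if stray in tagged:
        # Drop the first (unmatched) close tag and append one at the end.
        tagged = tagged.replace(stray, "", 1).rstrip("\n") + f"\n</{tag}>"
    return tagged


# A speaker alone on the first line of a speech. Hyphens are allowed here so
# that names like NEARLY-HEADLESS NICK are caught too. Run this only after
# scene headings (EXT. / INT.) are tagged, or they would match as well.
SPEAKER_LINE = re.compile(r"<sp>([A-Z][A-Z .'-]*[A-Z])[ \t]*\n")


def speaker_id(name: str) -> str:
    """Strip spaces and punctuation: 'MRS. WEASLEY' becomes '#MRSWEASLEY'."""
    return "#" + re.sub(r"[^A-Z]", "", name)


def add_who(text: str) -> str:
    """Add a who attribute to every <sp> whose first line is a speaker name."""
    return SPEAKER_LINE.sub(
        lambda m: f'<sp who="{speaker_id(m.group(1))}">{m.group(1)}\n', text
    )


# Hypothetical raw text: speaker names in block caps, each on its own line.
raw = "HARRY\nA line of dialogue.\nMRS. WEASLEY\nAnother line.\n"

speeches = close_open(raw, r"[A-Z]{2,}", "sp")
tagged = add_who(speeches)
print(tagged)
```

The same `close_open` idea applies to the scene boundaries in step 2, but there the search replaces an existing `<sp>` rather than inserting at a line start, so the intermediate result is not well-formed until step 3 is done.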
@beealex Go through that file and see if you find things that need to be corrected...And if it looks good, I think we can get the other files regexed to this point, too!