NaNoGenMo / 2023

National Novel Generation Month, 2023 edition.

The Bottish Play #27

Open greg-kennedy opened 8 months ago

greg-kennedy commented 8 months ago

The Bottish Play

A computer speech audio production of Shakespeare's "Macbeth"

Listen to the Final Production

Listen to the Encore

View the Repository

The Write-Up

While trying to generate a lot of junk audio to consume bandwidth against someone's NFT project, I once again fell into the rabbit hole of computer text-to-speech synthesis. War and Peace can easily be turned into a 47-hour read-a-thon, but upon reviewing it, I found a lot of quality issues that I thought I could fix up. "I need to take off the Gutenberg header. And the Unicode characters aren't being parsed correctly. How about trying to find quotes, and maybe change the voice there? In fact, I could start with a play, which already has speakers clearly defined..." Classic mistake: now I've tricked myself into a project.

This year, I'm working on an audio book. NaNoGenMo has had audiobook entries before (see Scam Likely and Gaimidian Graveyard), as well as NaOpGenMo, but this one is my take on William Shakespeare's play Macbeth. The goal is to use TTS software to turn Macbeth into a listenable audio version.

The first thing to do is to get a copy of Macbeth. I did begin looking at Gutenberg and other online libraries, but the idea of parsing out the text format to assign speeches to speakers seemed annoying. That's when I ran across the Folger Library's XML version. "Aha!" I thought. "They have hopefully already done the work for me".

TEI Format

It turns out their version follows the Text Encoding Initiative guidelines. This is an XML-based markup system for capturing text as it's printed, including page breaks, metadata, annotations, and much more. The idea is noble, in that it should be able to cover pretty much any kind of printed text, and indeed TEI formatted files are available in a lot of places. I even ran across TEI annotated newspaper clippings of 19th century London ghost stories!

Unfortunately, TEI does seem to be very application-specific in practice. The Folger texts have XML elements for every word, space, punctuation character, line break, sound effect, stage direction, speech, etc. They can be nested within one another as well. Quotation marks are a block, as are song titles, names, foreign words... I'm sure this is great for scholars or something, but I just ended up hacking together a parser that works on Macbeth and I will be extremely lucky if it works on any other Folger play, let alone a generic document.

<sp xml:id="sp-0001" who="#WITCHES.1_Mac">
   <speaker xml:id="spk-0001">
       <w xml:id="w0000200">FIRST</w>
       <c xml:id="c0000210"> </c>
       <w xml:id="w0000220">WITCH</w>
   </speaker>
   <ab xml:id="ab-0001">
       <lb xml:id="lb-00005"/>
       <milestone unit="ftln" xml:id="ftln-0001" n="1.1.1" ana="#verse" corresp="#w0000230 #c0000240 #w0000250 #c0000260 #w0000270 #c0000280 #w0000290 #c0000300 #w0000310 #c0000320 #w0000330 #p0000340"/>
       <w xml:id="w0000230" n="1.1.1">When</w>
       <c xml:id="c0000240" n="1.1.1"> </c>
       <w xml:id="w0000250" n="1.1.1">shall</w>
       <c xml:id="c0000260" n="1.1.1"> </c>
       <w xml:id="w0000270" n="1.1.1">we</w>
       <c xml:id="c0000280" n="1.1.1"> </c>
       <w xml:id="w0000290" n="1.1.1">three</w>
       <c xml:id="c0000300" n="1.1.1"> </c>
       <w xml:id="w0000310" n="1.1.1">meet</w>
       <c xml:id="c0000320" n="1.1.1"> </c>
       <w xml:id="w0000330" n="1.1.1">again</w>
       <pc xml:id="p0000340" n="1.1.1">?</pc>
       <lb xml:id="lb-00010"/>
       <milestone unit="ftln" xml:id="ftln-0002" n="1.1.2" ana="#verse" corresp="#w0000350 #c0000360 #w0000370 #p0000380 #c0000390 #w0000400 #p0000410 #c0000420 #w0000430 #c0000440 #w0000450 #c0000460 #w0000470 #p0000480"/>
       <w xml:id="w0000350" n="1.1.2">In</w>
       <c xml:id="c0000360" n="1.1.2"> </c>
       <w xml:id="w0000370" n="1.1.2">thunder</w>
       <pc xml:id="p0000380" n="1.1.2">,</pc>
       <c xml:id="c0000390" n="1.1.2"> </c>
       <w xml:id="w0000400" n="1.1.2">lightning</w>
       <pc xml:id="p0000410" n="1.1.2">,</pc>
       <c xml:id="c0000420" n="1.1.2"> </c>
       <w xml:id="w0000430" n="1.1.2">or</w>
       <c xml:id="c0000440" n="1.1.2"> </c>
       <w xml:id="w0000450" n="1.1.2">in</w>
       <c xml:id="c0000460" n="1.1.2"> </c>
       <w xml:id="w0000470" n="1.1.2">rain</w>
       <pc xml:id="p0000480" n="1.1.2">?</pc>
   </ab>
</sp>

That said, I did think the XML files were pretty neat, mainly for their metadata section: at the top are editor's notes, a detailed description of the format, information about printed editions, corrections, and even a detailed character list - which includes character names, relationships to each other, gender, groupings (the three murderers are in a Murderers group), even the point in the text where the character dies! There are also "milestone" indicators which classify lines as "verse" or "prose" - if I wanted to, I could use this to adjust the speech emphasis.
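As a rough illustration of the flattening involved, here is a minimal Python sketch (not the actual tei2ssml.pl, which is written in Perl) that turns one <sp> element like the sample above into a speaker ID, a speaker label, and a plain line of text. The helper names local, flatten, and speech are made up for this example, and the sample is trimmed to leave out most of the xml:id attributes; real Folger files declare a TEI namespace and nest more element types than this handles.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample speech above (most xml:id attributes omitted).
SAMPLE = """<sp xml:id="sp-0001" who="#WITCHES.1_Mac">
  <speaker><w>FIRST</w><c> </c><w>WITCH</w></speaker>
  <ab>
    <lb/><w>When</w><c> </c><w>shall</w><c> </c><w>we</w><c> </c><w>three</w><c> </c><w>meet</w><c> </c><w>again</w><pc>?</pc>
    <lb/><w>In</w><c> </c><w>thunder</w><pc>,</pc><c> </c><w>lightning</w><pc>,</pc><c> </c><w>or</w><c> </c><w>in</w><c> </c><w>rain</w><pc>?</pc>
  </ab>
</sp>"""


def local(tag):
    """Drop any '{namespace}' prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def flatten(elem):
    """Join word, space, and punctuation elements; treat <lb/> as a space."""
    parts = []
    for node in elem.iter():
        name = local(node.tag)
        if name in ("w", "c", "pc"):
            parts.append(node.text or "")
        elif name == "lb":
            parts.append(" ")
    # Collapse runs of whitespace and trim the ends.
    return " ".join("".join(parts).split())


def speech(sp):
    """Return (who, speaker label, spoken text) for one <sp> element."""
    who = sp.get("who", "").lstrip("#")
    speaker = text = ""
    for child in sp:
        if local(child.tag) == "speaker":
            speaker = flatten(child)
        elif local(child.tag) == "ab":
            text = flatten(child)
    return who, speaker, text


who, speaker, text = speech(ET.fromstring(SAMPLE))
print(who, "|", speaker, "|", text)
```

Stage directions, songs, and nested quotations would each need their own handling, which is where the "works on Macbeth" hacks come in.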

Anyway, with that done I need an output format.

SSML Format

The solution comes in the form of Speech Synthesis Markup Language, another XML flavor, but this one designed to feed into text-to-speech systems. It is a W3C standard, implemented now by a number of cloud-based TTS systems like Polly, Azure, Google Speech, etc. It also works with Windows' on-device Speech API, both for desktops and mobile devices.

Again, despite being a "standard", SSML has a lot of vendor-specific support and/or extension. The broad outline is the same: a <speak> element, containing <voice> definitions, a <p> and <s> for paragraph / sentence, <break> to add pauses, as well as inline hints such as <say-as> (to tell the synthesizer to read digits of a number instead of a whole numeral), <emphasis> to mark inflection and volume changes, and <prosody> for general speech effects. You can even add <audio> to introduce a prerecorded sound file, or speak IPA phonemes directly! It's really quite flexible.

Still, a vendor may add attributes or features that don't work on other platforms. Azure TTS, for example, lets you add a "speaking style" (angry, newscast, whispering, sports_commentary) which does not work anywhere else. SAPI 5.3 on Windows is much more limited. It does, at least, support voice changes.

In fact, I found a Microsoft blog post where they go through carefully tagging the introductory scene of Macbeth for better replay.

<speak
version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
   <p>
       <s>When shall we three meet again
           <break/>
           <prosody rate="slow">in
               <emphasis level="moderate">thunder,</emphasis>
               <break time="200ms"/>lightning,
               <emphasis level="reduced">
                   <break time="200ms"/>or in rain?
               </emphasis>
           </prosody>
       </s>
       <s>When the
           <emphasis level="strong">hurlyburly’s</emphasis> done
       </s>
       <s>
           <break time="500ms"/>When the battle’s 
           <emphasis level="moderate">lost</emphasis> and won
       </s>
       <s>
           <break time="500ms"/>That will be ere 
           <break time="200ms"/>set of sun
       </s>
   </p>
   <p>
       <s>
           <break time="500ms"/>Where the place?
           <break time="250ms"/>
       </s>
       <s>
           <emphasis level="reduced">Upon the heath
               <break time="1s"/>
           </emphasis>
       </s>
       <s>There to meet
           <break time="500ms"/>with 
           <emphasis level="strong">Macbeth</emphasis>
       </s>
   </p>
</speak>

Pretty cool! Also, way more time-consuming than I want this to be. We'll stick with the defaults.

The Program

The core of this entry's first draft, then, is a tool to translate the Macbeth TEI into Microsoft-compatible SSML. You can find it in the repository (https://github.com/greg-kennedy/The-Bottish-Play), in the file tei2ssml.pl.

There is one additional .xml file needed to complete the translation: a voices.xml file, which maps the play's speakers to attributes of the SSML <voice> tag - in effect, casting the characters. I have provided one that uses all the available English voice packs for Windows 10, as well as narration by Microsoft Eva - a "hidden" TTS voice which is an early version of Cortana, but can be enabled again using some registry tweaks. It's a quick mapping and doesn't quite work: you can't make a "male" voice speak with "female" affect, so the genders get swapped (though this works on other systems). I also think the "age" attribute doesn't function, though it should accept values of 10, 15, 30, or 65 according to the documentation. But it's OK for a first draft.
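To make the casting idea concrete, here is a minimal Python sketch of the last step: wrapping each speech in a <voice> element chosen from a speaker-to-voice table. It is not the real tei2ssml.pl, and the CAST dictionary is a hypothetical stand-in for voices.xml, whose actual layout is not shown here. The speaker IDs and voice names are examples only and would need to match the play file and the voices installed on the machine.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Hypothetical casting table standing in for voices.xml.
CAST = {
    "WITCHES.1_Mac": "Microsoft Zira Desktop",
    "Macbeth_Mac": "Microsoft David Desktop",
}
DEFAULT_VOICE = "Microsoft David Desktop"  # assumed fallback for uncast speakers


def build_ssml(speeches, cast=CAST):
    """Build an SSML document from (speaker id, text) pairs."""
    speak = ET.Element(
        "speak",
        {"version": "1.0", "xmlns": SSML_NS, "xml:lang": "en-US"},
    )
    for who, text in speeches:
        # One <voice> per speech; the name attribute selects the TTS voice.
        voice = ET.SubElement(speak, "voice", {"name": cast.get(who, DEFAULT_VOICE)})
        paragraph = ET.SubElement(voice, "p")
        paragraph.text = text  # ElementTree escapes &, <, > on output
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml([
    ("WITCHES.1_Mac", "When shall we three meet again?"),
    ("Macbeth_Mac", "So foul and fair a day I have not seen."),
])
print(ssml)
```

Selecting by name sidesteps the gender and age attributes entirely, at the cost of tying the output to one machine's installed voices.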

Performing the SSML

Once I have an output XML file, a short PowerShell sequence will invoke the TTS engine and write the result to an output .wav file.

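# Load the .NET speech assembly and create a synthesizer
Add-Type -AssemblyName System.Speech
$Speech = New-Object System.Speech.Synthesis.SpeechSynthesizer
# Read the whole SSML document as a single string
$Text = Get-Content -Path "out.xml" -Raw
$Speech.SetOutputToWaveFile("output.wav")
$Speech.SpeakSsml($Text)
# Release the synthesizer so output.wav is flushed and closed
$Speech.Dispose()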

Et voilà! We have a fully spoken play, with separately voiced characters! And you can listen to it here:

https://youtu.be/NyBwvhex4dk

(PowerShell can also speak plain text instead of SSML, with $Speech.Speak("Hello, world!") - useful in a pinch if you need some audio and have no other tools available.)

Epilogue

I like Macbeth the most of Shakespeare's plays, not least because it is a lot shorter than Hamlet and the rest. That's a major drawback for NaNoGenMo, though, where 50,000 words are needed to clear the bar. I consider the SSML file to be the "novel", in that it is essentially a "script" as one would use for a play, except formatted for computers instead of humans. Even so, the online word counter clocks only 26,199 words.

There is, of course, only one solution: at the end of the play, the cast goes on for an encore performance of "Cats" :P

That said, the month is barely half finished - and I have ideas of how to continue this further! Stay tuned...

greg-kennedy commented 7 months ago

Click to watch: The Robot Community Theatre Presents: "Macbeth"

What?

Despite proving that Macbeth can be turned into an audiobook, I find the result lacking: Microsoft's built-in voices don't cover most of the characters, and they're a little monotonous. Really, there's nothing to the final construction other than concatenating spoken phrases together. Why should they all come from this one particular speech synth?

The process is "simple": take the cast members, pick a speech synthesizer for each one (modern, retro, weird, whatever), record their lines using that synth, and assemble the recordings together.

Welcome to my personal hell!

The Cast

The last speaker in the text... is me! I recorded myself saying ~200 words of stage directions, then used sox to put the words back together into composite phrases for all direction and act / scene announcements. This is also a common technique for phone menus, etc. A "talking clock" can be made by just recording a handful of words and triggering them in the right order.
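The word-splicing step can be sketched in a few lines of Python. This only builds the sox command line, relying on the fact that sox concatenates multiple input files into the output file by default. The words/ directory layout and the function name are assumptions for the example, not how the repository is actually organized.

```python
from pathlib import PurePosixPath


def sox_concat_command(phrase, word_dir="words", out_path="phrase.wav"):
    """Return a sox argv that joins one recording per word, in order.

    Assumes each word was recorded to <word_dir>/<word>.wav (hypothetical layout).
    """
    inputs = []
    for word in phrase.split():
        # Normalize "Macbeth," and "macbeth" to the same file name.
        stem = word.strip(".,;:!?").lower()
        inputs.append(str(PurePosixPath(word_dir) / f"{stem}.wav"))
    # sox treats every file before the last as input and the last as output.
    return ["sox", *inputs, out_path]


command = sox_concat_command("Enter Lady Macbeth.", out_path="direction-001.wav")
print(" ".join(command))
```

Passing the returned list to subprocess.run would perform the actual splice, provided sox is installed and every word file exists.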

greg-kennedy commented 7 months ago

More details

A big part of the last couple days' effort was listening to the whole play end-to-end and trying to clean up common mispronunciations. Most of the speakers stumbled over Shakespeare's (mis)use of apostrophes, as in 't for "it" (which readers will say as "TEE") or i' for "in", 's for "his", etc. Some regex and re-runs got most of the issues shaken out, but a few snuck in... Witch Three says "Listen but not speak to tee" about the spirits, and towards the end Macduff reads "That way the noise is. Tyrant, show thy face!" as "That way the noise Island Tyrant, ..." which is a very funny choice. I find the older pronunciation quirks charming, especially Duncan talking about "nobleness" (knob LEE ness). Pronunciation dictionaries in newer software mostly sort this out, but homographs still throw them off sometimes ("he lives here" vs "their lives are cut short"). But that's true of me too...! I read the word list off and pronounced "sewer" as in "the underground pipe system", when the text actually means "a person who sews (clothing)", whoops.
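For a sense of what that regex cleanup might look like, here is a small Python sketch that expands two of the elisions mentioned above before the text reaches a synthesizer. These are not the patterns actually used in the project, just plausible ones. The 's for "his" case is left out on purpose, since it is hard to tell apart from an ordinary possessive.

```python
import re

# Match either a straight or a curly apostrophe.
APOS = "['\u2019]"

# Illustrative rules only; the real cleanup needed more cases and re-runs.
RULES = [
    # Standalone 't (as in "to 't") becomes "it"; "don't" and "'tis" are untouched.
    (re.compile(rf"(?<![A-Za-z]){APOS}t\b"), "it"),
    # Standalone i' (as in "i' the name") becomes "in"; "i'll" is untouched.
    (re.compile(rf"\bi{APOS}(?!\w)"), "in"),
]


def expand_elisions(text):
    """Apply each rewrite rule in order and return the cleaned text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(expand_elisions("Listen, but speak not to \u2019t."))
print(expand_elisions("i' the name of Beelzebub"))
```

Rules like these are easy to over-apply, which is why listening to the full output afterwards was still necessary.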

Sound effects are provided by a selection from Windows 95 "Plus!" themes (Musica, Leonardo da Vinci, etc.) as well as some music from Interplay's "Castles" and Maxis's "SimCity 2000". This is the weakest part of the project, I think: I wish that I had more time to come up with something more fitting here. It would have been great to have e.g. Hatsune Miku sing for the musical cues, but I did run out of time, and the sound portions are not the most important ones anyway.

For the video, I took screenshots of the Folger Library "Macbeth" and then wrote a Processing sketch to scroll through them. Each act/scene is synchronized, such that the start and end of the scroll match the start and end of the scene's audio. What happens in between could be anything - the scene where the Witches summon the Spirits runs off the page for a bit, until some of the longer-winded speakers take long enough to bring it back on the screen.
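One simple way to get that behavior is plain linear interpolation between the scene's first and last pixel rows, sketched here in Python rather than Processing. The function and parameter names are invented for the example, and the real sketch may compute the position differently.

```python
def scroll_y(t, t_start, t_end, y_start, y_end):
    """Scroll offset at time t for a scene that plays from t_start to t_end.

    The offset moves at constant speed from y_start to y_end, so only the
    two endpoints are guaranteed to line up with the audio.
    """
    if t_end <= t_start:
        return float(y_start)
    # Fraction of the scene elapsed, clamped to the range 0..1.
    fraction = min(max((t - t_start) / (t_end - t_start), 0.0), 1.0)
    return y_start + fraction * (y_end - y_start)


# A 10-second scene occupying pixel rows 100 to 300 of the page image.
print(scroll_y(5, 0, 10, 100, 300))
```

Because the speed is constant within a scene, a fast speaker gets ahead of the page and a slow one falls behind, which matches the drift described above.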

Encore! Encore!!

Placing every spoken word in the play into a file (including introduction, credits, and stage directions) and running wc script.txt gives a paltry 17,827 words. As before, the cast is happy to take up the slack by putting on a special production of Cats. Rather than spend 2 hours on individual meows, they've worked on improving their efficiency. All 43 cast members will individually say "meow" as fast as they can, simultaneously, which clears the remaining 32,173 words in a mere eight minutes.
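The encore arithmetic checks out, as this short Python calculation shows. The per-voice rate assumes the meows are split evenly across the cast, which is an assumption of the example rather than something stated above.

```python
import math

TARGET_WORDS = 50_000      # NaNoGenMo minimum
SCRIPT_WORDS = 17_827      # word count reported for script.txt
CAST_SIZE = 43             # voices meowing simultaneously
ENCORE_SECONDS = 8 * 60    # "a mere eight minutes"

remaining = TARGET_WORDS - SCRIPT_WORDS
# Assumes an even split; round up so the total reaches the target.
per_voice = math.ceil(remaining / CAST_SIZE)
rate = per_voice / ENCORE_SECONDS

print(remaining, per_voice, round(rate, 2))
```

That works out to roughly one and a half meows per second from each cast member.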

Epilogue

This project was fun, and a huge success! I learned a lot about text-to-speech software! If I never hear a robot speak again, it will be too soon! Augh!!

bibliotechy commented 7 months ago

I love this so much!