echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.
GNU General Public License v3.0

SMIL output format for the "alignment" command #42


Astorsoft commented 4 months ago

I've been trying to make an audiobook/EPUB alignment tool work on non-English content for some time without success. Tools like Storyteller, Aeneas or Syncabook exist, but they either support only English or end up with very subpar results.

I found Echogarden by pure chance and was amazed by how good the alignment ended up being from the first attempt on Swedish content. Unfortunately, as far as I'm aware, the only alignment output formats available are VTT and SRT. Any chance you could add the SMIL format to the list? That would allow people like me to leverage Echogarden to create the alignment and then glue everything together into an EPUB with media overlays through other apps like Syncabook. Moreover, it could encourage solutions like Storyteller to use Echogarden in their backend to get better multilingual support.

rotemdan commented 4 months ago

Yes, now I remember looking into SMIL and EPUB 3 in the past.

Seems like the links on the W3C site are mostly broken, including the tutorials. The last update to the standard was in 2008.

SMIL was originally aimed at producing interactive audio-visual presentations using XML (it isn't really supported in modern browsers).

EPUB 3 supports some sort of subset of the SMIL format to provide the synchronization between text items and the audio. This is the example I saw on this website:

<smil xmlns="http://www.w3.org/ns/SMIL"
      xmlns:epub="http://www.idpf.org/2007/ops"
      version="3.0">
   <body>
      <seq
           epub:textref="chapter_001.xhtml"
           epub:type="bodymatter chapter">

         <par>
            <text src="chapter_001.xhtml#c01h01"/>
            <audio
                   src="audio/c01.mp4"
                   clipBegin="0:00:00.000" 
                   clipEnd="0:00:05.250"/>
         </par>

         <par>
            <text src="chapter_001.xhtml#c01p0001"/>
            <audio
                   src="audio/c01.mp4"
                   clipBegin="0:00:05.250"
                   clipEnd="0:00:58.100"/>
         </par>

         <par>
            <text src="chapter_001.xhtml#c01p0002"/>
            <audio
                   src="audio/c01.mp4"
                   clipBegin="0:00:58.100"
                   clipEnd="0:02:04.000"/>
         </par>
      </seq>
   </body>
</smil>

The way it's structured, though, requires referencing the xhtml file and the identifier of the paragraph in the SMIL file.

In order to produce something like this, Echogarden may need to accept and parse the entire EPUB book, including all of its structure, and then process each part to produce an SMIL file for it. Each XHTML content document then needs to be linked to its SMIL overlay in the package manifest:

<item id="xchapter_001"
      href="chapter_001.xhtml"
      media-type="application/xhtml+xml"
      media-overlay="chapter_001_overlay"/>

<item id="chapter_001_overlay"
      href="chapter_001_overlay.smil"
      media-type="application/smil+xml"/>

On the Apple book asset guide, I found this example, which seems to have word-level boundaries:

<p>
    <span id="word0">Shall</span> 
    <span id="word1">I</span> 
    <span id="word2">compare</span> 
    <span id="word3">thee</span> 
    <span id="word4">to</span> 
    <span id="word5">a</span> 
    <span id="word6">summer's</span> 
    <span id="word7">day?</span>
</p>
<?xml version="1.0" encoding="UTF-8"?>
<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0"
    profile="http://www.idpf.org/epub/30/profile/content/">
    <body>
        <par id="par1">
            <text src="en.lproj/page1.xhtml#word0"/>
            <audio src="en.lproj/audio/page1.m4a" clipBegin="5s" clipEnd="15s"/>
        </par>
        <par id="par2">
            <text src="en.lproj/page1.xhtml#word2"/>
            <audio src="en.lproj/audio/page1.m4a" clipBegin="15s" clipEnd="25s"/>
        </par>
    </body>
</smil>

So it looks like there's a kind of "hack" where each word is wrapped in a span element with its own identifier, and each par element's text reference points to one of those spans to describe a single word.

It would require parsing the entire EPUB book with all of its files, and then effectively rewriting the entire markup with the added word tags.

That's a large amount of work. It's much more complex than outputting subtitles, since it requires modifying existing documents. I could generate this kind of output from a given text (say, a plain text file) without much difficulty, but working with an existing set of documents while preserving the existing markup makes it more challenging. (There are probably many sorts of edge cases that might occur.)
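For illustration, here's a rough sketch of what generating that kind of output from word-level alignment results could look like. The WordTiming shape, file names, and identifiers are only assumptions for this example, not an existing Echogarden API:

interface WordTiming {
  text: string
  startTime: number // seconds
  endTime: number   // seconds
}

// Format seconds as the H:MM:SS.mmm clock values used in the SMIL examples above.
function toClockValue(totalSeconds: number): string {
  const hours = Math.floor(totalSeconds / 3600)
  const minutes = Math.floor((totalSeconds % 3600) / 60)
  const seconds = (totalSeconds % 60).toFixed(3).padStart(6, '0')
  return `${hours}:${String(minutes).padStart(2, '0')}:${seconds}`
}

// Produce an XHTML fragment with one span per word, and a matching SMIL overlay
// whose par elements point back to those spans.
function buildWordLevelOverlay(words: WordTiming[], xhtmlFile: string, audioFile: string) {
  const spans = words
    .map((word, i) => `<span id="word${i}">${word.text}</span>`)
    .join(' ')

  const pars = words
    .map((word, i) =>
      `  <par id="par${i}">\n` +
      `    <text src="${xhtmlFile}#word${i}"/>\n` +
      `    <audio src="${audioFile}" clipBegin="${toClockValue(word.startTime)}" clipEnd="${toClockValue(word.endTime)}"/>\n` +
      `  </par>`)
    .join('\n')

  const smil =
    `<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">\n<body>\n${pars}\n</body>\n</smil>`

  return { xhtmlFragment: `<p>${spans}</p>`, smil }
}

Escaping of XML-special characters and splitting into chapters is left out here; the hard part, as mentioned, is doing this against existing documents rather than generated ones.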

I'll need to look further into what software supports media overlay in EPUB3, and implements the synchronization correctly, potentially up to the word level.

It would be great if there were well-made, polished, web-based readers I could test with. Unfortunately, last time I checked, the PC and web-based readers were not very high quality, but there may have been some improvement since then - I'll need to get more up to date about that.

Anyway, lately I'm not putting a lot of time into this project, so maybe some time in the future.

Astorsoft commented 4 months ago

Hi! First, I wanted to say thank you so much for the fast and detailed response, really appreciated. I understand new features might not be a priority at the moment. In case you want to look into it in the future, here is some extra information:

To summarize, a very good starting point would be to align an existing XHTML file with an audio file and create the resulting SMIL file. Text extraction, XHTML creation with an ID for each word, and creation of the new ebook would be a plus for user friendliness, but those are already rather easy to do using existing solutions such as Syncabook.

rotemdan commented 4 months ago

Thanks for all the details!

One question I'd ask is: who would this tool be intended for?

For the average user, it isn't really that easy to provide a seamless experience from an eBook and a bunch of audio files. There are many potential issues with things like transitions and music, various differences between the text and audio, very long audio durations, etc. Something like WhisperSync is very complex and probably required a full team and many human-hours to accomplish. It's definitely not the easiest way to approach this problem!

If someone is willing to put the effort to do a lot of manual work to prepare a set of text and audio files that match each other (say, for each chapter), this could be useful, but for the average user, I don't know if that would be practical.

Anyway, I do have a text-to-speech browser extension I haven't released yet. It connects to the local Echogarden server, which then processes and streams audio back to the browser. Here's an example (unmute to hear the speech):

https://github.com/echogarden-project/echogarden/assets/8589488/9b29f5de-36a2-42fa-aa0d-ab970228e143

(the voice in the video is EmmaNeural from Microsoft's Azure Cloud)

The amount of work that was done to make the word highlighting work correctly in 99% of cases was very large, and it still has issues with dynamically updated pages, like the output of chatbots such as ChatGPT, Claude, etc. They update the DOM many times while the answer is being written, which makes the highlighting even more difficult to achieve.

It doesn't actually modify the entire document, but uses DOM traversal to identify elements that are readable. The highlighting temporarily wraps a highlight tag over the part of the text node that is currently spoken (splitting the text node into several parts if needed), then removes the tag immediately and restores the exact previous DOM state.
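As a rough sketch of that wrap-and-restore technique (this is not the extension's actual code, just the general idea):

// Temporarily wrap the currently spoken character range of a text node in a
// highlight element, and return a function that restores the previous DOM state.
function highlightRange(textNode: Text, start: number, end: number): () => void {
  const spokenPart = textNode.splitText(start)
  const rest = spokenPart.splitText(end - start)

  const mark = document.createElement('mark')
  spokenPart.parentNode!.replaceChild(mark, spokenPart)
  mark.appendChild(spokenPart)

  return () => {
    // Unwrap the highlighted part and merge the split pieces back into one text node.
    mark.replaceWith(spokenPart)
    textNode.appendData(spokenPart.data + rest.data)
    spokenPart.remove()
    rest.remove()
  }
}

// Example: highlight characters 10..15 of a text node, then undo the change later.
// const undo = highlightRange(someTextNode, 10, 15)
// ... undo()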

It also supports auto-scrolling (not seen in the video), but that works only if the scrolling context is the document itself (it can't scroll nested scrollable elements).

Anyway, this same approach and existing code can be used with aligned text. The difference is that here we need to identify the correct part of the text to highlight, given the current audio position, rather than actively requesting synthesis for a particular segment.

This could actually work for any website and requires almost no preprocessing or manual work at all. The website could just specify an audio file, and the script would locate (by "guessing") the parts of the document that correspond to the spoken text.

Since Echogarden is capable of doing the alignment pretty fast (we could always lower the granularity to get it even faster), it may be practical to do the alignment on-demand - as long as the audio is not too long.
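For the playback side, the lookup itself is simple. Here's a sketch assuming a flat word timeline with start and end times in seconds (the exact shape of Echogarden's timeline output may differ):

interface TimelineEntry {
  text: string
  startTime: number // seconds
  endTime: number   // seconds
}

// Find the timeline entry covering the current audio position, using binary search
// so the lookup stays cheap even when run on every timeupdate event.
function findActiveEntry(timeline: TimelineEntry[], currentTime: number): TimelineEntry | undefined {
  let low = 0
  let high = timeline.length - 1

  while (low <= high) {
    const mid = (low + high) >> 1
    const entry = timeline[mid]

    if (currentTime < entry.startTime) {
      high = mid - 1
    } else if (currentTime >= entry.endTime) {
      low = mid + 1
    } else {
      return entry
    }
  }

  return undefined
}

// audioElement.addEventListener('timeupdate', () => {
//   const active = findActiveEntry(wordTimeline, audioElement.currentTime)
//   if (active) { /* highlight the part of the document matching active.text */ }
// })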

Also, technically, the alignment operation can be ported to the browser itself rather than running on a server, since the code is written only in JavaScript and WebAssembly. It would require an extra 10 MB - 20 MB download, though, to get all the code and WASM modules.

All of this can be done, but there's a lot of work associated with it, especially in porting some of Echogarden's modules to the browser.

Currently, providing on-demand synthesis is more important to me, since it's universally applicable to any textual content and requires no involvement from the website authors.

I definitely want to release this extension at some point. There are still a few missing features (no configuration options at all), and some small issues with the user experience.

Astorsoft commented 4 months ago

> If someone is willing to put the effort to do a lot of manual work to prepare a set of text and audio files that match each other (say, for each chapter), this could be useful, but for the average user, I don't know if that would be practical.

I agree, this requires some work and is definitely for power users only at the moment. But as with many things on the internet, the beauty is that if it's feasible for power users with reasonable documentation to do it, then they can share the result of their effort with less technical folks.

Projects like Storyteller, mentioned above, are actually trying to make it more accessible with a dead-simple web UI (provided you managed to self-host a Docker container :D). For the record, I also created an issue on their side to raise awareness of the amazing job you did with Echogarden and ask whether they could benefit from it somehow. I didn't bother raising tickets for other tools, as they seem unmaintained.

> The amount of work that was done to make the word highlighting work correctly in 99% of cases was very large, and it still has issues with dynamically updated pages, like the output of chatbots such as ChatGPT, Claude, etc.

From my perspective, small mismatches between audio and text are tolerable as long as the audio just continues in a never-ending flow; the brain makes it work. What is unbearable is when the audio stops, jumps, or repeats itself - that totally breaks immersion (such as what happens with the Reasily app on Android). In other words, as long as the timecodes in the SMIL are perfectly continuous, it's fine to have a few mistakes once in a while. I know that's why other solutions default to full-sentence syncing instead of word syncing - to keep a high tolerance for small misalignments.
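To make that concrete, the continuity requirement is just a small post-processing step over the aligned segments (the Segment shape here is only for illustration, not any particular tool's format):

interface Segment {
  startTime: number
  endTime: number
}

// Snap each segment's end to the next segment's start, so the resulting SMIL
// timecodes are perfectly continuous, with no gaps or overlaps between clips.
function makeContinuous(segments: Segment[]): Segment[] {
  return segments.map((segment, i) => ({
    ...segment,
    endTime: i + 1 < segments.length ? segments[i + 1].startTime : segment.endTime,
  }))
}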

rotemdan commented 4 months ago

The difficulty I described was in getting the word highlighting working correctly in the DOM within the browser extension; it wasn't about the accuracy of the timing.

For synthesized text, the timing is usually close to 100% accurate. In the example I gave, which talks to a cloud service (Azure) that already provides exact word timing, the accuracy of the word timing is 100%.

The extension makes requests to the local server, which can use any supported engine to synthesize the text. The local server is started by just running echogarden serve. There's nothing else involved. It doesn't require Docker or Python, only Node.js (which can be made portable - see other threads where this was done).

I looked at Thorium. I realize now it also uses TypeScript and JavaScript.

So, it's possible to either load Echogarden as a module (which would require working out how to integrate it with Electron.js - not easy - and would add hundreds of megabytes to the installer size), or have the server launched as a separate process and make requests to it to provide additional text-to-speech functionality. (There is already ready-made TypeScript client code for making requests to the server.)
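As a sketch of the separate-process option: only the echogarden serve command comes from the discussion above; the rest is an assumption, and the actual requests would go through the existing TypeScript client code.

import { spawn } from 'node:child_process'

// Launch the local Echogarden server as a child process of the reader application.
// Requests would then be made to it using the ready-made TypeScript client code.
function startEchogardenServer() {
  const server = spawn('echogarden', ['serve'], {
    stdio: 'inherit',                    // forward server logs to the host process
    shell: process.platform === 'win32', // resolve the npm .cmd shim on Windows
  })

  server.on('exit', (code) => {
    console.log(`Echogarden server exited with code ${code}`)
  })

  return server
}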

Text-to-audiobook alignment can technically also be done on-demand by communicating with the Echogarden server, and maybe cached later. The challenge would be to split the audio into several parts, by chapters, if needed, and match the chapters to the resulting audio parts.

Also, I just realized that most commercial audiobooks, like the ones from Audible, use some sort of DRM, which adds even more layers of manual effort to prepare. Syncing through WhisperSync and a service like Audible involves a lot of complexity: overengineered, complicated solutions used mainly to avoid legal issues, which compromise the user experience along the way.

The work I'm doing on the text-to-speech browser extension (currently supporting all Chromium browsers, maybe Firefox later), which is also available as a standard Web Speech Synthesis voice, would potentially have a larger impact and accessibility in practice, since it integrates directly into the web browser. Though I'm not sure how many people would make the effort of locally downloading and installing an Echogarden server (so a cloud-based service may be needed in addition).

Astorsoft commented 4 months ago

On-demand TTS or audio alignment would be an interesting feature on PC, but I think most people end up reading their ebooks or listening to audiobooks while not sitting at their desk, and usually rely on their smartphone (or, at best, e-ink readers, but those never really reached the mass market). I wish there were an app as good as Thorium on mobile, but the closest I've found so far is Colibrio.

And yes, true, audiobooks and ebooks coming from commercial platforms have DRM, which I think is despicable. Those DRMs can easily be broken using Calibre plugins or tools like Libation for Audible, but that's another story. Not only that, but Amazon (and publishers) get to decide what is WhisperSync compatible and what isn't, and some publishers seem to refuse to support it for unclear reasons (like the Harry Potter series, which is a recurring complaint on Reddit) or even drop support without prior notice.

Supporting those edge cases is probably too much, though; it's up to the user to provide DRM-free content. To be fair, I'm pretty sure you can't use apps like Thorium on DRM-protected ebooks anyway, so if the ebook is already outside the Kindle app and the audiobook outside Audible, that means the DRM is no longer an issue.

Astorsoft commented 4 months ago

> Anyway, I do have a text-to-speech browser extension I haven't released yet. It connects to the local Echogarden server, which then processes and streams audio back to the browser. Here's an example (unmute to hear the speech)

Finally got the chance to watch the video. The quality of the TTS is amazing compared to stuff like eSpeak or the built-in OS TTS. Actual audiobooks will always be better than TTS, as the readers are usually actors who put a lot of effort into changing their voice across characters or reflecting their current emotional state. For webpages or content without a professional audio counterpart, however, this is incredibly good.

rotemdan commented 4 months ago

The same voices and quality are available in the Microsoft Edge browser, without any limits (Echogarden actually supports connecting to the Edge cloud service, but it requires knowing a "Trusted Client Token" that I can't include in the source code). The synthesis quality isn't considered the most state-of-the-art compared to, say, ElevenLabs (also supported by Echogarden), but it is very good and usable.

The Microsoft speech service still makes some pronunciation errors for ambiguous words, like it did with 'read'. (Actually, the heteronym resolution I added to Echogarden for American English is sometimes better than Microsoft's own and produces more accurate pronunciations for some words - though I can't apply it to cloud voices, only to some local ones.)

Anyway, I don't know how many pairs of non-DRM eBooks and non-DRM audiobooks are available (legally).

There are Project Gutenberg and LibriVox; the latter has close to 20,000 public domain audiobooks. These are old books, though, that not many people want to read.

Other than that, copyrighted books and audiobooks present complex issues which make it difficult to give the user a streamlined experience. Apparently it's even difficult for a company as large as Amazon, which owns both Kindle and Audible.

As it turns out, since synthesis is improving toward a human-like level, and is easier and more accessible to integrate in a streamlined way, it sort of becomes the "default" for this kind of synchronized experience. It's of course nicer to have human-read audiobooks, but the availability and growing quality of text-to-speech may make them more of an expensive niche.

I have to prioritize things, since I can only put a small amount of total effort into this.

Currently, synthesis is where I'm concentrating most (that's where Echogarden started). It turned out, though, that the transcript alignment functionality I added, which is very similar to the algorithm used internally by Aeneas (dynamic time warping with an eSpeak voice as the reference), evolved into being good enough to reach word-level accuracy in many languages. That was mostly the result of a lot of hand tuning, additional enhancements, and trial-and-error experimentation. There were a lot of small details to get right, like the right length of pauses between sentences and paragraphs, and working around terrible eSpeak bugs.
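For reference, the core of that approach is textbook dynamic time warping over two feature sequences (for example, a synthesized eSpeak reference against the real recording). This is only an illustration of the general technique, not Echogarden's actual implementation:

// Accumulated-cost matrix for dynamic time warping between two feature sequences.
// cost[i][j] is the minimal total distance aligning a[0..i) with b[0..j).
function dtwCost(
  a: number[][],
  b: number[][],
  distance: (x: number[], y: number[]) => number
): number {
  const n = a.length
  const m = b.length
  const cost = Array.from({ length: n + 1 }, () => new Array<number>(m + 1).fill(Infinity))
  cost[0][0] = 0

  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const d = distance(a[i - 1], b[j - 1])
      cost[i][j] = d + Math.min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    }
  }

  return cost[n][m]
}

// Example frame distance: Euclidean distance between two feature vectors (e.g. MFCC frames).
const euclideanDistance = (x: number[], y: number[]) =>
  Math.sqrt(x.reduce((sum, xi, k) => sum + (xi - y[k]) ** 2, 0))

Recovering actual word timestamps would additionally require backtracking through the cost matrix to get the warping path and mapping reference frames back to word boundaries.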

I'm aware this can be used to add synchronized audio to EPUB 3 eBooks, but I would prefer the results to be more reusable and freely available - say, a large-scale effort to use LibriVox recordings to add synchronized audio to Project Gutenberg books. Unfortunately, that's more of a crowdsourcing project, since parts of it can't really be automated.

I will look into it, but at the pace things are going, it could take a lot of time. I don't know for sure.