echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.
GNU General Public License v3.0

Feature Request: Rate-independent audio search/match #59

Open smoores-dev opened 3 months ago

smoores-dev commented 3 months ago

Howdy! I'm the developer of Storyteller, which is a platform for automatically syncing ebooks and audiobooks. I'm interested in migrating to echogarden for the narration syncing, but I was wondering if anyone would be able to talk through some points with me, first.

I guess, first, some context. Because ebooks and audiobooks don't always have exactly the same content, and their contents aren't always in the same order (e.g. meta content, such as dedications, is often at the front of the ebook, but at the end of the audiobook), Storyteller has a multi-phase syncing process:

  1. Transcribe the entire book to text (using Whisper)
  2. For each text chapter, use fuzzy matching to find the place in the transcript where it starts
  3. Starting at that offset, use a custom windowed fuzzy matching algorithm to attempt to find the start and end timestamps of each sentence in the chapter (roughly sketched below).
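
To give a rough idea of what step 3 looks like, here's a very simplified sketch of the windowed fuzzy matching (this isn't the actual Storyteller code; the Levenshtein-based similarity and the window sizing are just placeholder choices, and a real implementation needs to be far more efficient than this brute-force scan):

```ts
// Simplified illustration of a windowed fuzzy search over a transcript.
// Not the actual Storyteller implementation: the similarity metric and
// window sizing are placeholders, and the brute-force scan is slow.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i, ...new Array<number>(b.length).fill(0)])

  for (let j = 1; j <= b.length; j++) dp[0][j] = j

  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      )
    }
  }

  return dp[a.length][b.length]
}

// Find the transcript word offset whose window of words best matches `sentence`.
function findBestWindow(transcriptWords: string[], sentence: string, startOffset = 0): number {
  const sentenceWords = sentence.split(/\s+/)
  const windowSize = sentenceWords.length

  let bestOffset = startOffset
  let bestDistance = Infinity

  for (let offset = startOffset; offset + windowSize <= transcriptWords.length; offset++) {
    const windowText = transcriptWords.slice(offset, offset + windowSize).join(' ')
    const distance = levenshtein(windowText.toLowerCase(), sentence.toLowerCase())

    if (distance < bestDistance) {
      bestDistance = distance
      bestOffset = offset
    }
  }

  return bestOffset
}
```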

In an ideal world, we would be able to simply treat the ebook text as the "transcript" and use echogarden's align function to find the start and end timestamps for each sentence, but I think the ordering issue mentioned above precludes this approach. I would still be happy to switch to echogarden for just the transcription step, but I figured I might reach out first and see if anyone had any ideas about how I might be able to use echogarden's alignment functionality to replace the entire sync process, including the chapter ordering step.

Either way, thank you for this project, it's going to be a lifesaver for me!

rotemdan commented 3 months ago

If the default DTW-based alignment engine is applied to the entire text and audio, then it's likely to get confused by areas of text that are not included in the audio at all (and vice versa).

The way it works is that it synthesizes the text using eSpeak and then takes the two audio sequences and tries to map, for every synthesized audio frame (encoded as its first 13 Mel Frequency Cepstral Coefficients), one or more source audio frames.

So, if some range of the synthesized audio doesn't have any equivalent in the source audio then it will still match it to a range of samples, although due to the dissimilarity it is likely that it would match it to a comparatively short time range, and then quickly advance to better matching ranges.

The Dynamic Time Warping algorithm is a form of sequence alignment algorithm. It is designed for temporal sequences that are assumed to have similar content, but with different rates, and where the rate can internally vary to become higher or lower (different parts of the time series are "stretched" or "contracted" relative to each other).

The algorithm is very fast since it doesn't use any speech recognition, or actually any machine-learning algorithm at all.
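
To make that concrete, here's a bare-bones sketch of the core DTW recurrence over two MFCC sequences (heavily simplified; a practical implementation would also use a search window, coarser granularities and other optimizations not shown here):

```ts
// Simplified sketch of DTW over two MFCC sequences (frames x coefficients).
type Frame = number[] // e.g. 13 MFCCs per frame

function euclideanDistance(a: Frame, b: Frame): number {
  let sum = 0
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i]
    sum += d * d
  }
  return Math.sqrt(sum)
}

// costMatrix[i][j] is the cost of the best alignment between
// reference[0..i] and source[0..j].
function computeAccumulatedCostMatrix(reference: Frame[], source: Frame[]): number[][] {
  const costMatrix = Array.from({ length: reference.length }, () =>
    new Array<number>(source.length).fill(Infinity))

  for (let i = 0; i < reference.length; i++) {
    for (let j = 0; j < source.length; j++) {
      const localCost = euclideanDistance(reference[i], source[j])

      if (i === 0 && j === 0) {
        costMatrix[i][j] = localCost
      } else {
        const fromLeft = j > 0 ? costMatrix[i][j - 1] : Infinity
        const fromUp = i > 0 ? costMatrix[i - 1][j] : Infinity
        const fromDiagonal = i > 0 && j > 0 ? costMatrix[i - 1][j - 1] : Infinity

        costMatrix[i][j] = localCost + Math.min(fromLeft, fromUp, fromDiagonal)
      }
    }
  }

  return costMatrix
}
```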

The other alignment engines, dtw-ra and whisper, also primarily work on the assumption that the text and audio contain the same content, although they do use speech recognition and are intended for more difficult audio where there are elements like loud background noises or music.

The whisper alignment engine (using a kind of forced decoding over the transcript) is very robust to music and other noises, and can be used for Karaoke alignment. It can actually skip ranges of non-speech (for example a short instrumental part), but when it encounters speech, it will nearly always match to it. Unlike DTW, it works sequentially and doesn't have a global understanding of the entire audio, beyond the 30 second chunks it works on.

What you are describing is not only an alignment problem, but also has a kind of search element to it. You want to isolate ranges of text and audio that match each other. This toolset doesn't currently have something that is designed to do exactly that.

If you can isolate individual chapters as contiguous ranges of text and audio, then you can apply the alignment to those ranges individually.
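
Roughly, something like this per chapter, assuming each chapter's audio can be isolated up front (this sketch simplifies the call and result shapes; see the documentation for the exact options and fields):

```ts
// Per-chapter alignment sketch. Assumes chapter audio ranges were already
// isolated; the call and result shapes here are simplified.
import { align } from 'echogarden'

interface Chapter {
  audioFilePath: string
  text: string
}

async function alignChapters(chapters: Chapter[], language: string) {
  const chapterTimelines = []

  for (const chapter of chapters) {
    // Align the chapter's own text against its own audio only, so content
    // that exists in one medium but not the other can't confuse the match
    const result = await align(chapter.audioFilePath, chapter.text, { language })

    chapterTimelines.push(result.timeline)
  }

  return chapterTimelines
}
```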

To find the time offsets of the different chapters, it may be possible to use other techniques, possibly even ones that don't use speech recognition at all.

We could synthesize the text, extract audio features for both the source audio and the synthesized audio, and then use a kind of scanning procedure to find the best matching offset for, say, a few seconds of the start and end of the synthesized segment for that part of the text.
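
As a deliberately naive sketch of that scanning idea (feature extraction omitted, and note that this frame-by-frame comparison applies no time warping at all, which is exactly the rate problem discussed next):

```ts
// Naive sketch: scan for the best matching offset of a short synthesized
// prefix within the source audio, using per-frame distances. Assumes MFCC
// frames were already extracted for both signals. No time warping is
// applied, so this is not rate-independent.
type Frame = number[]

function frameDistance(a: Frame, b: Frame): number {
  let sum = 0
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i]
    sum += d * d
  }
  return Math.sqrt(sum)
}

function findBestOffset(synthesizedPrefix: Frame[], sourceFrames: Frame[]) {
  let bestOffset = 0
  let bestCost = Infinity

  for (let offset = 0; offset + synthesizedPrefix.length <= sourceFrames.length; offset++) {
    let cost = 0

    for (let i = 0; i < synthesizedPrefix.length; i++) {
      cost += frameDistance(synthesizedPrefix[i], sourceFrames[offset + i])
    }

    cost /= synthesizedPrefix.length

    if (cost < bestCost) {
      bestCost = cost
      bestOffset = offset
    }
  }

  return { offset: bestOffset, cost: bestCost }
}
```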

The thing is that this search-oriented kind of audio feature matching also has to work independently of the rate of the speakers (real and synthesized), but because we are only matching prefixes, and not entire sequence ranges, DTW may not be the optimal approach.

It sounds more like something that would benefit from a kind of rate-independent acoustic fingerprinting. Basically, extract a fingerprint of the first, say, 10 or 20 seconds of the synthesized audio, and then sequentially scan the source audio to match that fingerprint to its features (or, better optimized for multiple searches: use some sort of approximate nearest neighbors matching of the fingerprints against a pre-computed "database" derived from the source audio).

I'm actually working on developing a form of acoustic fingerprinting method right now, but a non-typical one - not for audio or speech search (more like for audio compression, which is a very different thing).

Anyway, using acoustic fingerprints to match "landmark" locations in the synthesized audio to the source audio could be an alternative that is faster (possibly much faster) than using recognition. Then, applying the dtw alignment to each individually matched segment would also be much, much faster than using recognition, so overall it would produce an efficient, ML-free approach.

smoores-dev commented 3 months ago

Thank you so much for taking the time to think through this with me! Audio fingerprinting is a great idea for the search step, I'm going to start there and see if I can get that working.

rotemdan commented 3 months ago

I investigated a bit about acoustic fingerprints for speech and for varying rates. Seems like most fingerprinting methods are for music or sound effects, not specialized for speech, and are not rate independent.

During a conversation with Claude 3.5 Sonnet about this I had several ideas.

I suggested various approaches, like using "beam search" as an alternative to DTW (more like a hybrid of beam search and DTW-like matching). Later, it turned out that what I was going towards also resembles a variant of DTW called "subsequence DTW", which may be helpful for this kind of search task.

Basically, standard DTW always "anchors" the first frames of the two sequences to be matched together (represented by the top-left of the accumulated cost matrix), as well as the last frames of the two sequences to be matched together (bottom-right of the accumulated cost matrix).

[Figure: graphical illustration of the DTW algorithm]

However, we don't necessarily have to. Since the first stage of DTW produces an accumulative cost matrix, its last column contains the cost of all partial matches between the target sequence (the audio "snippet" we are trying to match) and the candidate audio segment (which may be longer or shorter than the snippet).

So if we sequentially scan the audio to try to match overlapping candidate segments, taking a duration of, say, 2x or 3x the duration of the snippet we are searching for (to allow slower speech to be matched), and compute the accumulative cost matrix between the snippet and each segment, then its last column would contain the costs of the best paths for all prefixes of the segment (the first frames are still anchored together, only the last ones aren't). The actual paths can be found by "backtracking" from the final cell to the start cell, the same way it's done in standard DTW (except that there the backtracking usually starts only from the bottom-right cell).
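
Here's a rough sketch of that open-ended matching for a single candidate segment (simplified; the length normalization at the end is just one possible way to keep very short prefixes from winning by default, and real code would also add step weights, windowing, backtracking of the full path, etc.):

```ts
// Sketch of a "subsequence"-style DTW match: the start frames of the snippet
// and the candidate segment are anchored together, but the end point within
// the segment is left open. Frames are assumed to be MFCC vectors.
type Frame = number[]

function frameDistance(a: Frame, b: Frame): number {
  let sum = 0
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i]
    sum += d * d
  }
  return Math.sqrt(sum)
}

// Returns the segment frame index where the full snippet best "ends", along
// with a length-normalized cost that can be compared across candidates.
function matchSnippetToSegmentPrefix(snippet: Frame[], segment: Frame[]) {
  const rows = segment.length
  const columns = snippet.length

  const cost = Array.from({ length: rows }, () => new Array<number>(columns).fill(Infinity))

  for (let i = 0; i < rows; i++) {
    for (let j = 0; j < columns; j++) {
      const localCost = frameDistance(segment[i], snippet[j])

      if (i === 0 && j === 0) {
        cost[i][j] = localCost
      } else {
        const fromUp = i > 0 ? cost[i - 1][j] : Infinity
        const fromLeft = j > 0 ? cost[i][j - 1] : Infinity
        const fromDiagonal = i > 0 && j > 0 ? cost[i - 1][j - 1] : Infinity

        cost[i][j] = localCost + Math.min(fromUp, fromLeft, fromDiagonal)
      }
    }
  }

  // Scan the last column: cost[i][columns - 1] is the best cost of matching
  // the whole snippet against the segment prefix ending at frame i.
  let bestEndIndex = 0
  let bestNormalizedCost = Infinity

  for (let i = 0; i < rows; i++) {
    const normalizedCost = cost[i][columns - 1] / (i + columns)

    if (normalizedCost < bestNormalizedCost) {
      bestNormalizedCost = normalizedCost
      bestEndIndex = i
    }
  }

  return { endFrameIndex: bestEndIndex, normalizedCost: bestNormalizedCost }
}
```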

I think this could become a general purpose audio search method, though at this point I don't exactly know how fast it's going to be. With low granularity for the MFCC features, and relatively short snippets, it may be fast enough to be practical.

Since I already have efficient code to compute the audio features and accumulative cost matrix, I could try to write an experimental audio search method reusing some of it. I'm currently interested in fingerprinting and matching in general, so this came at the right time.

For actual acoustic fingerprints, I had some other ideas, but they aren't going to be as accurate as this. They could maybe be used as "rough" initial screening methods if they have a low likelihood of false negatives (false positives are fine for initial screening).

smoores-dev commented 3 months ago

It's been a long time since I thought hard about audio processing/analysis (though music information retrieval was my first area of expertise!), so I don't know how much help I'm going to be in the actual planning or execution of such an implementation, but... this sounds fantastic! I would be endlessly grateful to you for putting the time into this. Please don't hesitate to reach out if there's any way that I can help, even if that just means testing some experiment or being a rubber duck!

smoores-dev commented 2 months ago

Hey @rotemdan! I wanted to check in on this; not to rush you, but just to see how it's going! It seems like a challenging problem haha. Do you still feel like you have a clear path forward?

rotemdan commented 2 months ago

I haven't gotten to this yet. I have other things I'm currently working on for the next release. For example, I'm working on adding machine translation support (Google Translate and the Facebook Research NLLB model) and hybrid speech-to-text translation that would add support for many target languages (recognize in the native language -> translate the text -> align the timeline to the translation). I'm not done with that yet.

The algorithm itself is pretty clear to me, though for practical implementation it requires dedicated code for the DTW variant it's using:

Say we have n frames in the snippet and m frames in the audio we are searching:

The overall number of iterations in the algorithm looks high on paper: for each of the roughly m candidate start offsets, we compute a cost matrix of about n * (n * scale) cells, giving O(m * (n * (n * scale))) = O(m * n^2 * scale). That makes it hard for me to predict how fast it would run on very long audio and snippets. It is parallelizable, though that's not something I'm doing right now.
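
To get a rough sense of the scale involved, with purely illustrative numbers (not echogarden's actual feature rate or defaults):

```ts
// Purely illustrative numbers: a feature rate of 25 frames/second,
// a 20-second snippet, a 10-hour audiobook, and scale = 3.
const frameRate = 25
const n = 20 * frameRate              // snippet frames: 500
const m = 10 * 3600 * frameRate       // source audio frames: 900,000
const scale = 3

const cellsPerCandidate = n * (n * scale) // 750,000 matrix cells per start offset
const totalCells = m * cellsPerCandidate  // 6.75e11 cells if every frame offset is a candidate

console.log({ cellsPerCandidate, totalCells })
```

Of course, these numbers change a lot depending on how densely the candidate start offsets are sampled.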

In practice, it's possible this algorithm could produce results that are good enough by themselves, or, for more accurate results, be used as a screening stage before applying a speech recognition engine to the top candidate audio ranges found. Using forced decoding (like I do in the whisper alignment engine), I could then extract probabilities from the recognizer to score each candidate (basically forcing Whisper to decode the tokens of the snippet and observing the probabilities it assigns to them).
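
A sketch of what that scoring stage could look like, with the forced-decoding step abstracted behind a caller-supplied function (none of the names here are existing echogarden or Whisper APIs):

```ts
// Sketch of ranking candidate audio ranges by how confidently a recognizer
// "accepts" the snippet's tokens under forced decoding. The forced-decoding
// step itself is abstracted away: `getForcedTokenLogProbs` is a placeholder
// supplied by the caller, not an existing API.
interface CandidateRange { startTime: number; endTime: number }

async function rankCandidates(
  candidates: CandidateRange[],
  snippetTokens: number[],
  getForcedTokenLogProbs: (range: CandidateRange, tokens: number[]) => Promise<number[]>,
) {
  const scored: { candidate: CandidateRange; score: number }[] = []

  for (const candidate of candidates) {
    // Force the recognizer to decode the snippet's tokens over this range
    // and collect the log-probability it assigns to each token.
    const tokenLogProbs = await getForcedTokenLogProbs(candidate, snippetTokens)

    // Average log-probability as a simple confidence score.
    const score = tokenLogProbs.reduce((sum, logProb) => sum + logProb, 0) / tokenLogProbs.length

    scored.push({ candidate, score })
  }

  // Higher average log-probability means a more plausible match.
  return scored.sort((a, b) => b.score - a.score)
}
```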

Getting all of this working and optimized would take some effort. Especially adding an extra recognition phase. I'm definitely interested to see how this would perform in practice, but I first must finish the other priorities.

smoores-dev commented 2 months ago

Thanks for all of that explanation! I completely understand that you have higher priority stuff to work on first, and that this is a big project!

I think my plan will be, in the meantime, to attempt to migrate over to echogarden, but continue using the current approach of text-based chapter search. It will still be a huge win to have support for languages other than English, and I suspect echogarden's alignment implementation is more performant than my current, homebrewed approach. This will make it a smaller step to move to audio-based search once it's available, too!

Thanks again for all your work and support here!