echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.

Error on Speech-to-transcript alignment for large audio files #51

Open ezekiel747 opened 5 months ago

ezekiel747 commented 5 months ago

Hello, I'm trying to use the speech-to-transcript alignment on large audio files (6+ hours in duration), and I'm getting the error below. I tried with two different files (one almost 6h, the other 9.1h).

I wasn't able to find this exact error anywhere else, which is quite strange. I'm on a MacBook M1 (latest stable macOS), with ffmpeg 6.1.1. I can provide the files I used, if needed.

    echogarden align audio.mp3 text.txt align.srt
    Echogarden v1.3.0

    Transcode with command-line ffmpeg.. 26802.6ms
    Error: Attempt to read a freed WASM typed array reference.

I tried manually with smaller chunks of audio & text (cut from the same files) and it seems to work. However, I can't do this manually for all the files. How can I debug and work around this issue?

Thanks in advance! Great work btw!

rotemdan commented 5 months ago

Based on the log messages, it's likely the downsampling is failing due to the large size of the audio in memory.

The downsampling is done via the Speex resampler WebAssembly module.

The current standard version of WebAssembly doesn't support arrays that are about 2GB / 4GB or more (I'm not sure what the exact limit is). WASM64 will remove that limit, but it's still just a proposal and isn't deployed in any runtime (without a flag). To perform the downsampling, the entire audio is passed to the WebAssembly module, so it looks like it may be failing to allocate it.

I could try to pass it in chunks. I think the Speex resampling library supports that. I'll look into it.

For now, you can try downsampling to a mono 16kHz wave file using another tool like ffmpeg or, say, foobar2000, and then using the downsampled wave file as input instead. If it detects this format, then no downsampling should be performed.
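For example, with ffmpeg (substitute your own file names):

    ffmpeg -i audio.mp3 -ac 1 -ar 16000 audio.wav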

I don't think that, other than the downsampling part, alignment and recognition operations require passing the entire audio to WebAssembly.

There may still be a limit of 2^32 elements for the Float32Array audio sample data, but that's about 16GB of memory (4 bytes per element).

rotemdan commented 5 months ago

I also now noticed that I didn't compile speex-resampler to WASM with the `-s MAXIMUM_MEMORY=4GB` flag, like I do with other WASM libraries, like fvad, so the limit may be set to some lower default value. I'm not sure what it is.
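For reference, the flag would be passed to Emscripten roughly like this (a hypothetical build command; the actual build script and file names differ):

    emcc speex_resampler.c -O3 \
        -s ALLOW_MEMORY_GROWTH=1 \
        -s MAXIMUM_MEMORY=4GB \
        -o speex-resampler.js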

ezekiel747 commented 5 months ago

Thank you for your quick answer. As you suggested, I've downsampled the audio file to a 16kHz wav. Now I get past that initial step, but encounter the same error message at a later step.

    echogarden align --language=en --debug audio.wav transcript.txt align.srt align.json
    Echogarden v1.3.0

    Transcode with command-line ffmpeg.. 6619.4ms
    Crop using voice activity detection.. 42863.8ms
    Prepare for alignment.. Language specified: English (en)
    7057.6ms
    Load alignment module.. 0.6ms
    Create alignment reference with eSpeak.. Error: Attempt to read a freed WASM typed array reference.
        at Float32ArrayRef.assertNotFreed (file:///.../node_modules/echogarden/dist/utilities/WasmMemoryManager.js:346:19)
        at get view [as view] (file:///.../node_modules/echogarden/dist/utilities/WasmMemoryManager.js:323:14)
        at Float32ArrayRef.clear (file:///.../node_modules/echogarden/dist/utilities/WasmMemoryManager.js:334:14)
        at WasmMemoryManager.allocFloat32Array (file:///.../node_modules/echogarden/dist/utilities/WasmMemoryManager.js:145:55)
        at resampleAudioSpeex (file:///.../node_modules/echogarden/dist/dsp/SpeexResampler.js:32:41)
        at async createAlignmentReferenceUsingEspeak (file:///.../node_modules/echogarden/dist/alignment/SpeechAlignment.js:315:25)
        at async Module.align (file:///.../node_modules/echogarden/dist/api/Alignment.js:107:62)
        at async align (file:///.../node_modules/echogarden/dist/cli/CLI.js:499:115)
        at async startWithArgs (file:///.../node_modules/echogarden/dist/cli/CLI.js:221:13)
        at async start (file:///.../node_modules/echogarden/dist/cli/CLI.js:150:9)

What should I do next? Thanks!

rotemdan commented 5 months ago

It seems that the synthesized speech, produced as part of the DTW alignment process, is now causing the same error while being downsampled to 16kHz, so the workaround doesn't cover all cases.

Anyway, I'm working on this right now.

I've already modified the Speex resampler to process in chunks. This should solve the issue with the maximum WASM memory size in a more thorough way, so this particular error shouldn't occur.
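Conceptually, the chunked approach looks something like this rough sketch (the `SpeexResampler` wrapper here is a hypothetical stand-in, not the actual implementation):

    // Minimal sketch of chunked resampling. Only one small chunk at a time
    // needs to be copied into the WASM heap, so the total audio length no
    // longer matters for WASM memory allocation.
    declare class SpeexResampler { // hypothetical wrapper around the WASM module
        constructor(channels: number, inRate: number, outRate: number)
        process(chunk: Float32Array): Float32Array
    }

    const chunkSize = 2 ** 20 // about 1M samples (4 MiB of Float32 data) per chunk

    function resampleInChunks(samples: Float32Array, inRate: number, outRate: number) {
        const resampler = new SpeexResampler(1, inRate, outRate)
        const outputParts: Float32Array[] = []

        for (let offset = 0; offset < samples.length; offset += chunkSize) {
            const chunk = samples.subarray(offset, offset + chunkSize)

            // The resampler keeps its filter state between calls,
            // so chunk boundaries don't produce audible seams
            outputParts.push(resampler.process(chunk))
        }

        // Concatenate the resampled chunks into a single output buffer
        const totalLength = outputParts.reduce((sum, part) => sum + part.length, 0)
        const output = new Float32Array(totalLength)

        let writePosition = 0
        for (const part of outputParts) {
            output.set(part, writePosition)
            writePosition += part.length
        }

        return output
    }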

It seems to handle 1 hour audio files fine.

Now I'm testing with longer audio files, like 3.5 hours or more.

The Speex resampler now works fine with arbitrary sizes.

But I now see that the wave file encoder / decoder I wrote isn't handling wave files or buffers larger than 4GB, because they are beyond the standard specification. I'm working on handling these larger files by ignoring some of the chunk sizes and parsing in a special way that works with the kind of WAVE output ffmpeg produces in those cases.
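The idea is roughly this (a simplified sketch that assumes the `data` chunk extends to the end of the file, as it does in ffmpeg's output; not the exact code):

    // When a WAVE file exceeds 4 GiB, its 32-bit RIFF and data chunk size
    // fields can't represent the true length, so encoders like ffmpeg write
    // placeholder values. One workaround: trust the actual buffer length.
    function getDataChunkByteLength(fileBuffer: Uint8Array, dataBodyOffset: number) {
        const view = new DataView(fileBuffer.buffer, fileBuffer.byteOffset, fileBuffer.byteLength)

        // The 32-bit little-endian size field immediately precedes the chunk body
        const declaredSize = view.getUint32(dataBodyOffset - 4, true)

        const actualRemaining = fileBuffer.byteLength - dataBodyOffset

        // If the declared size is a placeholder (0xFFFFFFFF), or disagrees with
        // what's actually in the buffer, fall back to the real remaining length
        if (declaredSize === 0xffffffff || declaredSize !== actualRemaining) {
            return actualRemaining
        }

        return declaredSize
    }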

ezekiel747 commented 5 months ago

Not sure if it helps, but: the downsampled audio.wav file I tested with is 1.06GB in size (the original mp3 file was about 132MB). The total duration is 9h.

rotemdan commented 5 months ago

I've made the changes to the wave decoder and encoder to support lengths larger than 4 GiB.

I can now align a 3.5 hour audio file in about 5 minutes using the default dtw engine. The DTW cost matrix size is 22GB and peak memory usage gets to about 28GB.

I have 32GB of RAM so it works reasonably fast, but anything longer than about 3.5 to 4 hours could become very slow (it would need to swap to disk). If this is an audiobook, it's probably better to first split the audio into chapters manually.

Maybe it would help to handle these longer durations by using multi-pass processing settings, with something like `dtw.granularity=['x-low', 'medium']`.

I'll publish the new version soon so you can test it.

rotemdan commented 5 months ago

I've released v1.3.1 with the fixes.

It should work with long audio durations, say, up to 3 to 4 hours, but a 9 hour file may be a bit too much for the dtw alignment engine, since it produces a matrix that grows quadratically with the number of audio frames. I don't know what the exact memory requirements would be, but it could peak at maybe 64GB - 128GB.

If you want to process the entire thing at once using dtw, you can maybe try the multi-pass processing using `dtw.granularity=['xx-low', 'low']` and specify window durations like `dtw.windowDuration=[1200, 120]` - these are just arbitrary values - I didn't test them.
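Putting that together as a command line would look something like this (I'm assuming the CLI accepts list values in this bracketed form; I haven't verified the exact syntax):

    echogarden align audio.wav transcript.txt align.srt \
        --dtw.granularity=['xx-low','low'] \
        --dtw.windowDuration=[1200,120]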

On the other hand, the whisper engine is applied iteratively on 30 second windows, so it can potentially process arbitrary lengths without consuming tens of gigabytes of memory. It's also more accurate, but much slower.

If you have an NVIDIA GPU, you can try `engine=dtw-ra` and set `recognition.engine=whisper.cpp` with the default `base` or `base.en` model. That would be faster, but still probably not as fast as dtw (without swapping), unless you have an extremely powerful GPU.
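For example, something like this (keeping the default model, so no model option is needed; I'm assuming the usual `--option=value` CLI syntax):

    echogarden align audio.wav transcript.txt align.srt \
        --engine=dtw-ra \
        --recognition.engine=whisper.cpp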

I'll try to experiment with these kinds of durations, and see if I can tune the default dtw parameters to handle them. Currently the default logic is only reasonable for durations of up to, say, 2 to 4 hours, but not more. It doesn't automatically choose settings for multi-pass processing.

ezekiel747 commented 4 months ago

Thank you for the update. I've tested some values for the params you suggested. I've managed to get the aligned subtitles by using `--dtw.granularity=xx-low`, although the last quarter of the subtitles is waaay out of sync.

The DTW cost matrix memory size was 3999.7MB - surprisingly low (I only have 16GB of RAM on my MacBook M1). And I get this warning:

    Warning: Maximum DTW window duration is set to 720.0s, which is smaller than 20% of the source audio duration of 31176.4s. This may lead to suboptimal results in some cases. Consider increasing window duration if needed.

I plan to try some more combinations of the params and values, to see if i can get an improved result. Again, thank you!

rotemdan commented 4 months ago

This is the current logic used to set the DTW window duration when it's not given:

    if (options.dtw!.windowDuration == null) {
        const sourceAudioDuration = getRawAudioDuration(sourceRawAudio)

        if (sourceAudioDuration < 5 * 60) { // If up to 5 minutes, set window to one minute
            options.dtw!.windowDuration = 60
        } else if (sourceAudioDuration < 60 * 60) { // If up to 1 hour, set window to 20% of total duration
            options.dtw!.windowDuration = Math.ceil(sourceAudioDuration * 0.2)
        } else { // If 1 hour or more, set window to 12 minutes
            options.dtw!.windowDuration = 12 * 60
        }
    }

So for an audio duration beyond 1 hour, the window is always set to 12 minutes (I didn't remember that, actually, since I wrote this more than a year ago).

This means that the alignment only looks in a range of 12 minutes around the interpolated location in the synthesized reference to try to find the best matching frame.

If you're giving it several hours, it's very likely that a 12 minute window may not be enough, especially if the audio contains areas of music, etc., that are not filtered out. (The reason I set it to 12 minutes is that larger sizes would quickly get to tens of gigabytes of memory - that was actually before I added support for lower granularities, so maybe I should readjust the logic.)
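As a rough sanity check against the 3999.7MB figure reported above (a sketch; the ~150ms frame hop at `xx-low` granularity is an assumed value for illustration, not a confirmed one):

    // Rough DTW cost matrix memory estimate
    const audioDuration = 31176    // source audio duration, in seconds
    const windowDuration = 720     // DTW window duration, in seconds
    const frameHop = 0.15          // assumed seconds per frame at 'xx-low' granularity

    const rows = audioDuration / frameHop     // ~207,840 frames
    const columns = windowDuration / frameHop // ~4,800 frames
    const memoryBytes = rows * columns * 4    // 4 bytes per Float32 cell

    console.log(`${(memoryBytes / 1e6).toFixed(1)}MB`) // ~3990.5MB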

Even though you've set granularity to `xx-low`, it should still be usable enough for subtitles. Low granularity shouldn't cause it to "lose track" any more than higher granularity; it's just about the accuracy of the timing of individual words.

The second pass, if requested, would by default use a window of 15 seconds to refine the alignment found in the first pass. It may not be necessary for subtitles.

Try to increase the window size to 20 minutes (1200 seconds), 30 minutes (1800 seconds), or more. With these durations, though, it's likely the memory usage would go way over 10GB - 20GB. It depends on the length of the audio you give it.
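For example, a 30 minute window would be (same option syntax as above):

    echogarden align audio.wav transcript.txt align.srt --dtw.windowDuration=1800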

Also, another thing I found is that Node 22 now allows buffers and typed arrays of arbitrary lengths (I've allocated a 35 GiB buffer successfully), but Node 20 and earlier allow only up to 4 GiB. I made some changes to try to reduce the peak memory when converting the wave buffer to raw audio, but it still produces several copies in memory. If you're passing the input as a 16kHz mono wave file, it should avoid some of these issues. If you aren't using Node 22, that may currently be the only way to load multi-hour audio files.
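A quick way to see the difference (based on the limits described above; this should throw a RangeError on Node 20 and earlier, but succeed on Node 22):

    // Attempt to allocate a 6 GiB Float32Array (1.5G elements * 4 bytes each)
    const bigArray = new Float32Array(1.5 * 2 ** 30)
    console.log(bigArray.length)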

rotemdan commented 4 months ago

I released 1.4.0.

I reworked the auto-selection of granularities and window durations; multi-pass processing is now selected by default for long durations.

I might change this in the future, of course.

I tested this on a 6 hour audio file and the memory requirement of the first pass was about 5GB - 7GB (the size actually depends on how many words the speech contains, which impacts the synthesized reference size), and the second pass was much smaller. The two passes are actually so fast that they take a minority of the overall time; the other processing stages now take most of it. I think it did it in about 300 seconds (5 minutes).

I also attempted to fix some core issues with how the DTW algorithm works with smaller window sizes, so now I'm more confident the multi-pass approach works correctly, and I can select it by default. I removed all warnings like `all cost directions are equal to infinity` (I explained the reason why that happened in the release notes) and replaced them with a particular strategy for dealing with that case.

There was also a completely unreported eSpeak issue that I hit with about 20% of the multi-hour audiobooks I downloaded from YouTube, causing an error due to a missing marker (it was caused by isolated `'` characters at the start of an utterance). I've worked around it now, but I can't understand why nobody reported it. Maybe people are encountering errors and crashes and just somehow assume I know about them. If I had known about it earlier, I'd have fixed it immediately.

ezekiel747 commented 4 months ago

Thank you for the changes. I've tested with several audio files with long durations, and I can confirm it works fine with either a single pass with granularity `xx-low` or the new 2-pass default.