echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.
GNU General Public License v3.0

Alignment: DTW may give inaccurate results due to silent or non-speech sections #23

Open Tendaliu opened 1 year ago

Tendaliu commented 1 year ago

Hi, it's me again

I've been encountering a problem with the text alignment feature. Specifically, the subtitles often appear earlier than the corresponding audio. There are also instances where they appear later than the audio.

Is there a known solution or workaround for this problem?

image

rotemdan commented 1 year ago

Thanks for the report.

I'll need a lot more information.

Which engine did you use here, dtw or dtw-ra? If you haven't tried dtw-ra (--engine=dtw-ra), it may work better, since it deals better with silence and non-speech segments.

What is the total length of the audio? The default setting for dtw.windowDuration, which is 120 seconds, will only work correctly for audio that is up to about 10 minutes long (or a bit more than that). This setting may need to be increased for longer durations.
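
For example (file names and the exact value here are just an illustration), for audio that's much longer you could increase the window roughly in proportion to the duration:

echogarden align audio.mp3 transcript.txt out.srt --dtw.windowDuration=600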

In the image you are showing, the starting timestamps seem to be too early. It isn't clear whether this has to do with how the algorithm deals with silence, or with something else.

The best thing to do is to send the audio file and its transcript. You can attach the two files to the message. You can use Echogarden itself to export the audio to an .opus file (by adding an .opus file as an output to the alignment operation), which should produce a small compressed file (32kbps), even for longer audio.
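
For example (file names here are just an illustration), adding an .opus output to the alignment command will also write the compressed audio:

echogarden align audio.mp3 transcript.txt out.srt audio.opus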

Tendaliu commented 1 year ago

I didn't set the engine, so it should be dtw.

test.zip

rotemdan commented 1 year ago

Thanks for the files, I'm investigating them..

dtw-ra is slower, but it works a bit differently and can deal with harder inputs (more noise, even background music). It includes a speech recognition phase (by default it uses the Whisper engine with its tiny model). The timing it outputs is based on the recognizer's timing, which doesn't usually have major inaccuracies or offsets like the ones you show. However, the Whisper model can sometimes get stuck in loops, may skip entire segments, and, when given cleaner audio, is not as per-word accurate as dtw. If your goal is mostly subtitles, you can also try dtw-ra when dtw doesn't work that well. It may be better, as good, or worse, depending on the audio.

Tendaliu commented 1 year ago

image

rotemdan commented 1 year ago

I see that the alignment is outputting the initial text of each line during the pause, which is too early. This is something the DTW alignment sometimes does. I can try to improve it by adding "whitespace" tokens that try to encourage the alignment to match whitespace with silence.

Anyway, this also revealed an unrelated issue with dtw-ra I don't understand yet:

npx echogarden align test/audio.mp3 test/text.txt out/out.json out/out.srt --engine=dtw-ra --play
Echogarden v0.10.11

Prepare for alignment.. 1376.4ms
No language specified. Detecting language.. 249.5ms
Language detected: Chinese (zh)
Get espeak voice list and select best matching voice.. 298.8ms
Selected voice: 'sit/cmn' (cmn, Chinese)
Load alignment module.. 0.5ms
Prepare for recognition.. 96.4ms
Load whisper module.. 2.7ms
Load tokenizer data.. 390.6ms
Create ONNX inference session for model 'tiny'.. 5769.8ms

Prepare audio part at time position 0.00.. 0.8ms
Extract mel spectogram from audio part.. 88.7ms
Normalize mel spectogram.. 57.1ms
Encode mel spectogram with Whisper encoder model.. 1146.8ms
Decode text tokens with Whisper decoder model.. Error: The number NaN cannot be converted to a BigInt because it is not an integer

I saw your comment just now, while I was writing. Yes, that's the same error.

rotemdan commented 1 year ago

The error (Error: The number NaN cannot be converted to a BigInt because it is not an integer) doesn't happen when I apply recognition to the same exact audio file with the same engine (whisper) and model (tiny). I can see the Chinese transcript being recognized. This is really odd.

Anyway, thanks for opening the issue, it revealed an unrelated bug!

rotemdan commented 1 year ago

The issue is now resolved in 0.10.12. It happened because a long language code was passed to the model (language detection returned the full zh-CN and it wasn't shortened to zh). The long language code was translated to undefined, and then to a NaN token. Anyway, it will now also error properly if a language code is unsupported by the model.
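
For illustration, the kind of normalization involved looks roughly like this (a minimal sketch; the names and token values are placeholders, not the actual internals):

const whisperLanguageTokens = new Map<string, number>([
    // Placeholder token IDs, for illustration only
    ['en', 50259],
    ['zh', 50260]
])

function getWhisperLanguageToken(languageCode: string): number {
    // Shorten an IETF tag like 'zh-CN' to its primary subtag ('zh')
    const shortCode = languageCode.split('-')[0].toLowerCase()

    const token = whisperLanguageTokens.get(shortCode)

    // Fail clearly instead of letting undefined propagate and become a NaN token
    if (token === undefined) {
        throw new Error(`Language '${languageCode}' is not supported by the Whisper model`)
    }

    return token
}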

dtw-ra should work properly now.

I'll keep looking at trying to improve the alignment around silence (for synthesized outputs, this matters less because I usually apply alignment on single sentences and trim any silence).

Tendaliu commented 1 year ago

Do you know about CapCut? I'm curious about their technology right now because they also have similar features.

rotemdan commented 1 year ago

I didn't know about CapCut, but I do know Descript. There are also plugins that can be used in Adobe Premiere (I don't remember the exact names). The speech field has many, many commercial projects and startups (though almost all are cloud and web-based), but very few open-source projects (aside from research-oriented ones, which are usually very hard to use and often lack basic features), especially ones which provide the extras needed to make them accessible to end users, like subtitle generation, codec conversions, easy configuration, etc.

Even commercial services don't usually give you all the features that a tool like this can provide, like outputs in multiple formats, subword-level timestamps, automation, unlimited use, and a high level of flexibility in configuration.

My goal, at this time, is to develop code that would potentially become usable enough to be practical for various purposes. I don't think there's anything really special about these services. They get good quality by using slow or custom models and running them on very fast GPUs in the cloud. This program can't take that route, since it runs everything locally.

Anyway, as for alignment: there are still some areas that need to be worked on.

rotemdan commented 1 year ago

I found a way to work around the problem.

If I trim the silence at the beginning and end of the audio file, the alignment looks better with dtw. The silences in the middle of the audio do seem to match the line breaks:

This zip file contains the same audio file, only with the starting and ending silence trimmed: Chinese1.zip

I will add auto-trimming of silence on the beginning and end of the audio before alignment.

I haven't tested with dtw-ra yet. The problem with repetition has to do with the relatively long 2-second silences the audio has within it in many places. The Whisper model sometimes hallucinates or loops due to this silence.

I could also trim any silence within the middle of the audio to help with this, before passing it to the recognition engine. However, this would require more complex mapping to later take those silences into account in the timing for the original audio. That's part of the reason I haven't done that so far.

Edit: because your transcript has single line breaks as separators, consider setting --plainText.paragraphBreaks=single, so that every line is treated as a separate sentence, if you want.

rotemdan commented 1 year ago

Just realized the code already trims ending silence based on a -40dB threshold (but not start silence).

I came back to your example and tried again with various settings, to see if it makes any difference (including the new dtw.granularity setting I introduced today, which you may want to check out, since it can enable faster alignment, with much less memory [up to 10x less or more], for longer audio durations).

The trimming doesn't seem to make much of a difference for this particular example.

I think I had a lag in my playback that I misidentified as the issue you are describing.

It seems like your issue is more subtle. The problem has to do with how silent sections are mapped to non-silent sections, due to how the dynamic time warping algorithm works.

I'll keep investigating. I have some new ideas. Possibly: try to trim silence from individual word start and end timestamps based on silence detection of the particular sample range. It will require some new experimental code to be written first, and testing to be conducted to ensure it doesn't cause unintended effects.

Tendaliu commented 1 year ago

Thank you for your diligent efforts in addressing this issue. I truly appreciate the time and energy you've invested in troubleshooting and exploring solutions. Looking forward to the developments

rotemdan commented 1 year ago

The issue should be mostly resolved in 0.11.3:

After step-debugging into the code, I realized that, for some reason, given Chinese text, eSpeak doesn't introduce pauses when there are line breaks in the text. However, with other languages, it does (maybe this is related to how Chinese is read from text).

The problem was that since there are no punctuation marks, and line breaks are ignored, there are no pauses in the reference synthesized audio (the eSpeak audio used to serve as the reference when aligning). Basically, all words are spoken without any breaks, which causes the alignment to match some of the pauses to characters.

This file contains an example of the reference synthesized audio generated by eSpeak when there are no breaks between lines (c-nobreaks.mp3) and when there are breaks (c-breaks):

c-breaks-nobreaks.zip

The version with breaks acts as if there was both a period and a full paragraph break between each line:

南国名城.

侨乡新会.

新会古称“冈州”.

地处珠江三角洲西部.

这里全年四季分明.

气候温和.

雨量充沛.

被评为广东省最美的生态乡村.

江门市美和食品有限公司.

...

In order to make this work with alignment, in 0.11.3, I've changed the alignment code to use the full call to the synthesis API when producing the alignment reference (before that, it used a lower-level method that didn't perform the complex preprocessing and postprocessing that the full synthesize method does).

So now, to ensure you get good alignment for text like:

南国名城
侨乡新会
新会古称“冈州”
地处珠江三角洲西部
这里全年四季分明
气候温和
雨量充沛
被评为广东省最美的生态乡村
江门市美和食品有限公司
是一家集新会柑种植
陈皮茶生产加工和仓储陈化于一体的
新会陈皮全产业链经营企业

You must add --plainText.paragraphBreaks=single. This treats every line as a separate paragraph, causing the reference to include a 1-second pause after each spoken line.

These pauses in the reference will, in turn, make it easier for the DTW alignment to match the pauses in the synthesized reference audio to the pauses in the source audio, since the pauses now have more comparable lengths.

So your command should be:

echogarden align audio.mp3 text.txt --plainText.paragraphBreaks=single

You can also add --dtw.granularity=high to ensure accuracy for this particular test. In general, though, if you're only producing subtitles, you can use a lower granularity, or just omit it and let it be chosen automatically.

Thanks for reporting this. This fix may improve the quality of alignments in all languages.

(along the way I found an unrelated issue in the timeline produced by eSpeak, which is also fixed now)

Tendaliu commented 1 year ago

I tested it today. The end result is the same. There are no pauses or gaps between each subtitle. However, the corresponding audio sections are silent. The ideal outcome should be like this:

image

rotemdan commented 1 year ago

I tested the same input audio with:

echogarden align audio.mp3 text.txt --plainText.paragraphBreaks=single

And there is definitely a significant improvement. I can see that the timing of each line's start is now correctly synchronized, and doesn't anticipate the speech anymore. I used this particular example as a guide and tested it many times. I'm not sure what you're saying.

Are you sure you are using the latest version? The improvement only occurs with the latest version.

Tendaliu commented 1 year ago

I don't know what was going on with my brain earlier. Yes, there has indeed been an improvement. The starting times of the subtitles now largely align with the audio. The only issue is that the end times don't align; they continue right up to the appearance of the next subtitle.

rotemdan commented 1 year ago

You haven't made it clear whether you saw any improvement at all.

Are you now referring to each subtitle cue appearing too early, or to it extending after the speech ends? Cues shouldn't appear too early now; that was the issue I worked on.

Based on what you described, it may be that you're now seeing that, by default, each cue has up to 3 seconds added to it (or up to the start of the next cue) to ensure it's visible. Is that it?

If that's the case, you can set --subtitles.maxAddedDuration=0. Is that the behavior you want?
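
For reference, a full invocation could look like this (file names carried over from the earlier examples):

echogarden align audio.mp3 text.txt out.srt --plainText.paragraphBreaks=single --subtitles.maxAddedDuration=0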

Edit: the reason it behaves like this is that it is primarily designed to make readable subtitles that appear long enough when watching a video. It isn't designed, by default, to give you accurate time boundaries for the text.

However, in the next versions I'm planning to add several alternative subtitle modes. There's going to be a mode where each word has its own cue (a feature I need for testing anyway), and also a mode where sentences are not always aligned to the start of cues; that is, a new sentence can start in the middle of a cue. Some users may prefer that.

It will probably require rewriting most of the subtitle generation code, though.

Tendaliu commented 1 year ago

image

I'm not sure why this particular subtitle went wrong. But aside from that, the results are great with --subtitles.maxAddedDuration=0.

rotemdan commented 1 year ago

At 2:26.250, there is a 2 second pause, which is the longest in the audio:

Screenshot_5

For whatever reason the DTW alignment matched some of this pause to the beginning of the next line, rather than leaving a full gap:

59
00:02:23,960 --> 00:02:26,259
粒粒都是独特陈香

60
00:02:26,520 --> 00:02:30,120
品质铸造价值

The silence in the reference synthesized audio was only mapped to the 261ms range 00:02:26,259 to 00:02:26,520:

Screenshot_7

Other than that, other cues seem to be accurately bounded:

Screenshot_6

In the future I'll add removal of silent and non-speech segments of the audio, using VAD (voice activity detection), so that particular case may also improve.

(If you're wondering, the visualization is in REAPER. I'm using a special script that automatically imports subtitle files as regions.)

Edit: I can also implement the idea I suggested before: go through each word entry in the timeline and try to see if its timing includes excessive silence preceding or succeeding it, and then adjust its timestamps to remove the silence.

Tendaliu commented 1 year ago

I think the problem at 2:26.250 is the same as with the first cue, which also has no gap and starts exactly at 00:00:00,000.

rotemdan commented 1 year ago

The reason the first cue includes the silence is slightly different. It's because the synthesized reference doesn't have any silence at the beginning, and the way DTW works is that it always matches the first element of the first sequence (here the synthesized audio) to the first element of the second sequence (here the target audio). The same goes for the last elements.

Because the synthesized reference audio doesn't have silence at the beginning, the first speech segment naturally tends to be matched so that it includes the target audio's initial silence as well.
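
As a rough illustration (a generic DTW sketch, not the actual implementation), the standard cost-matrix recursion makes this boundary constraint explicit: the accumulated cost starts at cell (0, 0), and the optimal path is traced back from the last cell, so the first and last frames of both sequences are always matched to each other:

function dtwCostMatrix(
    reference: number[][],
    target: number[][],
    distance: (a: number[], b: number[]) => number
): number[][] {
    const n = reference.length
    const m = target.length

    const cost: number[][] = Array.from({ length: n }, () => new Array(m).fill(Infinity))

    // Boundary constraint: frame 0 of the reference is always matched to frame 0 of the target
    cost[0][0] = distance(reference[0], target[0])

    for (let i = 0; i < n; i++) {
        for (let j = 0; j < m; j++) {
            if (i === 0 && j === 0) {
                continue
            }

            const bestPredecessor = Math.min(
                i > 0 ? cost[i - 1][j] : Infinity,
                j > 0 ? cost[i][j - 1] : Infinity,
                i > 0 && j > 0 ? cost[i - 1][j - 1] : Infinity
            )

            cost[i][j] = distance(reference[i], target[j]) + bestPredecessor
        }
    }

    // Backtracking starts from cost[n - 1][m - 1], so the last frames are also forced to match
    return cost
}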

One solution is to add silence to the beginning of the reference, but in general it isn't necessarily a good idea.

Another solution to this is to perform individual trimming of phone and word entries on the timeline: go through each one, check if the time range it references includes significant silence or non-speech at its beginning or end, and then trim the individual timestamp to remove it.

This is possible to do. Though I would prefer to do it properly and use a VAD (voice activity detector) to handle not only silence, but also other non-speech sounds like background noises, ambient noises etc.

I can maybe apply VAD to the entire audio at once, and then look up, for each timestamp range, whether it contains non-speech. However, there are some possibilities of false positives, since normal speech naturally has short pauses within it (plosives are an example, and fricative consonants may be confused with non-speech). I'll need to find a set of rules to detect cases where it is highly likely that the non-speech part is safe to trim from the timestamp range.

rotemdan commented 12 months ago

In 0.11.12 it now trims individual time ranges to remove preceding or following silence within mapped entries (mapped words, phones) after alignment. Silence detection currently uses a threshold of -40dB peak amplitude (this is after the source audio waveform is pre-normalized to a peak amplitude of -3dB before alignment).

Using voice activity detection for this instead would be much more error prone, since VAD produces a lot of false positives.

For silence trimming to occur, the audio segment must be very quiet. If there is background music, relatively strong ambient noise, or a high noise floor (a reasonably strong hiss or hum, etc.), no trimming will be done.
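
In rough outline, the trimming works something like this (a simplified sketch, not the exact code; a real implementation would likely measure peaks over short frames rather than single samples):

function amplitudeToDecibels(amplitude: number): number {
    // Avoid -Infinity for perfectly silent samples
    return 20 * Math.log10(Math.max(Math.abs(amplitude), 1e-10))
}

// Given a word's mapped sample range, move the start and end inward
// past any samples that stay below the threshold
function trimSilenceFromRange(
    samples: Float32Array,
    startIndex: number,
    endIndex: number,
    thresholdDb = -40
): [number, number] {
    let trimmedStart = startIndex
    let trimmedEnd = endIndex

    while (trimmedStart < trimmedEnd && amplitudeToDecibels(samples[trimmedStart]) < thresholdDb) {
        trimmedStart++
    }

    while (trimmedEnd > trimmedStart && amplitudeToDecibels(samples[trimmedEnd - 1]) < thresholdDb) {
        trimmedEnd--
    }

    return [trimmedStart, trimmedEnd]
}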

Tendaliu commented 11 months ago

Hi, I've been occupied with other matters previously. It appears that the program doesn't incorporate voice isolation, which may lead to misalignment when dealing with audio that has background music. Have you considered adding a voice isolation process? Also, are you aware of any existing voice isolation projects that could be integrated?

Tendaliu commented 11 months ago

Also, I wrote a plugin for DaVinci Resolve using your program, which is great.

image

rotemdan commented 11 months ago

You can use the slower dtw-ra engine (--engine=dtw-ra), which uses a speech recognition step and works much better for audio that has background noise and music. By default it uses the Whisper tiny model.

Also, there is direct support for speech denoising using RNNoise, made by Jean-Marc Valin at Xiph.org (the same people who made the Opus, Speex, and Vorbis codecs). It's one of the very few open-source speech noise reduction libraries available.

RNNoise was published in 2017. It is very fast and based on a neural network that is tiny by modern standards (compared, say, to something like RTX Voice, which was a few gigabytes the last time I tried it).

I can add it as a preprocessing step for standard speech alignment (the dtw engine), but I'm not sure the result would be any better than dtw-ra in most cases - most likely worse (it's been on my task list for a while, but I haven't considered it that significant).

The CLI guide mentions it, and you can test its effect now. Here's the relevant part of the guide:

Speech denoising

Task: try to reduce the amount of background noise in a spoken recording.

This would apply denoising and play the denoised audio:

echogarden denoise speech.mp3

This would apply denoising, and save the denoised audio to a file:

echogarden denoise speech.mp3 denoised-speech.mp3

Commercial tools:

Nvidia makes some of the best voice isolation models that I know of; it has made some naming changes over the last few years: NVIDIA RTX Voice, and the NVIDIA Broadcast App (which is newer than RTX Voice - I've never tried it).

Expensive, but very good quality and UI for static noise reduction and other operations: Izotope RX


There are many more, but I'll have to research to find them. Also, speech noise reduction is one of my personal interests. In many algorithms, you usually take a sample of the noise-only part, so they require an extra step that cannot be done in a command-line application.

In any case, the simpler algorithms are mostly about reducing static noise, not music! For music you'll probably need a large neural network like RTX Voice, etc.

Anyway, to sum it up, try --engine=dtw-ra, it can work with background music.

Pimax1 commented 10 months ago

Hi guys, I am having a hard time getting proper results using dtw. I get completely out-of-sync results (see the zip file attached).

dtw-ra works very well, but unfortunately its processing time is way too long for my use case. I think you guys get correct results with dtw, so there must be something I am missing.

As I need timing for each word (and want to keep the punctuation), I build my transcript like this, with one word per line: Once upon a time, ...

The audio I am using is crystal clear (generated with TTS), speech is fast, and silences are usually short. My audio files vary in length from a few seconds to 1 minute.

My settings: --engine=dtw --plainText.whitespace=preserve --subtitles.minWordsInLine=1 --subtitles.maxLineCount=1 --subtitles.maxAddedDuration=0 --subtitles.maxLineWidth=100 --plainText.paragraphBreaks=single --overwrite

I think my problem is that I want word-level timing without losing punctuation, so my transcript is one word per line. Are there any settings I should change to get proper results?

Thanks !

166.zip

rotemdan commented 10 months ago

I tried to run echogarden align 166.wav 166.txt (with no options) and it looked mostly accurate. By default it converts all line breaks to spaces and then processes it normally.

When you are preserving whitespace with --plainText.whitespace=preserve the alignment wouldn't work correctly since the reference synthesized audio produced during alignment (using the espeak engine) inserts long breaks (a second or more) between each word, and effectively treats each word as a separate sentence.

You can just use the normal text (or your current one, without --plainText.whitespace=preserve) and set the new word mode for subtitles (added in release 0.11.14, on 13 September 2023).

echogarden align 166.wav 166.txt 166.srt --subtitles.mode=word

Here is a part of the resulting .srt file:

1
00:00:00,127 --> 00:00:00,430
Once

2
00:00:00,430 --> 00:00:00,680
upon

3
00:00:00,730 --> 00:00:00,734
a

4
00:00:00,800 --> 00:00:01,290
time

5
00:00:01,599 --> 00:00:01,599
in

I haven't tested if these timestamps are accurate for the particular input, though.

Also you can output the .json timeline file, which includes paragraph (segment), sentence, word, token and phone timing:

echogarden align 166.wav 166.txt 166.json

No manual preprocessing of the text is needed in either case.

Edit: In the version I released today (0.11.15), I exposed a method (timelineToSubtitles) that allows performing timeline-to-subtitles conversion via the Node.js API, with all the options the CLI provides, if that helps.

Pimax1 commented 10 months ago

Thank you for such a fast answer. Indeed, it's accurate with --subtitles.mode=word only, but it loses all punctuation in the .srt.

I was just trying to find a way to preserve the punctuation in the output, because I can get very complex punctuation cases.

I might just rebuild the output I want by mapping the missing punctuation characters for each sentence from the JSON.

Just a note: I managed to make it output one word per line while keeping the punctuation, using only echogarden align 166.wav 166.txt 166.json --plainText.paragraphBreaks=single, but the results are completely off too.

Anyway, thanks for your fast answer!

rotemdan commented 10 months ago

The reason I chose not to include punctuation in the word mode is that it's derived from timeline entries, which intentionally avoid having punctuation in words, to allow these words to be analyzed and processed correctly (also, in general, it's more correct not to include extra characters in a word).

There are many types of punctuation that may precede or follow a word: it can be something like ., , ! ?, but also longer sequences like )., )", "., "), etc. Punctuation can also precede a word, as in "hello, 'hello, ('hello, etc.

So even with a mode that includes punctuation, it isn't clear which types of punctuation to include: should it be preceding? Following? Only one character? More than one character?

I understand that in some applications, like some forms of video creation, the goal is to make a sort of word-by-word slideshow, so having only simple following punctuation like ., , ! ? can be useful. But in other applications, punctuation wouldn't always be desirable.

I can add a subtitles mode that includes some punctuation, but it's not yet clear to me which punctuation patterns to include.

A workaround is to use the JSON timeline (or the timeline object, if using the Node.js API), take the end offset of each word via the endOffsetUtf16 or endOffsetUtf32 properties, and then append any following punctuation character(s) found at those positions.

You can then call the new timelineToSubtitles method with the modified timeline to generate the subtitles.
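
Here is a rough sketch of that modification step (the nested timeline structure and the punctuation pattern here are assumptions, so adjust them to your data):

import { readFileSync, writeFileSync } from 'fs'

// Matches simple punctuation sequences directly following a word, e.g. '.', '),' or '".'
const followingPunctuationPattern = /^[)\]"'.,!?;:]+/

// Recursively walk the timeline and append any punctuation that immediately
// follows each word in the original text
function appendFollowingPunctuation(entries: any[], originalText: string) {
    for (const entry of entries) {
        if (entry.type === 'word' && typeof entry.endOffsetUtf16 === 'number') {
            const match = originalText.slice(entry.endOffsetUtf16).match(followingPunctuationPattern)

            if (match) {
                entry.text += match[0]
            }
        }

        if (Array.isArray(entry.timeline)) {
            appendFollowingPunctuation(entry.timeline, originalText)
        }
    }
}

const originalText = readFileSync('166.txt', 'utf8')
const timeline = JSON.parse(readFileSync('166.json', 'utf8'))

appendFollowingPunctuation(timeline, originalText)

writeFileSync('166-with-punctuation.json', JSON.stringify(timeline, null, 2))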

Pimax1 commented 10 months ago

I see, it makes sense to keep only words in the output. I will just rebuild my desired output as you suggest. Excellent work, by the way!

rotemdan commented 10 months ago

@Pimax1

In version 0.11.16 (just released), I added a new line mode to the subtitles generator:

You can use it with --subtitles.mode=line.

What it does is generate a single cue for each line of the original text.

This means that you can specify exactly how you want each cue to look, by putting the text for each cue on an individual line. It includes any punctuation or even space characters in the line (regardless of plainText settings).

This can be useful for video creators that want to have a 'word slideshow' in the style of:

Welcome 
to my
channel!

Today,
we'll
talk
about
how to
make a
...

Currently it only generates single-line cues, however; it doesn't generate multiline cues if the line is too wide. I could add that in a future update if needed.

All other settings except maxAddedDuration are ignored.

In the API, you'll have to pass the exact original text via the originalText property for this to work (otherwise you'll get an error).

I also added an optional totalDuration property to the configuration object, so that when maxAddedDuration is used, the last cue is also extended, up to this value (otherwise it is not extended, since timelineToSubtitles doesn't know the total duration of the audio, and it's not included in the timeline).
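
Here is a minimal usage sketch for the line mode via the Node.js API (the import path and exact options shape are assumptions; check the options documentation for the precise details):

import { readFileSync } from 'fs'

// Import path assumed; timelineToSubtitles is exposed via the Node.js API
import { timelineToSubtitles } from 'echogarden'

const originalText = readFileSync('text.txt', 'utf8')
const timeline = JSON.parse(readFileSync('out.json', 'utf8'))

// Awaiting in case the method is asynchronous (harmless if it isn't)
const subtitles = await timelineToSubtitles(timeline, {
    mode: 'line',        // one cue per line of originalText
    originalText,        // required in line mode (an error is thrown otherwise)
    totalDuration: 180   // optional; illustrative value, assumed to be in seconds
})

console.log(subtitles)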

This was directly inspired by your feedback. Please let me know if there are issues or possible enhancements.