Recognize function: recognized words with embedded ellipses cause error with segment timeline

echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.

GNU General Public License v3.0


Recognize function: recognized words with embedded ellipses cause error with segment timeline #74

Open dhouck opened 3 days ago

dhouck commented 3 days ago

First, a disclaimer: Echogarden is being used as a dependency of another project (Storyteller) and Iʼm only 95% sure that the bug is entirely in Echogarden. The specific call site is here in that project. I do not know enough about Echogarden yet to invoke it directly and cut out Storyteller, but Iʼm happy to do so if someone walks me through it.

The speech recognition task sometimes fails with the error message Error: Couldn't find the word '<some...word>' in the text at start position <pos>. Hereʼs an example, with backtrace:

Error: Couldn't find the word 'uh...I' in the text at start position 14110
    at wordTimelineToSegmentSentenceTimeline (file:///app/node_modules/echogarden/dist/utilities/Timeline.js:128:23)
    at recognize (file:///app/node_modules/echogarden/dist/api/Recognition.js:184:39)
    at async ...

and each time the word it chokes on has an embedded ellipsis in it, like the uh...I example above.

Given that all the text available at this point is from the audio, this seems like it is saying that it transcribed the word uh...I from the audio, and at the same time it canʼt find it in the transcript, which seems like a bug. My guess, though I donʼt know the project well enough to confirm, is that one part of the process is tokenizing that as two words (eg. uh... and I) and another as one word (uh...I), and then it looks for the latter and canʼt find it.

I am willing to try to find logs or run it directly if asked, but I do not know the project so it might take me a while to figure out how, or where the logs are, or whatever other information you need if this is not enough.

rotemdan commented 3 days ago

The issue occurred because of how the sentence segmentation interacts with the fragments that the recognizer recognizes as individual words. If the recognizer took uh...I as a single word, but the sentence segmenter found a sentence boundary that breaks the word in half, then the error would occur.
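To make the failure mode concrete, here is a minimal sketch, assuming a deliberately naive sentence splitter and word lookup. `naiveSplitSentences` and `assignWordsToSentences` are hypothetical stand-ins, not Echogarden's actual functions, and the real segmenter is more sophisticated.

```typescript
// Hypothetical, simplified stand-ins; not Echogarden's actual code.

// Treats any run of '.', '!' or '?' as a sentence end, even in the middle of a word.
function naiveSplitSentences(text: string): string[] {
  return text.match(/[^.!?]+[.!?]*/g) ?? []
}

// Walks the recognized words in order and looks each one up in the
// current sentence, moving on to later sentences when it is not found.
function assignWordsToSentences(words: string[], sentences: string[]): string[][] {
  const result: string[][] = sentences.map(() => [])
  let sentenceIndex = 0
  let offset = 0

  for (const word of words) {
    let foundAt = -1

    while (sentenceIndex < sentences.length) {
      foundAt = sentences[sentenceIndex].indexOf(word, offset)
      if (foundAt !== -1) break
      sentenceIndex += 1
      offset = 0
    }

    if (foundAt === -1) {
      throw new Error(`Couldn't find the word '${word}'`)
    }

    result[sentenceIndex].push(word)
    offset = foundAt + word.length
  }

  return result
}

// The splitter cuts "uh...I" in half: ["Well, uh...", "I think so."]
const exampleSentences = naiveSplitSentences("Well, uh...I think so.")

try {
  // The recognizer reported "uh...I" as one word, so no sentence contains it.
  assignWordsToSentences(["Well,", "uh...I", "think", "so."], exampleSentences)
} catch (error) {
  console.log((error as Error).message) // logs: Couldn't find the word 'uh...I'
}
```

If the recognizer had instead reported `uh...` and `I` as separate words, the same lookup would succeed, which is why the error only shows up for words with an embedded sentence terminator.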

It was already reported several times. For example on #67. I knew about it for 6 to 8 months before that.

I already published a fix for it on v1.6.0 (published October 4, 2024).

The fix was complex, and required temporarily masking potential sentence terminators in the recognized fragments when processing for sentences.

Ensure you're using the newest version and test again if you're getting this problem.

Anyway, I'm currently developing a new word, phrase and sentence segmentation library, which should make it easier to prevent these types of issues in general.

dhouck commented 3 days ago

I definitely checked I was using a newer version, 2.0.3. It looks like neither of the changes between that and the latest version are relevant.

I did look through the other issues, but I didnʼt find duplicates; the closest I found was #70, which seems to be failing at a slightly different point. I probably forgot to look through closed issues, but if Iʼd found that I would have asked for it to be re-opened, because the bug is still happening.

Do you have any ideas for workarounds before the new segmentation library is done? I guess I could downgrade Storyteller to a version which still uses @smoores-devʼs branch, which looks like it sidesteps the issue, but for now Iʼve been tweaking model details randomly until I happen to hit on a combination which doesnʼt cause the issue for a particular text (but still might for others).

rotemdan commented 3 days ago

Can you give me an input that reproduces this? It may not necessarily be hard to fix. It could be there's some edge case the fix doesn't handle correctly.

dhouck commented 3 days ago

Probably not until tomorrow evening (I was about to go to bed), but I can try then. Of course the issue is that whisper.cpp is nondeterministic so Iʼm not sure how likely it is the same problem will happen.

dhouck commented 2 days ago

This (Google Drive link because the file is large) is a transcoded version of the file I was using at the start of the issue, and generates a similar error to that above. (The original file, which I cut out the opening of because itʼs only needed on chapter 1, is the first one at https://archiveofourown.org/works/24969112/chapters/67835056 if that matters, but I doubt it does for this purpose.) For some reason it now seems to more often read err...I instead of uh...I, but either way it demonstrates the issue.

Hereʼs the command I used to run Echogarden, which I think is identical to what Storyteller does through the API except that I did not provide a custom prompt.

echogarden transcribe /data/assets/audio/fc975685-f49a-4f4c-921d-1454fa72b390/processed/00007-00001.mp4 ch8.txt ch8.srt ch8.json --engine=whisper.cpp --language=en --whisperCpp.enableFlashAttention=true --whisperCpp.model=medium.en --whisperCpp.build=custom --whisperCpp.executablePath=/app/web/whisper-builds/openblas/main |& tee ch8.log

I have attached the log of one run, which in this case shows the uh...I behavior, presumably for the line starting at what the transcript says is 13:31 (but… to me seems to be 15:10? Iʼm not sure why reported timestamps would be that far off. But I havenʼt observed any desyncs that big on chapters that complete, so Iʼll assume itʼs something else. EDIT: Probably this discrepancy is just the voice activity detection). ch8.log

I tried to cut the audio to just what surrounds this part, but when I did, whisper.cpp stopped transcribing the disfluency entirely and I could not get it to include the uh at all, with any punctuation, so that didnʼt help much. I can try a few things on the next chapter, which showed the same issue but might work better with minimization.

rotemdan commented 2 days ago

Thanks a lot for all the information! I was able to reproduce it when using the exact same parameters.

Turns out I had a simple programming error in the fix code:

In Timeline.ts, method replaceSentenceEndersWithinWordsWithMaskingCharacter I used:

modifiedTranscript =
    transcript.substring(0, wordStartOffset) +
    newWordText +
    transcript.substring(wordEndOffset)

Instead of:

modifiedTranscript =
    modifiedTranscript.substring(0, wordStartOffset) +
    newWordText +
    modifiedTranscript.substring(wordEndOffset)

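Because each iteration rebuilt modifiedTranscript from the original transcript, it threw away the replacements made for earlier words. So it basically masked only a single word in the text (only the last instance) when preprocessing the words before sentence segmentation.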

The fix wasn't really masking anything, except for a single word. Once I changed those lines it did work correctly for all instances in the text.
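The difference between the two snippets can be reproduced in isolation. This is a minimal sketch, not the actual Timeline.ts code: the `MaskedWord` shape, the `applyMasks` helper and its `accumulate` flag are hypothetical, and the masked text is supplied up front so that only the splicing step is shown.

```typescript
// Hypothetical sketch of the splicing loop; not the actual Timeline.ts code.
interface MaskedWord {
  startOffset: number // offset of the word in the transcript
  endOffset: number   // offset just past the word
  maskedText: string  // same length as the original word, so offsets stay valid
}

function applyMasks(transcript: string, words: MaskedWord[], accumulate: boolean): string {
  let modifiedTranscript = transcript

  for (const word of words) {
    // accumulate = false mimics the bug: every iteration starts again from
    // the untouched transcript, discarding all earlier replacements.
    const source = accumulate ? modifiedTranscript : transcript

    modifiedTranscript =
      source.substring(0, word.startOffset) +
      word.maskedText +
      source.substring(word.endOffset)
  }

  return modifiedTranscript
}

const exampleTranscript = "uh...I said err...I know."
const exampleWords: MaskedWord[] = [
  { startOffset: 0, endOffset: 6, maskedText: "uh___I" },
  { startOffset: 12, endOffset: 19, maskedText: "err___I" },
]

console.log(applyMasks(exampleTranscript, exampleWords, false)) // uh...I said err___I know.
console.log(applyMasks(exampleTranscript, exampleWords, true))  // uh___I said err___I know.
```

With the buggy variant only the last word keeps its mask, so an earlier `uh...I` can still be split by the sentence segmenter.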

I put many hours of work, almost an entire day, into writing this particularly difficult fix, and a silly code mistake made it ineffective.

I applied the change in v2.0.5, which is now released on npm.

Anyway, the way I designed the fix means the last . character in the ellipsis (like in uh...I) is not masked. This means that given uh__.I the . character may or may not be interpreted as a sentence ender.

In the new segmentation library, which is already fully working and almost ready to publish, I'll try to see how I can treat ellipsis patterns of this kind in a better way. Right now I don't really have much control over how the segmentation is done, since I'm using an external library (cldr-segmentation), making it difficult to deal with issues like this.

Edit: I forgot to mention, the time offsets reported by whisper.cpp don't match the actual audio because the audio is pre-processed to remove any silent sections before it is passed to whisper.cpp. A common issue with the Whisper models is that they hallucinate when given silent, quiet, or sometimes other forms of non-speech audio, and removing those sections significantly helps to avoid that.
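As an illustration of why the raw offsets drift, here is a minimal sketch of how a time in the silence-trimmed audio could be mapped back to the original audio. The `KeptSegment` type, the `trimmedToOriginalTime` function and the example segments are all hypothetical and are not taken from Echogarden's code.

```typescript
// Hypothetical sketch; names and numbers are illustrative only.
interface KeptSegment {
  start: number // seconds, in the original audio
  end: number   // seconds, in the original audio
}

// Maps a time in the trimmed audio (kept segments played back to back)
// to the corresponding time in the original audio.
function trimmedToOriginalTime(trimmedTime: number, kept: KeptSegment[]): number {
  if (trimmedTime < 0) {
    throw new RangeError("Time must be non-negative")
  }

  let elapsed = 0 // trimmed-audio time covered by earlier segments

  for (const segment of kept) {
    const duration = segment.end - segment.start

    if (trimmedTime <= elapsed + duration) {
      return segment.start + (trimmedTime - elapsed)
    }

    elapsed += duration
  }

  throw new RangeError("Time is past the end of the trimmed audio")
}

// Five seconds of silence (10 s to 15 s) were removed from the original.
const exampleKept: KeptSegment[] = [
  { start: 0, end: 10 },
  { start: 15, end: 30 },
]

console.log(trimmedToOriginalTime(12, exampleKept)) // 17
```

Every removed stretch of silence before a given point adds to the gap, so the difference between the two clocks grows over the course of a long file.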

dhouck commented 2 days ago

I put many hours of work, almost an entire day, into writing this particularly difficult fix, and a silly code mistake made it ineffective.

Iʼve been there. And the other side, where I put a whole lot of effort into trying to figure out how to solve an issue, only to realize that the fix is really simple and I was overcomplicating the problem.

Anyway, the way I designed the fix means the last . character in the ellipsis (like in uh...I) is not masked. This means that given uh__.I the . character may or may not be interpreted as a sentence ender.

I think Iʼm missing something. When I read the code it looks like it would replace all of them if there are later characters in the word, and since the last character of uh...I is an I all the .s would be replaced (because isLastChar is false for all of them)? It looks like what you said would only apply to uh... on its own, which would be fine, because nothing breaks if uh__. is detected as ending a sentence or if it isnʼt. Did I understand right, or are you saying it would create uh__.I, and if it would, couldnʼt that cause the same issue?

rotemdan commented 2 days ago

Yes, I think it replaces it with uh___I when it's connected (uh...I). I probably meant the case uh... I, where there's a space between them. In that case it would be uh__. I, and the . would likely be taken as a sentence ender.
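The per-word rule discussed here can be sketched as follows. This assumes the behavior described in the thread (mask every sentence ender except one in the last position of the word); `maskSentenceEndersWithinWord` and `isSentenceEnder` are hypothetical names, and the set of sentence enders is a guess, not the list the library uses.

```typescript
// Hypothetical sketch of the masking rule described above.
const sentenceEnders = new Set([".", "!", "?"]) // assumed set, for illustration

function isSentenceEnder(char: string): boolean {
  return sentenceEnders.has(char)
}

function maskSentenceEndersWithinWord(word: string, maskChar = "_"): string {
  let result = ""

  for (let i = 0; i < word.length; i++) {
    const char = word[i]
    const isLastChar = i === word.length - 1

    // A sentence ender is only left alone when it is the word's final character.
    result += isSentenceEnder(char) && !isLastChar ? maskChar : char
  }

  return result
}

console.log(maskSentenceEndersWithinWord("uh...I")) // uh___I
console.log(maskSentenceEndersWithinWord("uh..."))  // uh__.
```

Under this rule the connected form `uh...I` has no sentence ender left for the segmenter to split on, while a standalone `uh...` keeps its final `.`, which is harmless because a break after the whole word does not cut it in half.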

Anyway, this nightmare-level issue is going to end once I integrate the new segmentation library (likely published as @echogarden/text-segmentation), which would also improve accuracy (larger lexicon for abbreviations on multiple languages and better handling of number patterns and various other patterns and special symbols) and include integrated support for difficult to segment languages like Chinese, Japanese, Thai and Khmer (via second-stage post-processing with a brand new WebAssembly port of the ICU C++ library, which would be published at @echogarden/icu-segmentation-wasm). The east-Asian language integration would work for arbitrary language combination mixes, without needing to pre-specify a language or detect it.