Whisper.cpp parseResultObject failure/edge case

echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.

GNU General Public License v3.0

171 stars, 17 forks

Whisper.cpp parseResultObject failure/edge case #65

Open smoores-dev opened 1 month ago

smoores-dev commented 1 month ago

When recognizing some borderline pathological audio content, Whisper.cpp apparently sometimes outputs tokens without offset properties, resulting in the following error:

await recognize(`./00000-00009.mp4`, {engine: "whisper.cpp", language: "en", whisper: {model: "tiny.en", build: "cpu"}})
Uncaught TypeError: Cannot read properties of undefined (reading 'from')
    at parseResultObject (file:///home/smoores/code/storyteller/node_modules/echogarden/dist/recognition/WhisperCppSTT.js:187:49)

Here's the audio asset in question:

https://github.com/user-attachments/assets/68a6bac2-6461-4787-8943-821f6c5d0311

It's a TTS narration of a passage that includes, at a few points, the phrase "I love you" several dozen times in a row.

rotemdan commented 1 month ago

The lines possibly related are:

if (tokenIndex === 0 && tokenObject.text === '[_BEG_]' && tokenObject.offsets.from === 0) {
    currentCorrectionTimeOffset = segmentObject.offsets.from / 1000
}

and

startTime = tokenObject.offsets.from / 1000
endTime = tokenObject.offsets.to / 1000

The code makes the assumption that offsets.from and offsets.to are always available.
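As a rough illustration of dropping that assumption, here is a minimal sketch of a guard that treats offsets as optional. The WhisperCppToken interface and the hasOffsets and tokenTimesInSeconds helpers are hypothetical names, with the token shape inferred only from the snippets in this thread; none of this is echogarden's actual code.

```typescript
// Hypothetical token shape, inferred from the snippets in this thread.
interface WhisperCppToken {
  text: string
  id?: number
  p?: number
  offsets?: { from: number; to: number } // milliseconds; may be absent
}

type TokenWithOffsets = WhisperCppToken & { offsets: { from: number; to: number } }

// Type guard: true only when both offsets are present and numeric.
function hasOffsets(token: WhisperCppToken): token is TokenWithOffsets {
  return typeof token.offsets?.from === 'number' && typeof token.offsets?.to === 'number'
}

// Returns start/end times in seconds, or undefined when the token has no offsets,
// instead of throwing a TypeError on `token.offsets.from`.
function tokenTimesInSeconds(token: WhisperCppToken): { startTime: number; endTime: number } | undefined {
  if (!hasOffsets(token)) {
    return undefined
  }

  return {
    startTime: token.offsets.from / 1000,
    endTime: token.offsets.to / 1000,
  }
}
```

A caller would still need to decide what to do with the undefined case, for example skipping the token or estimating its timing from its neighbors.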

Anyway, the whisper.cpp build used by default (from early April 2024) has become slightly outdated. Can you try a newer whisper.cpp build (v1.6.0 seems to be the latest published with actual binaries) to see if the problem has been fixed since then?

You can set a custom main executable with whisperCpp.executablePath.
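For illustration, the option might be set roughly like this. This is only a sketch of an options object: the whisperCpp.executablePath key comes from the comment above, the path is a hypothetical placeholder, and the other keys are copied from the original report.

```typescript
// Sketch of recognition options pointing at a locally built whisper.cpp binary.
// The path below is a hypothetical placeholder; replace it with your own build.
const recognitionOptions = {
  engine: 'whisper.cpp',
  language: 'en',
  whisperCpp: {
    executablePath: '/path/to/whisper.cpp/main',
  },
}
```

The object would then be passed as the second argument to recognize, in place of the options shown in the original report.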

If that doesn't help, I'll see how I can work around the issue to prevent the error.

smoores-dev commented 1 month ago

Yeah, this is unfortunately happening even when building directly from HEAD on the master branch of the whisper.cpp repo! I just ran the whisper.cpp command with the same flags as echogarden and found the problem token; at the end of the first very long string of "I love you"s, the last "you" token looks like this:

{
    "text": " you",
    "id": 291,
    "p": 0.960787,
    "t_dtw": -1
}

It has neither timestamps nor offsets!
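To locate such tokens without reading the JSON by hand, a small scan like the following could help. It is a sketch only: the RawSegment and RawToken shapes, including the tokens array name, are assumptions based on the objects quoted in this thread, and findTokensWithoutOffsets is a hypothetical helper.

```typescript
// Assumed shapes, based on the token and segment objects quoted in this thread.
interface RawToken {
  text: string
  offsets?: { from: number; to: number }
}

interface RawSegment {
  offsets: { from: number; to: number }
  tokens: RawToken[] // assumed property name
}

interface MissingOffsetReport {
  segmentIndex: number
  tokenIndex: number
  text: string
}

// Lists every token that has no `offsets` property, with its position.
function findTokensWithoutOffsets(segments: RawSegment[]): MissingOffsetReport[] {
  const missing: MissingOffsetReport[] = []

  segments.forEach((segment, segmentIndex) => {
    segment.tokens.forEach((token, tokenIndex) => {
      if (token.offsets === undefined) {
        missing.push({ segmentIndex, tokenIndex, text: token.text })
      }
    })
  })

  return missing
}
```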

rotemdan commented 1 month ago

Thanks a lot for the investigation.

I guess the issue can be reported on the whisper.cpp repository, if it hasn't already.

For now, I can work around the issue by filling in missing timestamps based on neighboring timestamps.
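One possible shape for such a workaround is sketched below. It is not the fix that was eventually written: TimedToken and fillMissingTimes are hypothetical names, and spreading a run of untimed tokens evenly across the gap between the nearest timed neighbors (or the segment boundaries) is just one simple policy among several.

```typescript
// Hypothetical token shape after parsing: times in seconds, possibly missing.
interface TimedToken {
  text: string
  startTime?: number
  endTime?: number
}

function isTimed(token: TimedToken): boolean {
  return token.startTime !== undefined && token.endTime !== undefined
}

// Fills missing times by spreading each run of untimed tokens evenly over the gap
// between the previous token's end and the next timed token's start.
// Segment boundaries are used when the run touches the start or end of the list.
function fillMissingTimes(tokens: TimedToken[], segmentStart: number, segmentEnd: number): Required<TimedToken>[] {
  const result = tokens.map(token => ({ ...token }))

  let i = 0

  while (i < result.length) {
    if (isTimed(result[i])) {
      i++
      continue
    }

    // Find the first timed token after this run (or the end of the list)
    let runEnd = i

    while (runEnd < result.length && !isTimed(result[runEnd])) {
      runEnd++
    }

    const gapStart = i > 0 ? result[i - 1].endTime! : segmentStart
    const gapEnd = runEnd < result.length ? result[runEnd].startTime! : segmentEnd

    // Clamp to zero so inconsistent neighbors never produce negative durations
    const step = Math.max(gapEnd - gapStart, 0) / (runEnd - i)

    for (let j = i; j < runEnd; j++) {
      result[j].startTime = gapStart + (j - i) * step
      result[j].endTime = gapStart + (j - i + 1) * step
    }

    i = runEnd
  }

  return result as Required<TimedToken>[]
}
```

The estimated times are only placeholders for alignment purposes; they keep downstream code from failing but carry no information from the model itself.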

I'm not actively developing this package at the moment (busy with other things), so I can't really predict exactly when the workaround will be published (maybe a few weeks, I don't know).

smoores-dev commented 1 month ago

Yeah, I'll open an issue against whisper.cpp as well; hopefully they'll fix it on their end! Thanks for taking a look.

smoores-dev commented 1 month ago

Would it be easier if I were to open a PR that attempted to work around this as you described, by looking at the timestamps/offsets of the surrounding tokens? I know that PR review can also be quite a bit of work, so no worries if you'd rather handle it yourself! I was just reminded of the monstrous number of open issues against the whisper.cpp repo haha

rotemdan commented 1 month ago

I don't think I need or want pull requests (so far I've closed the two that I got). This has been a personal project of mine. Maybe I'd prefer to keep the code 100% my own for now.

Even if I get the code, I can't guarantee when it is going to be published since I have other partially committed code destined for the next release.

Also, testing that it works correctly may take more time than actually writing the code.

So, no need for a pull request. I can try to quickly write and test a workaround locally, but it's not likely to be published during the next week (or possibly a bit longer).

smoores-dev commented 1 month ago

Understood, sounds good!