chidiwilliams / buzz

Buzz transcribes and translates audio offline on your personal computer. Powered by OpenAI's Whisper.
https://chidiwilliams.github.io/buzz
MIT License

Problems with the App Store edition #818

Open WeCanSee opened 1 week ago

WeCanSee commented 1 week ago

I am using version 1.0.2 (137) on macOS 14.6 (23G5052d). I uploaded an MP3 file, about 44'30" long and 45 MB, and used the large model with Core ML. When transcription finished, I found some errors in the generated SRT file:

1. The timestamps in the subtitle file are wrong: they have only second-level precision, not millisecond-level. I can't use the file directly and had to adjust the timeline manually. (screenshot attached)

2. Three passages were not transcribed correctly; in the SRT file they contain only a large amount of repetitive text. This happened three times in this file, and no matter how often I rerun the transcription, the result is the same. In total, about 7 minutes of audio in this file were not transcribed correctly. (screenshot attached)

raivisdejus commented 6 days ago

If you need millisecond precision, AI models like Whisper will not be able to deliver it. This is a limitation of how they are built, and there is nothing we can do about it. Most likely no AI transcription tool will get millisecond precision right.

When I needed millisecond precision, I used https://github.com/echogarden-project/echogarden. It uses a different algorithm to align the text to the audio, and the precision is much better. I prepared a text file with one sentence per line and used echogarden's "forced alignment" feature; the result was quite good. In general, for millisecond precision you need some "forced alignment" tool. Some other options are listed at https://github.com/topics/forced-alignment
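
For reference, here is a minimal sketch of driving echogarden's forced alignment from Python. It assumes echogarden is installed globally via npm and that its `align` subcommand takes the audio file, a plain-text transcript, and output paths, as its README describes; all file names below are placeholders:

```python
# Minimal sketch: run echogarden's forced alignment from Python.
# Assumes `npm install -g echogarden` has been run and that the CLI shape
# is `echogarden align <audio> <transcript> <outputs...>` per its README.
import subprocess

subprocess.run(
    [
        "echogarden", "align",
        "audio.mp3",        # the source audio (placeholder path)
        "transcript.txt",   # one sentence per line, as described above
        "aligned.srt",      # aligned subtitles with millisecond timestamps
    ],
    check=True,  # raise if echogarden exits with an error
)
```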

Regarding errors in the transcript, try the large-v2 or large-v3 model; those may improve accuracy. Also try "Faster Whisper": it uses the large-v2 model when you transcribe with "large" selected.
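
To illustrate, here is a minimal sketch using the faster-whisper Python package directly, which is roughly what the "Faster Whisper" option wraps; the audio path is a placeholder, and the App Store build won't run this itself:

```python
# Minimal sketch using the faster-whisper package directly.
from faster_whisper import WhisperModel

# "large-v2" downloads the CTranslate2 conversion of the large-v2 model.
model = WhisperModel("large-v2", device="auto", compute_type="int8")
segments, info = model.transcribe("audio.mp3")

for segment in segments:
    # Segment boundaries are floats in seconds, so sub-second detail is
    # available for SRT output (e.g. 00:01:02,345), even though the model
    # itself is not reliable at true millisecond resolution.
    print(f"[{segment.start:.3f} -> {segment.end:.3f}] {segment.text}")
```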

WeCanSee commented 6 days ago

Oh, thank you for providing such a detailed answer. One more question: I use the App Store version, so how can I use the large-v3 model in the Buzz Captions app? Can I just download the large-v3 model file and use it in Buzz Captions?

raivisdejus commented 6 days ago

On the existing App Store version you may be able to use the "Hugging Face Whisper" model type with openai/whisper-large-v3 as the model. This Whisper type does not provide an option for word-level timestamps, but regular speech recognition should work.
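
For what it's worth, that model type corresponds, as far as I know, to the Hugging Face transformers pipeline sketched below (an assumption about Buzz's internals; the audio path is a placeholder):

```python
# Rough sketch of the transformers pipeline that the "Hugging Face Whisper"
# model type appears to wrap; "audio.mp3" is a placeholder path.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# chunk_length_s lets the pipeline process audio longer than Whisper's
# 30-second window; return_timestamps=True adds segment-level timestamps.
result = asr("audio.mp3", return_timestamps=True, chunk_length_s=30)

print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```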

Please note that the large-v3 model can hallucinate, i.e. recognize words that are not in the speech. In this respect openai/whisper-large-v2 may be better, as it does not seem to have this problem.

Alternatively, you can try the latest development version from a GitHub Actions run. Log in to GitHub and look for the build artifacts at the bottom of an Actions run page, e.g. https://github.com/chidiwilliams/buzz/actions/runs/9656460007