chidiwilliams / buzz

Buzz transcribes and translates audio offline on your personal computer. Powered by OpenAI's Whisper.
https://chidiwilliams.github.io/buzz
MIT License
12.67k stars, 950 forks

Feature request: Auto-sync: Turn untimed transcript text into captions like YouTube Studio and Descript #693

Open candideu opened 8 months ago

candideu commented 8 months ago

Thanks for the great software! I wanted to share the following feature request:

YouTube allows people to paste or upload a plain text transcript for a video, and it will automatically assign timings for the text to generate timed captions, which can then be exported as .srt, .vtt, etc. This feature is called Auto-Sync. Descript offers a similar syncing feature.

The workflow would be as follows: the user has a video or audio file and a text transcript of that media. The text has no timing information, and the user would like to generate captions/subtitles from it. So the user provides the media and the text, and the platform automatically adds timings and syncs the text to the media.
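To make the idea concrete, here is a rough sketch of one way such syncing could be done locally. It is not how YouTube or Descript implement it, and it is not an existing Buzz feature. It assumes a speech recognizer has already produced word-level timestamps for the audio; the `asr_words` input and the function names `align_script` and `to_srt` are made up for illustration. The script's words are matched against the recognized words with Python's `difflib`, and each caption line takes its timing from the first and last word that matched.

```python
import difflib
import re


def normalize(word):
    # Lowercase and strip punctuation so "Hello," matches "hello".
    return re.sub(r"[^\w']", "", word.lower())


def align_script(script_lines, asr_words):
    """Return [start, end] seconds for each script line (None if unmatched).

    script_lines: caption lines from the untimed transcript.
    asr_words: (word, start_sec, end_sec) tuples from a recognizer
               (hypothetical input; any engine with word timestamps works).
    """
    # Flatten the script into words, remembering which line each came from.
    script_tokens = [
        (normalize(w), i)
        for i, line in enumerate(script_lines)
        for w in line.split()
    ]
    a = [tok for tok, _ in script_tokens]
    b = [normalize(w) for w, _, _ in asr_words]
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)

    times = [[None, None] for _ in script_lines]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            line_idx = script_tokens[block.a + k][1]
            _, start, end = asr_words[block.b + k]
            if times[line_idx][0] is None:
                times[line_idx][0] = start  # first matched word in the line
            times[line_idx][1] = end        # last matched word so far
    return times


def fmt(t):
    # Seconds -> SRT timestamp, e.g. 1.8 -> "00:00:01,800".
    ms = round(t * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def to_srt(script_lines, times):
    cues = []
    for line, (start, end) in zip(script_lines, times):
        if start is None:
            continue  # no word of this line was recognized; skipped here
        cues.append(f"{len(cues) + 1}\n{fmt(start)} --> {fmt(end)}\n{line}\n")
    return "\n".join(cues)


script = ["Hello, Dr. Nguyen.", "Welcome to Buzz!"]
words = [("hello", 0.0, 0.4), ("doctor", 0.5, 0.9), ("win", 1.0, 1.3),
         ("welcome", 1.8, 2.2), ("to", 2.2, 2.4), ("buzz", 2.4, 2.9)]
print(to_srt(script, align_script(script, words)))
```

Note that the caption text comes from the script, so the misrecognized name keeps its correct spelling; the recognizer output is only used for timing. A real implementation would also need to interpolate timings for lines with no matched words and extend cues to cover unmatched words at line edges.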

This tutorial provides a good demo of the feature on Descript: https://www.youtube.com/watch?v=dzAexFlyNNI

wzgrx commented 8 months ago

There is currently no tool that can directly translate .srt or .txt subtitle files into other languages. I have searched through GitHub, but have not yet found such software.

Sircam19 commented 8 months ago

You're right about there not being a dedicated tool, although I have been able to use DeepL. I copy the SRT, paste it into DeepL, and select the target language. All of the timings are preserved, but the SRT is now in the target language.
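For anyone who wants to script that instead of copy-pasting, a minimal sketch: walk through the SRT line by line, leave cue numbers, timing lines, and blank lines untouched, and pass only the caption text to a translator. The `translate` argument is a placeholder for whatever translation function you have; it is not a real DeepL call, and `translate_srt` is a made-up name.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def translate_srt(srt_text, translate):
    """Translate caption text only; cue numbers and timings are kept as-is.

    translate: any callable str -> str (hypothetical; plug in your own).
    """
    out = []
    for line in srt_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.isdigit() or TIMING.match(stripped):
            out.append(line)  # blank line, cue number, or timing line
        else:
            out.append(translate(line))
    return "\n".join(out)


srt = "1\n00:00:00,000 --> 00:00:01,000\nhello\n\n2\n00:00:01,000 --> 00:00:02,000\nworld"
# str.upper stands in for a real translator in this demo.
print(translate_srt(srt, str.upper))
```

One limitation of this sketch: a caption line consisting only of digits would be mistaken for a cue number and left untranslated.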

candideu commented 8 months ago

Hi @wzgrx @Sircam19: This issue isn't about translating captions to another language (which is already possible in YouTube (via Downsub), Memo AI (instructions), etc.).

It's about automatically adding timings to a text transcript that originally had no timings, based on the audio file (in the same language).

Sircam19 commented 8 months ago

Thanks Candideau for clarifying. Was just trying to help. Where is the most appropriate spot to ask about the most recent update being ported over to Mac? I haven't seen it in the Mac App Store or among the other releases. Cheers and thanks muchly.

candideu commented 8 months ago

@Sircam19 I would suggest searching existing Issues and Discussions on the relevant topic and adding comments to those, or creating your own Issue or Discussion post.

Good luck!

raivisdejus commented 3 months ago

Since version 1.0.0, Buzz supports translations via AI. For more information, see https://chidiwilliams.github.io/buzz/docs/usage/translations

Regarding adding timings to the existing text, I am doubtful Buzz will integrate this feature any time soon. What is the use case? Is the transcription quality in your language poor, and is that the reason you would like to use the existing text?

Currently you can just ignore the existing text you may have and generate new subtitles from the audio or video. So I am curious why this scenario does not suit you and why you would like to use the existing text.

candideu commented 3 months ago

> Regarding adding timings to the existing text, I am doubtful Buzz will integrate this feature any time soon. What is the use case? Is the transcription quality in your language poor, and is that the reason you would like to use the existing text?
>
> Currently you can just ignore the existing text you may have and generate new subtitles from the audio or video. So I am curious why this scenario does not suit you and why you would like to use the existing text.

@raivisdejus If I write up a script and then read that script for a video (because I memorized it, or because I was using a teleprompter), it's a lot more efficient to add timed captions to that existing text than to generate new captions, which won't be 100% accurate (because of misspelled names, acronyms, punctuation that is off, homophones, etc.). It saves the hassle of having to go over the generated captions to correct mistakes. I've used Whisper in both French and English, two well-supported languages, and it never gets it 100% right.

I've had to do this in the past for the following kinds of footage:

  1. Live speeches
  2. Recorded presentations and lectures
  3. Scripted fiction (TV episodes, films -- anything with an existing script)
  4. News anchor segments
  5. Livestreamed intros

YouTube and Descript are the only tools I've found that have this feature, but they are proprietary and require users to upload their content to the companies' servers. There are certain industries where sensitive information can only be processed locally, which is what makes a FOSS tool like Buzz important.