fortypercnt / stream-translator

Your Repo is one of the best at real time translation #5

Closed codenan42 closed 1 year ago

codenan42 commented 1 year ago

Would you mind updating this to use the faster-whisper repo, e.g. https://github.com/guillaumekln/faster-whisper?

Or maybe add support for any of the Hugging Face Whisper models (https://huggingface.co/models?other=whisper), similar to the implementation in https://github.com/chidiwilliams/buzz/issues. Buzz may also be using a different repo than the OpenAI one, because it seems faster than the original. Maybe you want to look into that.

Of course, these are just suggestions. Your repo is already working fine as it is. Thank you!

fortypercnt commented 1 year ago

Thanks for the kind words and the suggestions! I went ahead and added an option to use faster-whisper instead of the OpenAI implementation. faster-whisper does not offer the same kind of prompt prefixing feature that I used to implement the history buffer with OpenAI's version, so I left that feature out for now. I have only done limited testing, but it seems great: very noticeable improvements in inference time and VRAM usage, and the translation quality doesn't seem much different. Please try it out and tell me what you think!
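For context, the prompt prefixing I mean works roughly like this with the OpenAI implementation (a minimal sketch, not the repo's exact code; the buffer length and names are illustrative):

```python
import whisper

model = whisper.load_model("medium")
history = ""  # rolling buffer of previously translated text

def translate_chunk(audio_chunk):
    global history
    # initial_prompt conditions the decoder on earlier output, which helps
    # keep names and context consistent across windows.
    result = model.transcribe(audio_chunk, task="translate",
                              initial_prompt=history)
    history = (history + " " + result["text"])[-200:]  # keep a short tail
    return result["text"]
```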

codenan42 commented 1 year ago

Hi, oh wow, thank you for the fast update! This is great!

I ran everything on Google Colab so my PC doesn't have to do the heavy work while I watch a stream live, and it works wonderfully. The translation is almost instant; sometimes it takes a bit longer because it stitches together and translates a really long sentence, but even then the result comes out quickly. I tried it on a stream with more than one speaker and it still keeps up with the stream.

I only use the standard flags: --language, --task translate, --use_faster_whisper, and --faster_whisper_model_path.
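For anyone else reading, the full command looks something like this (the entry-point name translator.py, the stream URL, and the language code are just examples; check the repo's README for the exact usage):

```
python translator.py "https://www.twitch.tv/example_channel" --language ja --task translate --use_faster_whisper --faster_whisper_model_path whisper-medium-ct2
```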

Below is the Colab memory consumption while running the medium model with the new faster-whisper addition using your repo:

[screenshot: Colab memory usage]

With the addition of faster-whisper, I can now convert pretty much any openai-whisper model out there that has been fine-tuned for a certain language, which is nice. Although on Colab I do have a problem converting any large model; I guess 12 GB of RAM is not sufficient to convert the model to the CTranslate2 format. But that is not your repo's issue.
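For reference, this is roughly the conversion I mean (a minimal sketch using ctranslate2's Python converter API; the model id and output directory are just examples):

```python
import ctranslate2

# Convert a Hugging Face Whisper checkpoint into the CTranslate2 format
# that faster-whisper loads. float16 quantization shrinks the memory needed.
converter = ctranslate2.converters.TransformersConverter(
    "openai/whisper-medium"  # any fine-tuned Whisper repo id should work
)
converter.convert("whisper-medium-ct2", quantization="float16")
```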

Other than that, there's pretty much nothing else to add. The only obvious problem now is probably what they call "Whisper hallucination", where it sometimes shows lines like "Thank you for watching!" and repeats them multiple times (it's getting a bit frequent, too), which I guess is something to do with the original OpenAI model. I think it only happens with non-English models.

Suggestion: would it be possible to have a negative prompt flag? Since we can pretty much guess that the model will sometimes spam lines like "Thank you for watching!" or "Thank you for subscribing" out of nowhere, we could input a negative prompt first and prevent the script from showing them. I don't know if it's even necessary or would make things better; it could make them worse. It's not a big problem anyway :)

You really did a great job with the new addition. I'm glad I can run it on the Colab free tier and now use it to watch pretty much any non-English live stream without feeling left out.

fortypercnt commented 1 year ago

Thanks for your detailed response! I'm happy that the repo is useful to others. For non-English languages I would definitely try to get the large model working if possible. Maybe you can upload it to Google Drive and load it into Colab from there? But I don't know, I don't use Colab for this.

The hallucination thing is very annoying; I added the additional voice activity detection to counteract it. You could try setting the default threshold for the VAD to something higher than 0.5 here. I didn't spend time optimizing any of these parameters.
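For illustration, the thresholding works roughly like this (a minimal sketch assuming Silero VAD as the detector; the names and the 16 kHz chunking are illustrative, not the repo's exact code):

```python
import torch

# Silero VAD returns a speech probability for each audio chunk.
vad_model, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")

VAD_THRESHOLD = 0.5  # raising this (e.g. to 0.7) discards more borderline chunks

def is_speech(chunk: torch.Tensor) -> bool:
    # chunk: mono float32 samples at 16 kHz
    speech_prob = vad_model(chunk, 16000).item()
    return speech_prob >= VAD_THRESHOLD
```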

Another thing that leads to repetitions, and also to hiccups / hanging when the translation is difficult or unreliable, is the temperature decoding fallback implemented in both models. I didn't expose it in the CLI, but you could try setting temperature=0 in the call to model.transcribe(...); that prevents retrying "failed" decodes and hanging. I also didn't spend much time optimizing these decoding options and just trusted the people who implemented the models to have chosen sensible default values. Those defaults might not be optimal for livestream decoding with small windows, though. I'd be happy for anyone who experiments with these settings to share their findings.
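Concretely, the change looks something like this (a minimal sketch with the OpenAI package; the input file is just an example stand-in for an audio window):

```python
import whisper

model = whisper.load_model("medium")
audio = whisper.load_audio("chunk.wav")  # illustrative input window

# Default temperature is the tuple (0.0, 0.2, 0.4, 0.6, 0.8, 1.0): if a decode
# fails the compression-ratio / log-prob checks, whisper retries at the next
# temperature. Passing a single 0 means: decode greedily once, never retry.
result = model.transcribe(audio, task="translate", temperature=0)
print(result["text"])
```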

Unfortunately, the whole whisper architecture is not really built for live / real-time decoding. I think the best solution for live decoding would be to use sliding windows and update older tokens based on new info, instead of the simple disjoint window method I implemented. Updating tokens does require fiddling with the model's internals though, and I'm not really looking to go into that. If you come across someone implementing this, I'd be glad to steal their code. 😁
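To make the distinction concrete, here is a toy sketch of the two windowing strategies (illustrative only, not the repo's code; the window and hop sizes are made up):

```python
# Disjoint windows (the simple method): each chunk is decoded independently,
# so context is lost at every boundary.
def disjoint_windows(samples, window=5 * 16000):
    for start in range(0, len(samples), window):
        yield samples[start:start + window]

# Sliding windows: consecutive chunks overlap, so earlier tokens could be
# re-decoded and corrected with new context -- but merging the overlapping
# transcripts is the part that requires poking at the model's internals.
def sliding_windows(samples, window=5 * 16000, hop=2 * 16000):
    for start in range(0, max(len(samples) - window, 0) + 1, hop):
        yield samples[start:start + window]
```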

As for negative prompts, I only know them from image generators like Stable Diffusion. Can you link me something about using them with whisper, or was it just a random idea? Implementing something like that from scratch is way above my pay grade. 😅

codenan42 commented 1 year ago

Thanks for the suggestions on the VAD. I've tried increasing the threshold and it does help reduce the repetitions. I also loaded the large model I converted on my PC through Google Drive, and it ran just fine on Colab.

I've tried all kinds of whisper projects, and yeah, this technology is just not made for real-time translation. I have yet to see anyone make a breakthrough that works better than the approach your repo is using now.

Regarding negative prompts, it was just a random idea that I thought could be helpful. Yeah, I got it from SD :p I'm not sure if it's even possible to implement with whisper. I'll look into it and let you know if I find anything new that is useful for this repo :)

Thanks again for putting this repo together and making it work! It's made a big difference in my ability to watch livestreams in other languages.