[Feature Request] Live Camera Input

nitchevcasseus commented 2 months ago

I'd like the ability to use my webcam as an input in place of live-streaming URLs. Users could then use it for real-time translation of their own live streams.

antor44 commented 2 months ago

Hi again! I'm always keen to add more features to the app. I'm not sure whether you mean sending or receiving video, but after some research I realized that adding webcam support might be easier than I thought, especially on Linux. The main challenges are that the required code could vary depending on the webcam, and that sending translated text through the video could be tricky. In Python, the OpenCV library can be used to overlay transcribed text on video, and transcribing the webcam's sound should be easier, since it's similar to using a live mic, for which the code is well known. However, keeping everything in the bash script is my priority. It's easy to add sound transcription with ffmpeg in the bash scripts, but adding text to video from a bash script is a different matter. I'll need to do some more research on this.
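As a rough illustration of the "webcam sound is like a live mic" point above, here is a minimal Python sketch that builds an ffmpeg command for recording a short chunk of audio on Linux. The function names, the `hw:1,0` device name, and the output path are hypothetical, and the 16 kHz mono format is only an assumption about what the transcription step expects; it is not how the app itself does it.

```python
import shlex
import subprocess


def build_capture_cmd(device="default", seconds=5, out_path="chunk.wav", backend="alsa"):
    """Return an ffmpeg argv that records `seconds` of audio as 16 kHz mono WAV.

    `device` and `backend` are assumptions: on Linux the input format is
    typically "alsa" or "pulse", and the device name depends on the webcam.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-y",                 # overwrite the output file if it exists
        "-f", backend,        # input format (audio capture backend)
        "-i", device,         # capture device, e.g. the webcam microphone
        "-t", str(seconds),   # stop after this many seconds
        "-ac", "1",           # mono
        "-ar", "16000",       # 16 kHz sample rate
        out_path,
    ]


def record_chunk(**kwargs):
    """Run ffmpeg and return the path of the recorded file."""
    cmd = build_capture_cmd(**kwargs)
    subprocess.run(cmd, check=True)
    return cmd[-1]


if __name__ == "__main__":
    # Print the command instead of running it, so no audio device is needed.
    print(shlex.join(build_capture_cmd(device="hw:1,0")))
```

Running `arecord -l` lists the ALSA capture devices, which is one way to find the name to pass as `device`.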

nitchevcasseus commented 2 months ago

No, just sound input to text-to-speech output is fine. I was thinking users could basically have a real-time translation app.

For example like: META Seamless Streaming - DEMO VIDEO

antor44 commented 2 months ago

Maybe it's not so difficult to transcribe or translate a sound stream, or to perform text-to-speech on the sound that another person sends. For this to work, the same app needs to be running on each computer whenever the conference app is in use. For other scenarios, however, there are many problems that I believe go beyond the scope of this application. First, to send anything out over the internet, a PC needs a client-server system, either P2P or with a dedicated server for users. Other issues include the vast variety of webcams and sound systems; I can only test a very few. In addition, I consider OpenAI's Whisper the best transcription AI, but the current version has significant issues such as hallucinations. Apart from this, text-to-speech and translation into languages other than English are performed over the internet, thanks to the translate-shell app, which uses a free Google service. The availability of this service is not guaranteed, and the text-to-speech feature only works for chunks of a few seconds. Achieving the same with a local AI is not as effective, because local AIs require more powerful hardware than a typical computer has. Nonetheless, I believe transcribing the input sound from another person could suffice, and I need to research this option further. Before that, though, I must integrate a VAD system, which has given me some difficulties.
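One way to work around the "chunks of a few seconds" limit mentioned above is to split the text into short pieces before handing each one to translate-shell. The sketch below is only an illustration: the function names and the 100-character limit are assumptions, not values taken from the app, and the `trans -b :<lang> <text>` form is the basic brief-translation invocation of translate-shell.

```python
def split_for_tts(text, max_chars=100):
    """Split text at whitespace into chunks of at most `max_chars` characters.

    A single word longer than `max_chars` is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def build_trans_cmd(chunk, target="es"):
    """Return an argv for a brief translation of one chunk with translate-shell."""
    return ["trans", "-b", f":{target}", chunk]


if __name__ == "__main__":
    sentence = "Transcribing the input sound from another person could suffice"
    for piece in split_for_tts(sentence, max_chars=30):
        print(build_trans_cmd(piece))
```

Each argv could then be passed to `subprocess.run`, one chunk at a time, so that no single request exceeds the length the online service accepts.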

nitchevcasseus commented 2 months ago

For the VAD system, maybe you can find a little bit of inspiration from this repo: https://github.com/BasedHardware/Friend

It's able to stream audio for 24+ hours, so I'm sure there's a clue in there that could help you find a breakthrough!

I'll see what else I can find.

antor44 commented 2 months ago

Thanks for the suggestion. I had several issues, and not only with the VAD system I was testing, which wasn't the best. The repo uses Silk VAD, low-level VAD code developed by Skype. It's probably considered really good, and it has an open-source license. Anyway, I'm more focused on high-level programming with bash, or at least with Python. I found Silero VAD, an easy, open-source, AI-based solution that may be one of the best out there: Silero VAD
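To make concrete what a VAD does in this pipeline, here is a minimal energy-threshold sketch in plain Python. It is deliberately not Silero VAD or Silk VAD, both of which are far more robust; it only shows the idea of marking which stretches of audio contain sound loud enough to be worth transcribing. The function names, the 30 ms frame size, and the threshold are all assumptions chosen for illustration.

```python
import math


def rms(frame):
    """Root-mean-square amplitude of one frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(samples, sample_rate=16000, frame_ms=30, threshold=500.0):
    """Return (start_sec, end_sec) spans whose frame RMS reaches `threshold`.

    `samples` is a sequence of signed PCM values. Only whole frames are
    examined; a trailing partial frame is ignored.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    spans = []
    start = None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        active = rms(samples[i:i + frame_len]) >= threshold
        if active and start is None:
            start = i                      # a loud stretch begins
        elif not active and start is not None:
            spans.append((start / sample_rate, i / sample_rate))
            start = None                   # the loud stretch ended
    if start is not None:
        # Audio ended while still active: close the span at the last whole frame.
        end = (len(samples) // frame_len) * frame_len
        spans.append((start / sample_rate, end / sample_rate))
    return spans


if __name__ == "__main__":
    # 20 ms of silence, 30 ms of loud signal, 20 ms of silence at 1 kHz.
    audio = [0] * 20 + [1000] * 30 + [0] * 20
    print(detect_speech(audio, sample_rate=1000, frame_ms=10))
```

Only the detected spans would be passed on to the transcriber, which is the main reason a VAD helps: Whisper is not asked to transcribe silence.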

nitchevcasseus commented 2 months ago

I also found these. They're really good for near-instant transcription and text-to-speech:

RealtimeTTS

RealtimeSTT

They're in Python. I'll keep looking around to see if I find anything else that can help you out. RealtimeSTT actually uses Silero VAD as one of its libraries.

antor44 commented 2 months ago

RealtimeSTT performs the same function as what I'm currently working on: recording from both the microphone and "what you hear", meaning the audio output from the speakers, which I believe is what you requested. This is achieved using loopback audio, and the sound card has to be configured for it in the operating system settings. Any recording application should be capable of the same thing, including speech-to-text (STT) apps that transcribe audio from the speakers. RealtimeSTT is based on Faster-Whisper, which is the same one I've been working with. This implementation of Whisper integrates Silero VAD (Voice Activity Detection). It differs from the Whisper implementation I'm currently using; in some respects it's better, in others not as good. However, I'd need to integrate both versions, which is not my preference: I believe it may make the program too complicated, or I may need to approach it differently. In addition, Faster-Whisper and Silero VAD require extra installations, such as libraries and other AI models.
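For reference, transcribing a recorded file with Faster-Whisper and its built-in VAD filter can be sketched as follows. This assumes the `faster-whisper` package is installed and that a model can be downloaded on first use; the helper names, the `small` model size, and the `recording.wav` path are hypothetical, and this is not the code used by the app or by RealtimeSTT.

```python
from types import SimpleNamespace


def format_segments(segments):
    """Render segments as '[start -> end] text' lines."""
    return [f"[{s.start:.2f} -> {s.end:.2f}] {s.text.strip()}" for s in segments]


def transcribe_file(path, model_size="small"):
    """Transcribe an audio file with Faster-Whisper, skipping non-speech parts."""
    # Imported lazily so the formatting helper works without the package.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # vad_filter=True enables the bundled Silero VAD to drop silent stretches.
    segments, _info = model.transcribe(path, vad_filter=True)
    return format_segments(segments)


if __name__ == "__main__":
    # Stand-in segment so the demo runs without a model or an audio file.
    demo = [SimpleNamespace(start=0.0, end=1.5, text=" hello there")]
    print("\n".join(format_segments(demo)))
    # With the package installed: print("\n".join(transcribe_file("recording.wav")))
```

The file passed to `transcribe_file` could be a loopback recording of the speakers, which is how the "what you hear" case described above would reach the transcriber.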

Regarding RealtimeTTS, that app is based on online text-to-speech services. I don't think it's better than translate-shell, the app used by Playlist4whisper.

antor44 / livestream_video

[Feature Request] Live Camera Input #2