cyberofficial / Synthalingua

Synthalingua - Real Time Translation
http://synthalingua.site/
GNU Affero General Public License v3.0

Synthalingua

Wiki

Read the wiki here

About

Synthalingua is an advanced, self-hosted tool that leverages the power of artificial intelligence to translate audio from various languages into English in near real time, offering the possibility of multilingual outputs. This innovative solution utilizes both GPU and CPU resources to handle the input transcription and translation, ensuring optimized performance. Although it is currently in beta and not perfect, Synthalingua is actively being developed and will receive regular updates to further enhance its capabilities.

Developed Proudly in PyCharm IDE from JetBrains

JetBrains kindly approved me for an OSS license for their software for use on this project. This will greatly improve my production rate.

Learn about it here: https://jb.gg/OpenSourceSupport

You can grab the GUI from the Releases section on GitHub.

Grab the portable version on itch!

This README will be updated over time. It is a work in progress.

Table of Contents

| Table of Contents | Description |
| --- | --- |
| Disclaimer | Things to know/Disclaimers/Warnings/etc |
| To Do List | Things to do |
| Contributors | People who helped with the project or contributed to the project. |
| Installing/Setup | How to install and set up the tool. |
| Misc | Usage and File Arguments - Examples - Web Server |
| Troubleshooting | Common issues and how to fix them. |
| Additional Info | Additional information about the tool. |
| Video Demos | Video demonstrations of the tool. |
| Extra Notes | Extra notes about the tool. |

Things to know/Disclaimers/Warnings/etc

TODO

Todo Sub-Task Status
Add support for AMD GPUs. ROCm support - WSL 2.0/Linux Only
OpenCL support - Linux Only
Add support API access.
Custom localhost web server.
Add reverse translation.
Localize script to other languages. (Will take place after reverse translations.)
Custom dictionary support.
GUI.
Sub Title Creation
Linux support.
Improve performance.
Compressed Model Format for lower ram users
Better large model loading speed
Split model up into multiple chunks based on usage
Stream Audio from URL
Increase model swapping accuracy.
No Microphone Required Streaming Module
Server Control Panel Currently being worked on; it will come out in a future release. I want to get this out as soon as possible, but I've been running into roadblocks. This is a higher-priority feature, so please keep an eye out for a future dev blog with more details and previews! 🚧

Contributors

Guidelines

@DaniruKun - https://watsonindustries.live

@Expletive - https://evitelpxe.neocities.org

@Adenser

System Requirements

| Supported GPUs | Description |
| --- | --- |
| Nvidia Dedicated Graphics | Supported |
| Nvidia Integrated Graphics | Tested - Not Supported |
| AMD/ATI | * Linux Verified |
| Intel Arc | Not Supported |
| Intel HD | Not Supported |
| Intel iGPU | Not Supported |

GUI Portable Version (not the CLI portable)

You can find the full list of supported Nvidia GPUs here:

| Requirement | Minimum | Moderate | Recommended | Best Performance |
| --- | --- | --- | --- | --- |
| CPU Cores | 2 | 6 | 8 | 16 |
| CPU Clock Speed (GHz) | 2.5 or higher | 3.0 or higher | 3.5 or higher | 4.0 or higher |
| RAM (GB) | 4 or higher | 8 or higher | 16 or higher | 16 or higher |
| GPU VRAM (GB) | 2 or higher | 6 or higher | 8 or higher | 12 or higher |
| Free Disk Space (GB) | 10 or higher | 10 or higher | 10 or higher | 10 or higher |
| GPU (suggested) | Nvidia GTX 1050 or higher | Nvidia GTX 1660 or higher | Nvidia RTX 3070 or higher | Nvidia RTX 3090 or higher |

As long as the GPU you have is within the VRAM spec, it should work fine.

Note:

The tool will work on any system that meets the minimum requirements, better on systems that meet the recommended requirements, and best on systems that meet the best performance requirements. You can mix and match the requirements. For example, you can have a CPU that meets the best performance requirements and a GPU that meets the moderate requirements.

A microphone is optional. You can use the --stream flag to stream audio from an HLS stream. See Examples for more information.

You'll need some sort of software input source (or hardware source). See issue #63 for additional information.

Installation

  1. Download and install Python 3.10.9.
    • Make sure to check the box that says "Add Python to PATH" when installing. If you don't check the box, you will have to manually add Python to your PATH. You can check this guide: How to add Python to PATH.
    • You can choose any Python 3.10 release from 3.10.9 onward. The tool will not work on Python 3.11 or higher, so the version must be 3.10.9+ and not 3.11.x.
    • Make sure to grab the x64 bit version! This program is not compatible with x86. (32bit)
  2. Download and install Git.
    • Using default settings is fine.
  3. Download and install FFmpeg.
  4. Download and install CUDA [Optional, but needs to be installed if using GPU]
  5. Run setup script
    • On Windows: setup.bat
    • On Linux: setup.bash
      • Please ensure you have gcc and portaudio19-dev installed (or portaudio-devel on some machines).
    • If you get an error saying "Setup.bat is not recognized as an internal or external command, operable program or batch file.", the problem lies with your operating system rather than with this tool, and you will need to fix your operating system first.
  6. Run the newly created batch file/bash script. You can edit that file to change the settings.
    • If you get an error saying it is "not recognized as an internal or external command, operable program or batch file.", make sure you have Python installed and added to your PATH, and make sure you have Git installed. If you have Python and Git installed and added to your PATH, then create a new issue on the repo and I will try to help you fix the issue.
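The version and architecture requirements in step 1 can be checked before running the setup script. The following is a minimal sketch, not part of Synthalingua itself; the function name check_interpreter is hypothetical, and the rule it encodes (a 64-bit Python 3.10.x with a patch level of 9 or higher) is simply my reading of the bullets above.

```python
import struct
import sys


def check_interpreter(version=None, pointer_bits=None):
    """Return a list of problems with the interpreter; an empty list means it looks fine.

    Hypothetical helper: the arguments default to the running interpreter and
    exist only so the check can be exercised with other values.
    """
    if version is None:
        version = tuple(sys.version_info[:3])
    if pointer_bits is None:
        # Size of a C pointer in bits: 64 on an x64 build, 32 on an x86 build.
        pointer_bits = struct.calcsize("P") * 8

    problems = []
    major, minor, micro = version
    if (major, minor) != (3, 10) or micro < 9:
        problems.append(
            f"Python {major}.{minor}.{micro} found; need 3.10.9 or a later 3.10.x release"
        )
    if pointer_bits != 64:
        problems.append(f"{pointer_bits}-bit interpreter found; need the 64-bit build")
    return problems


if __name__ == "__main__":
    for problem in check_interpreter():
        print("WARNING:", problem)
```

Running the file with the interpreter you plan to use prints one warning per problem and nothing if the interpreter matches.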

Usage

This script uses argparse to accept command line arguments. The following options are available:

| Flag | Description |
| --- | --- |
| --ram | Change the amount of RAM to use. Default is 4GB. Choices are "1GB", "2GB", "4GB", "6GB", "12GB". |
| --ramforce | Use this flag to force the script to use the desired VRAM. May cause the script to crash if there is not enough VRAM available. |
| --fp16 | This allows for more accurate information being passed to the process. This will grant the AI the ability to process more information at the cost of speed. You will not see heavy impact on stronger hardware. Combine the 12gb-v3 and fp16 flags (Precision Mode on the GUI) for the ultimate experience. |
| --energy_threshold | Set the energy level for the microphone to detect. Default is 100. Choose from 1 to 1000; the higher the value, the harder it is to trigger the audio detection. |
| --mic_calibration_time | How long to calibrate the mic for, in seconds. To skip user input, type 0 and the time will be set to 5 seconds. |
| --record_timeout | Set the time in seconds for real-time recording. Default is 2 seconds. |
| --phrase_timeout | Set the time in seconds for empty space between recordings before considering it a new line in the transcription. Default is 1 second. |
| --translate | Translate the transcriptions to English. Enables translation. |
| --transcribe | Transcribe the audio to a set target language. The --target_language flag is required. |
| --target_language | Select the language to translate to. Available choices are a list of languages in ISO 639-1 format, as well as their English names. |
| --language | Select the language to translate from. Available choices are a list of languages in ISO 639-1 format, as well as their English names. |
| --auto_model_swap | Automatically swap the model based on the detected language. Enables automatic model swapping. |
| --device | Select the device to use for the model. Default is "cuda" if available. Available options are "cpu" and "cuda". When setting to CPU you can choose any RAM size as long as you have enough RAM. The CPU option is optimized for multi-threading, so if you have, say, 16 cores and 32 threads, you can see good results. |
| --cuda_device | Select the CUDA device to use for the model. Default is 0. |
| --discord_webhook | Set the Discord webhook to send the transcription to. |
| --list_microphones | List available microphones and exit. |
| --set_microphone | Set the default microphone to use. You can set the name or its ID number from the list. |
| --microphone_enabled | Enables microphone usage. Add true after the flag. |
| --auto_language_lock | Automatically lock the language based on the detected language after 5 detections. Enables automatic language locking. Will help reduce latency. Use this flag if you are using non-English and if you do not know the current spoken language. |
| --model_dir | Default location is the "model" folder. You can use this argument to change the location. |
| --use_finetune | Use the fine-tuned model. This will increase accuracy, but will also increase latency. Additional VRAM/RAM usage is required. ⚠️ The fine-tuned model is being retrained. The command flag is useless in the current code. |
| --no_log | Shows only the last thing translated/transcribed rather than a log-style list. |
| --updatebranch | Which branch of the repo to check for updates. Default is master; choices are master, dev-testing, and bleeding-under-work. To turn off update checks, use disable. bleeding-under-work is basically the latest changes and can break at any time. |
| --keep_temp | Keeps audio files in the out folder. This will take up space over time though. |
| --portnumber | Set the port number for the web server. If no number is set then the web server will not start. |
| --retry | Retries translations and transcription if they fail. |
| --about | Shows information about the app. |
| --save_transcript | Saves the transcript to a text file. |
| --save_folder | Set the folder to save the transcript to. |
| --stream | Stream audio from an HLS stream. |
| --stream_language | Language of the stream. Default is English. |
| --stream_target_language | Language to translate the stream to. Default is English. Needed for --stream_transcribe. |
| --stream_translate | Translate the stream. |
| --stream_transcribe | Transcribe the stream to a different language. Use --stream_target_language to change the output. |
| --stream_original_text | Show the detected original text. |
| --stream_chunks | How many chunks to split the stream into. Default is 5; a value between 3 and 5 is recommended. YouTube streams should be 1 or 2, Twitch should be 5 to 10. The higher the number, the more accurate, but also the slower and more delayed the stream translation and transcription will be. |
| --cookies | Cookies file name, such as twitch, youtube, twitchacc1, twitchacczed. |
| --makecaptions | Set the program to captions mode. Requires --file_input, --file_output, and --file_output_name. |
| --file_input | Location of the input file to make captions for. Almost all video/audio formats are supported (uses ffmpeg). |
| --file_output | Location of the folder to export the captions to. |
| --file_output_name | File name to export as, without any extension. |
| --ignorelist | Usage is --ignorelist "C:\quoted\path\to\wordlist.txt" |
| --condition_on_previous_text | Helps keep the model from repeating itself, but may slow down the process. |
| --remote_hls_password_id | Password ID for the web server, usually something like 'id' or 'key'. The program's default is key, so when asked for the id/password, Synthalingua uses key=000000, where key is the ID and 000000 stands for the password (16 characters long). |
| --remote_hls_password | Password for the HLS web server. |
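To illustrate how flags like these are typically declared with argparse, here is a minimal sketch covering a handful of them. It is not the project's actual parser: the function name build_parser is hypothetical, and the case-insensitive handling of --ram is my assumption, made because the table lists "4GB" while the examples pass 12gb.

```python
import argparse


def build_parser():
    """Build an illustrative parser for a small subset of the flags above."""
    parser = argparse.ArgumentParser(description="Illustrative subset of the flags")
    # type=str.lower runs before the choices check, so "12GB" and "12gb" both pass.
    parser.add_argument("--ram", type=str.lower, default="4gb",
                        choices=["1gb", "2gb", "4gb", "6gb", "12gb"],
                        help="Model size to load")
    parser.add_argument("--energy_threshold", type=int, default=100,
                        help="Microphone energy level, 1 to 1000")
    parser.add_argument("--record_timeout", type=float, default=2,
                        help="Seconds of real-time recording")
    parser.add_argument("--phrase_timeout", type=float, default=1,
                        help="Seconds of silence before a new line starts")
    parser.add_argument("--translate", action="store_true",
                        help="Translate the transcription to English")
    parser.add_argument("--language", default=None,
                        help="Source language (ISO 639-1 code or English name)")
    parser.add_argument("--device", default="cuda", choices=["cpu", "cuda"],
                        help="Device to run the model on")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Passing a value outside the listed choices, for example --ram 8gb, makes argparse exit with a usage error.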

Things to note!

Word Block List

With the flag --ignorelist you can now load a list of phrases or words to ignore in the API output and subtitle window. This list is already filled with common phrases the AI will think it heard. You can adjust this list as you please or add more words or phrases to it.
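As a rough illustration of how such a block list could be applied, the sketch below filters output lines against a list of phrases. The file layout (one phrase per line), the case-insensitive whole-line match, and the function names are all assumptions for the example; the tool's real matching rules may differ.

```python
from pathlib import Path


def parse_ignore_list(text):
    """Turn file contents into a set of lowercase phrases, skipping blank lines."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def load_ignore_list(path):
    """Read the word list file (assumed: UTF-8, one phrase per line)."""
    return parse_ignore_list(Path(path).read_text(encoding="utf-8"))


def is_ignored(output_line, ignore_list):
    """True when the whole output line matches an ignored phrase."""
    return output_line.strip().lower() in ignore_list


def filter_output(lines, ignore_list):
    """Keep only the lines that are not on the ignore list."""
    return [line for line in lines if not is_ignored(line, ignore_list)]
```

With an ignore list containing "thank you for watching", filter_output would drop a line that reads "Thank you for watching" and keep everything else.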

Cookies

Some streams may require cookies to be set. You'll need to save the cookies in Netscape format as a .txt file in the cookies folder. If the folder doesn't exist, create it. You can save cookies using https://cookie-editor.com/ or any other cookie editor, but they must be in Netscape format.

Example usage: --cookies twitchacc1. Do NOT include the .txt file extension.

Whatever you named the text file in the cookies folder, you'll need to use that name as the argument.
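The naming rule above, and a quick check that a saved file really is in Netscape format, can be sketched as follows. The helper names are hypothetical and this is not how Synthalingua itself reads cookies; the format check relies on MozillaCookieJar from Python's standard library, which loads Netscape-format cookie files.

```python
from http.cookiejar import LoadError, MozillaCookieJar
from pathlib import Path


def cookie_file_for(name, cookies_dir="cookies"):
    """Map a --cookies argument such as 'twitchacc1' to cookies/twitchacc1.txt."""
    if name.lower().endswith(".txt"):
        raise ValueError("pass the cookie name without the .txt extension")
    return Path(cookies_dir) / f"{name}.txt"


def is_netscape_cookie_file(path):
    """Return True if the file exists and parses as a Netscape-format cookie file."""
    jar = MozillaCookieJar()
    try:
        # Keep session and expired cookies so only the file format is being judged.
        jar.load(str(path), ignore_discard=True, ignore_expires=True)
    except (LoadError, OSError):
        return False
    return True
```

For example, is_netscape_cookie_file(cookie_file_for("twitchacc1")) returns False if the file is missing or was exported in another format such as JSON.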

Web Server

With the command flag --portnumber 4000, you can use query parameters like ?showoriginal, ?showtranslation, and ?showtranscription to show specific elements. You can mix the query parameters to show several elements at once. If any other query parameter is used or no query parameters are specified, all elements will be shown by default. You can choose a number other than 4000 if you want.
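A small sketch of how such URLs could be assembled is shown below. The host localhost, the root path, and joining several parameters with & are assumptions for illustration, and overlay_url is a hypothetical helper, not part of the tool.

```python
ALLOWED_ELEMENTS = ("showoriginal", "showtranslation", "showtranscription")


def overlay_url(port, *elements, host="localhost"):
    """Build a web server URL that shows only the requested elements.

    With no elements the bare URL is returned, which shows everything.
    """
    for element in elements:
        if element not in ALLOWED_ELEMENTS:
            raise ValueError(f"unknown element: {element!r}")
    base = f"http://{host}:{port}/"
    if not elements:
        return base
    # Assumed convention: value-less query parameters joined with '&'.
    return base + "?" + "&".join(elements)


if __name__ == "__main__":
    print(overlay_url(4000))
    print(overlay_url(4000, "showoriginal", "showtranslation"))
```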

For example:

Examples

Please note, make sure you edit the livetranslation.bat/livetranslation.bash file to change the settings. If you do not, it will use the default settings.

This will create captions using the 12gb option and save them to the Downloads folder.

Please note: captions will only be in English (model limitation), though you can always use other programs to translate them into other languages.

python transcribe_audio.py --ram 12gb --makecaptions --file_input="C:\Users\username\Downloads\430796208_935901281333537_8407224487814569343_n.mp4" --file_output="C:\Users\username\Downloads" --file_output_name="430796208_935901281333537_8407224487814569343_n" --language Japanese --device cuda

You have a 12gb GPU and want to stream the audio from a live stream https://www.twitch.tv/somestreamerhere and want to translate it to English. You can run the following command:

python transcribe_audio.py --ram 12gb --stream_translate --stream_language Japanese --stream https://www.twitch.tv/somestreamerhere

Stream Sources from YouTube and Twitch are supported. You can also use any other stream source that supports HLS/m3u8.

You have a GPU with 6GB of memory and you want to use the Japanese model, translate the transcription to English, send the transcription to a Discord channel, and set the energy threshold to 300. You can run the following command:

python transcribe_audio.py --ram 6gb --translate --language ja --discord_webhook "https://discord.com/api/webhooks/1234567890/1234567890" --energy_threshold 300
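For context on what sending text to a Discord webhook involves, here is a minimal standard-library sketch. It is not Synthalingua's own code: the function names and the User-Agent string are made up for the example. It relies on Discord webhooks accepting a JSON body with a content field of at most 2000 characters.

```python
import json
import urllib.request

DISCORD_CONTENT_LIMIT = 2000  # maximum length of the "content" field


def build_webhook_request(webhook_url, text):
    """Prepare (but do not send) a POST request carrying the transcription text."""
    payload = {"content": text[:DISCORD_CONTENT_LIMIT]}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Hypothetical client name; some services reject the default urllib agent.
            "User-Agent": "synthalingua-example",
        },
        method="POST",
    )


def send_to_discord(webhook_url, text, timeout=10):
    """Send the text and return the HTTP status code."""
    request = build_webhook_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

The webhook URL in the command above is a placeholder, so send_to_discord only succeeds once it is replaced with a real webhook URL from your own server.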

When choosing ram, you can only choose 1gb, 2gb, 4gb, 6gb, 12gb. There are no in-betweens.
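If you are unsure which value to pass, one simple approach is to round your available memory down to the nearest allowed choice. The sketch below does that. pick_ram_choice is a hypothetical helper, and treating the --ram value as something that should fit within your available memory is an assumption, not documented behavior.

```python
RAM_CHOICES_GB = (1, 2, 4, 6, 12)


def pick_ram_choice(available_gb):
    """Return the largest allowed --ram value that fits in available_gb."""
    fitting = [choice for choice in RAM_CHOICES_GB if choice <= available_gb]
    if not fitting:
        raise ValueError(f"{available_gb} GB is below the smallest choice (1gb)")
    return f"{max(fitting)}gb"


if __name__ == "__main__":
    # An 8 GB card falls between two choices, so it rounds down to 6gb.
    print(pick_ram_choice(8))
```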

You have a 12gb GPU and you want to translate to Spanish from English. You can run the following command:

python transcribe_audio.py --ram 12gb --transcribe --target_language Spanish --language en

Let's say you have multiple audio devices and you want to use one that is not the default. You can run the following command:

python transcribe_audio.py --list_microphones

This command will list all audio devices and their indexes. You can then use the name or index to set the default audio device. For example, if you want to use the second audio device, you can run the following command to set the device to listen to:

python transcribe_audio.py --set_microphone "Realtek Audio (2- High Definiti"

Please note the quotes around the device name. They are required to prevent errors. Some names may be cut off; copy exactly what is in the quotes of the listed devices.

For example, let's say I have these devices:

Microphone with name "Microsoft Sound Mapper - Input" found, the device index is 1
Microphone with name "VoiceMeeter VAIO3 Output (VB-Au" found, the device index is 2
Microphone with name "Headset (B01)" found, the device index is 3
Microphone with name "Microphone (Realtek USB2.0 Audi" found, the device index is 4
Microphone with name "Microphone (NVIDIA Broadcast)" found, the device index is 5

To set the device to listen to, I would put:

python transcribe_audio.py --set_microphone "Microphone (Realtek USB2.0 Audi"

-or-

python transcribe_audio.py --set_microphone 4
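The name-or-index lookup described above can be sketched like this. It parses listing lines in the format shown in the example and resolves a value the way --set_microphone is described as accepting it. Both functions are hypothetical helpers written for illustration, not the tool's implementation.

```python
import re

LISTING_PATTERN = re.compile(
    r'Microphone with name "(.*)" found, the device index is (\d+)'
)


def parse_microphone_listing(text):
    """Turn --list_microphones style output into a {index: name} dict."""
    devices = {}
    for line in text.splitlines():
        match = LISTING_PATTERN.search(line)
        if match:
            devices[int(match.group(2))] = match.group(1)
    return devices


def resolve_microphone(spec, devices):
    """Resolve a name or an index string to a device index."""
    if spec.isdigit():
        index = int(spec)
        if index not in devices:
            raise ValueError(f"no microphone with index {index}")
        return index
    for index, name in devices.items():
        # Names must match exactly, including any cut-off ending.
        if name == spec:
            return index
    raise ValueError(f"no microphone named {spec!r}")
```

Given the five devices listed above, both "4" and "Microphone (Realtek USB2.0 Audi" resolve to index 4.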

Troubleshooting

If you encounter any issues with the tool, here are some common problems and their solutions:

Additional Information

Video Demonstration

Command line arguments used. --ram 6gb --record_timeout 2 --language ja --energy_threshold 500

Command line arguments used. --ram 12gb --record_timeout 5 --language id --energy_threshold 500