A command-line tool to transform voices in (YouTube) videos with additional translation capabilities. [^1]
Hint: If you are interested in state-of-the-art voice solutions, please also have a look at Linguflex. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.
https://github.com/KoljaB/TurnVoice/assets/7604638/e0d9071c-0670-44bd-a6d5-4800e9f6190c
Voice Transformation
Turn voices with the free Coqui TTS at no operating costs (supports voice cloning, 58 voices included).
Voice Variety
Support for popular TTS engines like Elevenlabs, OpenAI TTS, or Azure for more voices. [^7]
Translation
Translates videos at zero cost, for example from English to Chinese, powered by the free deep-translator.
Change Speaking Styles (AI powered)
Make every spoken sentence delivered in a custom speaking style for a unique flair using prompting. [^6]
Full Rendering Control
Precise rendering control by customizing the sentence text, timings, and voice selection.
💡 Tip: the Renderscript Editor makes this step easy
Local Video Processing
Process any local video files.
Background Audio Preservation
Keeps the original background audio intact.
Discover more in the release notes.
NVIDIA graphics card with >8 GB VRAM recommended; tested on Python 3.11.4 / Windows 10.
NVIDIA CUDA Toolkit 12.1 installed
ffmpeg command-line utility installed [^3]
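As a quick sanity check that both prerequisites are reachable from the command line (a sketch; the exact version output will differ on your system):

```
nvcc --version     # should report release 12.1 if the CUDA Toolkit 12.1 is installed
ffmpeg -version    # any recent build should do
```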
[!TIP] Set your HF token [^4] with `setx HF_ACCESS_TOKEN "your_token_here"`
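On Linux/macOS, where `setx` is not available, a minimal alternative is exporting the token in your current shell (add it to your shell profile to make it permanent):

```
export HF_ACCESS_TOKEN="your_token_here"
```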
pip install turnvoice
[!TIP] For faster rendering with GPU prepare your CUDA environment after installation:
For CUDA 12.1
pip install torch==2.3.1+cu121 torchaudio==2.3.1+cu121 --index-url https://download.pytorch.org/whl/cu121
Rendering time is high even with a strong GPU; running this script on CPU only, while possible, is therefore not recommended.
Note: Do not use torch versions >= 2.4 together with cuDNN 9.0, because faster_whisper (CTranslate2) does not support this combination yet.
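Before starting a long render you may want to confirm that the CUDA build of torch is actually active; a minimal check looks like this:

```
# Should print True followed by your GPU name; False means the CPU-only build is installed
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')"
```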
turnvoice [-i] <YouTube URL|ID|Local File> [-l] <Translation Language> -e <Engine(s)> -v <Voice(s)> -o <Output File>
Submit a string to the 'voice' parameter for each speaker voice you wish to use. If you specify engines, the voices will be assigned to these engines in the order they are listed. Should there be more voices than engines, the first engine will be used for the excess voices. In the absence of a specified engine, the Coqui engine will be used as the default. If no voices are defined, a default voice will be selected for each engine.
Arthur Morgan narrating a cooking tutorial:
turnvoice -i AmC9SmCBUj4 -v arthur.wav -o cooking_with_arthur.mp4
[!NOTE] Requires the cloning voice file (e.g., arthur.wav or .json) in the same directory (you can find one in the tests directory).
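For multiple speakers, engines and voices pair up as described above. A sketch of such a call, assuming multiple values are passed space-separated and using placeholder voice names (arthur.wav is the cloning file from the tests directory, "Josh" only stands in for an Elevenlabs voice name):

```
# First speaker rendered with coqui/arthur.wav, second with elevenlabs/Josh
turnvoice -i AmC9SmCBUj4 -e coqui elevenlabs -v arthur.wav Josh
```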
Prepare a script with transcription, speaker diarization [^5] (and optionally translation or prompting) using:
turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c --prepare
Translation and prompts should be applied in this preparation step. Engines or voices come later in the render step.
Render the refined script to generate the final video using:
turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c --render <path_to_script>
Adjust the script path in the displayed CLI command (the editor cannot read that information from the browser).
Assign engines and voices to each speaker track with the -e and -v parameters.
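Putting both steps together, a render call could look like the sketch below (the script path is a placeholder for whatever the prepare step wrote; engine and voice pairing follows the rules described above):

```
# Render the prepared script, assigning one engine and voice per speaker track
turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c --render <path_to_script> -e coqui elevenlabs -v arthur.wav Josh
```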
- `-i`, `--in`: Input video. Accepts a YouTube video URL or ID, or a path to a local video file.
- `-l`, `--language`: Language for translation. Coqui synthesis supports: en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh, ja, hu, ko. Omit to retain the original video language.
- `-il`, `--input_language`: Language code for transcription; set it if automatic detection fails.
- `-v`, `--voice`: Voices for synthesis. Accepts multiple values to replace more than one speaker.
- `-o`, `--output_video`: Filename for the final output video (default: 'final_cut.mp4').
- `-a`, `--analysis`: Print transcription and speaker analysis without synthesizing or rendering the video.
- `-from`: Time to start processing the video from.
- `-to`: Time to stop processing the video at.
- `-e`, `--engine`: Engine(s) to synthesize with. Can be coqui, elevenlabs, azure, openai or system. Accepts multiple values, linked to the submitted voices.
- `-s`, `--speaker`: Speaker number to be transformed.
- `-snum`, `--num_speakers`: Helps diarization. Specify the exact number of speakers in the video if you know it in advance.
- `-smin`, `--min_speakers`: Helps diarization. Specify the minimum number of speakers in the video if you know it in advance.
- `-smax`, `--max_speakers`: Helps diarization. Specify the maximum number of speakers in the video if you know it in advance.
- `-dd`, `--download_directory`: Directory for saving downloaded files (default: 'downloads').
- `-sd`, `--synthesis_directory`: Directory for saving synthesized audio files (default: 'synthesis').
- `-ex`, `--extract`: Enables extraction of audio from the video file. Otherwise the audio is downloaded from the internet (default).
- `-c`, `--clean_audio`: Removes the original audio from the final video, resulting in clean synthesis.
- `-tf`, `--timefile`: Define timestamp file(s) for processing (functions like multiple --from/--to commands).
- `-p`, `--prompt`: Define a prompt to apply a style change to sentences, such as "speaking style of captain jack sparrow". [^6]
- `-prep`, `--prepare`: Writes the full script with speaker analysis, sentence transformation and translation, but does not perform synthesis or rendering. Can be continued later with --render.
- `-r`, `--render`: Takes a full script and only performs synthesis and rendering on it, without speaker analysis, sentence transformation or translation.
- `-faster`, `--use_faster`: Use faster_whisper for transcription, in case stable_whisper throws OOM errors or delivers suboptimal results. (Optional)
- `-model`, `--model`: Transcription model to be used. Defaults to large-v2. Can be 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2', 'large-v3', or 'large'. (Optional)
`-i` and `-l` can be used as both positional and optional arguments.
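As a combined example, the sketch below processes a local file (the filename is a placeholder), translates it to German, removes the original audio and writes a custom output name:

```
turnvoice -i my_video.mp4 -l de -e coqui -v arthur.wav -c -o video_in_german.mp4
```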
Translate a video into another language using the -l parameter.
For example, to translate into Chinese you could use:
turnvoice https://www.youtube.com/watch?v=ZTH771HIhpg -l zh-CN -v daisy
Output Video
💡 Tip: In the tests folder you'll find a voice "chinese.json" trained on Chinese phonemes.
Coqui engine is the default engine if no other engine is specified with the -e parameter.
[!NOTE] To use Elevenlabs voices you need the API Key stored in env variable ELEVENLABS_API_KEY
All voices are synthesized with the multilingual-v1 model.
[!CAUTION] Elevenlabs is a pricey API. Focus on short videos, and don't let a work-in-progress script like this run unattended on a pay-per-use API; a bug at the end of a long, expensive rendering process would be very annoying.
[!TIP] Test rendering with a free engine like coqui first before using pricey ones.
[!NOTE] To use OpenAI TTS voices you need the API Key stored in env variable OPENAI_API_KEY
[!NOTE] To use Azure voices you need the API Key for SpeechService resource in AZURE_SPEECH_KEY and the region identifier in AZURE_SPEECH_REGION
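The keys can be stored the same way as the HF token above; the values shown are placeholders for your own keys (on Linux/macOS use `export` instead of `setx`):

```
setx ELEVENLABS_API_KEY "your_elevenlabs_key"
setx OPENAI_API_KEY "your_openai_key"
setx AZURE_SPEECH_KEY "your_azure_speech_key"
setx AZURE_SPEECH_REGION "your_region_identifier"
```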
If you run into "Could not locate cudnn_ops_infer64_8.dll", this is caused by faster_whisper not supporting the combination of cuDNN 9 (or newer) and PyTorch 2.4 or newer.
To solve:
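Given the note above, a fix that should work for the tested CUDA 12.1 setup is pinning torch and torchaudio below 2.4, for example:

```
pip install torch==2.3.1+cu121 torchaudio==2.3.1+cu121 --index-url https://download.pytorch.org/whl/cu121
```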
To transform only a specific speaker, first perform a speaker analysis with the -a parameter:
turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM -a
Then select a speaker from the list with the -s parameter:
turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM -s 2
TurnVoice is proudly under the Coqui Public Model License 1.0.0.
Share your funniest or most creative TurnVoice creations with me!
And if you've got a cool feature idea or just want to say hi, drop me a line on
If you like the repo, please leave a star.
✨ 🌟 ✨
[^1]: State is work-in-progress (early pre-alpha). Please expect CLI API changes to come and sorry in advance if anything does not work as expected.
Developed on Python 3.11.4 under Win 10.
[^2]: Rubberband is needed for pitch-preserving time-stretching of audio, so the synthesis fits into the original time window.
[^3]: ffmpeg is needed to convert mp3 files into wav.
[^4]: Huggingface access token is needed to download the speaker diarization model for identifying speakers with pyannote.audio.
[^5]: Speaker diarization is performed with the pyannote.audio default HF implementation on the vocals track split from the original audio.
[^6]: Incurs costs. Uses the gpt-4-1106-preview model and needs an OpenAI API key stored in the environment variable OPENAI_API_KEY.
[^7]: Incurs costs. Elevenlabs is pricey; OpenAI TTS and Azure are more affordable. Needs API keys stored in environment variables; see the engine information for details.