KoljaB / TurnVoice

Voice Transformation for Videos. 🎤👄🎬

TurnVoice

A command-line tool to transform voices in (YouTube) videos with additional translation capabilities. [^1]

Hint: If you are interested in state-of-the-art voice solutions, please also have a look at Linguflex. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.

https://github.com/KoljaB/TurnVoice/assets/7604638/e0d9071c-0670-44bd-a6d5-4800e9f6190c

Features

Discover more in the release notes.

Prerequisites

An NVIDIA graphics card with more than 8 GB VRAM is recommended. Tested on Python 3.11.4 / Windows 10.

Installation

pip install turnvoice

[!TIP] For faster rendering with GPU prepare your CUDA environment after installation:

For CUDA 12.1
pip install torch==2.3.1+cu121 torchaudio==2.3.1+cu121 --index-url https://download.pytorch.org/whl/cu121

Rendering time is high even with a strong GPU. Running this script on CPU only may be possible, but it is not recommended.

Note: Do not use torch versions >= 2.4 together with cuDNN 9.0 because faster_whisper (CTranslate2) does not support this combination yet.

Usage

turnvoice [-i] <YouTube URL|ID|Local File> [-l] <Translation Language> -e <Engine(s)> -v <Voice(s)> -o <Output File>

Submit a string to the 'voice' parameter for each speaker voice you wish to use. If you specify engines, the voices will be assigned to these engines in the order they are listed. Should there be more voices than engines, the first engine will be used for the excess voices. In the absence of a specified engine, the Coqui engine will be used as the default. If no voices are defined, a default voice will be selected for each engine.
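The assignment rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not TurnVoice's actual code: `assign_voices` and `DEFAULT_ENGINE` are hypothetical names, and `None` stands for "let the engine pick its default voice". The text does not say what happens when there are more engines than voices, so the sketch simply leaves the extra engines unpaired.

```python
from typing import List, Optional, Tuple

DEFAULT_ENGINE = "coqui"  # default engine named in the text above


def assign_voices(engines: List[str],
                  voices: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each voice with an engine (hypothetical helper, illustration only)."""
    # No engine given: fall back to the default engine.
    engines = engines or [DEFAULT_ENGINE]

    # No voice given: every engine uses its own default voice (None).
    if not voices:
        return [(engine, None) for engine in engines]

    pairs = []
    for index, voice in enumerate(voices):
        # Voices beyond the number of engines go to the first engine.
        engine = engines[index] if index < len(engines) else engines[0]
        pairs.append((engine, voice))
    return pairs


print(assign_voices(["coqui", "openai"], ["arthur.wav", "shimmer", "daisy"]))
# [('coqui', 'arthur.wav'), ('openai', 'shimmer'), ('coqui', 'daisy')]
```

The third voice falls back to the first engine because only two engines were listed.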

Example Command:

Arthur Morgan narrating a cooking tutorial:

turnvoice -i AmC9SmCBUj4 -v arthur.wav -o cooking_with_arthur.mp4

[!NOTE] Requires the voice cloning file (e.g., arthur.wav or .json) in the directory you run the command from. You can find one in the tests directory.

Workflow

Preparation

Prepare a script with transcription, speaker diarization (and optionally translation or prompting) using:

turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c --prepare

Translation and prompts should be applied in this preparation step. Engines and voices are assigned later, in the render step.

Renderscript Editor


  1. Open script
    Open the editor.html file. Click the file open button and navigate to the folder you started turnvoice from. Open the download folder, then the folder named after the video, and open the file full_script.txt.
  2. Edit
    The editor visualizes the transcript and speaker diarization results and starts playing the original video. While it plays, verify the texts, start times and speaker assignments, and adjust them where the detection went wrong.
  3. Save
    Save the script. Remember the path to the file.
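
If the script is hard to find by clicking through folders, a small helper can list the candidates. This is a minimal sketch that assumes the layout described in step 1, with a folder literally named `download` below the start directory and one subfolder per video. `find_scripts` is a hypothetical helper and not part of TurnVoice.

```python
from pathlib import Path
from typing import List


def find_scripts(start_dir: str) -> List[Path]:
    """Return every full_script.txt found in <start_dir>/download/<video name>/."""
    download_dir = Path(start_dir) / "download"  # assumed folder name
    return sorted(download_dir.glob("*/full_script.txt"))


if __name__ == "__main__":
    # Run this from the folder you started turnvoice from.
    for script in find_scripts("."):
        print(script.resolve())
```

The printed absolute path is also what the render step expects as `<path_to_script>`.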

Rendering

Render the refined script to generate the final video using:

turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c --render <path_to_script>

Adjust the path in the displayed CLI command (the editor cannot read the file path from the browser).

Assign engines and voices to each speaker track with the -e and -v parameters.
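
When the prepare and render steps are driven from a script, the two commands can be assembled as argument lists. The sketch below uses only flags that appear in this README (`--prepare`, `--render`, `-e`, `-v`). It assumes that several engines or voices follow a single flag separated by spaces, which the usage line suggests but does not spell out. `prepare_command` and `render_command` are hypothetical helper names.

```python
from typing import List, Sequence


def prepare_command(source: str) -> List[str]:
    """Build the argument list for the preparation step."""
    return ["turnvoice", source, "--prepare"]


def render_command(source: str, script_path: str,
                   engines: Sequence[str] = (),
                   voices: Sequence[str] = ()) -> List[str]:
    """Build the argument list for the render step."""
    command = ["turnvoice", source, "--render", script_path]
    if engines:
        # Assumption: multiple engines are listed after one -e flag.
        command += ["-e", *engines]
    if voices:
        # Assumption: multiple voices are listed after one -v flag.
        command += ["-v", *voices]
    return command


# To actually run a step, pass the list to subprocess, for example:
#   import subprocess
#   subprocess.run(prepare_command("cOg4J1PxU0c"), check=True)
```

Building a list instead of a single string avoids quoting problems with paths that contain spaces.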

Parameters

-i and -l can be used as both positional and optional arguments.
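
For readers curious how one value can be accepted both ways, here is one way to get that behavior with Python's argparse. It is a sketch of the general technique, not TurnVoice's actual parser. The destination names and the `resolve` helper are hypothetical.

```python
import argparse
from typing import List, Optional, Tuple


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="turnvoice-sketch")
    # Optional positional slots (nargs="?" makes each of them optional).
    parser.add_argument("source_pos", nargs="?", default=None)
    parser.add_argument("language_pos", nargs="?", default=None)
    # The same two values as flags.
    parser.add_argument("-i", dest="source_opt", default=None)
    parser.add_argument("-l", dest="language_opt", default=None)
    return parser


def resolve(argv: List[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (source, language), preferring flags over positionals."""
    args = build_parser().parse_args(argv)
    # Positionals that were actually supplied, in order.
    leftover = [p for p in (args.source_pos, args.language_pos) if p is not None]
    source = args.source_opt
    if source is None and leftover:
        source = leftover.pop(0)
    language = args.language_opt
    if language is None and leftover:
        language = leftover.pop(0)
    return source, language


print(resolve(["AmC9SmCBUj4", "de"]))
print(resolve(["-i", "AmC9SmCBUj4", "-l", "de"]))
```

Both calls resolve to the same source and language.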

Translation

Translate a video into another language using the -l parameter.

For example, to translate into Chinese you could use:

turnvoice https://www.youtube.com/watch?v=ZTH771HIhpg -l zh-CN -v daisy

Output Video
💡 Tip: In the tests folder you will find a voice "chinese.json" trained on Chinese phonemes.

Languages for Coqui Engine

| Shortcut | Language |
|----------|-------------|
| ar | Arabic |
| cs | Czech |
| de | German |
| en | English |
| es | Spanish |
| fr | French |
| it | Italian |
| hu | Hungarian |
| ja | Japanese |
| ko | Korean |
| nl | Dutch |
| pl | Polish |
| pt | Portuguese |
| ru | Russian |
| tr | Turkish |
| zh-cn | Chinese |

Languages for other engines

Make sure to select a voice supporting the language in the Azure and System engines.

| Shortcut | Language |
|----------|------------------------|
| af | Afrikaans |
| sq | Albanian |
| am | Amharic |
| ar | Arabic |
| hy | Armenian |
| as | Assamese |
| ay | Aymara |
| az | Azerbaijani |
| bm | Bambara |
| eu | Basque |
| be | Belarusian |
| bn | Bengali |
| bho | Bhojpuri |
| bs | Bosnian |
| bg | Bulgarian |
| ca | Catalan |
| ceb | Cebuano |
| ny | Chichewa |
| zh-CN | Chinese (Simplified) |
| zh-TW | Chinese (Traditional) |
| co | Corsican |
| hr | Croatian |
| cs | Czech |
| da | Danish |
| dv | Dhivehi |
| doi | Dogri |
| nl | Dutch |
| en | English |
| eo | Esperanto |
| et | Estonian |
| ee | Ewe |
| tl | Filipino |
| fi | Finnish |
| fr | French |
| fy | Frisian |
| gl | Galician |
| ka | Georgian |
| de | German |
| el | Greek |
| gn | Guarani |
| gu | Gujarati |
| ht | Haitian Creole |
| ha | Hausa |
| haw | Hawaiian |
| iw | Hebrew |
| hi | Hindi |
| hmn | Hmong |
| hu | Hungarian |
| is | Icelandic |
| ig | Igbo |
| ilo | Ilocano |
| id | Indonesian |
| ga | Irish |
| it | Italian |
| ja | Japanese |
| jw | Javanese |
| kn | Kannada |
| kk | Kazakh |
| km | Khmer |
| rw | Kinyarwanda |
| gom | Konkani |
| ko | Korean |
| kri | Krio |
| ku | Kurdish (Kurmanji) |
| ckb | Kurdish (Sorani) |
| ky | Kyrgyz |
| lo | Lao |
| la | Latin |
| lv | Latvian |
| ln | Lingala |
| lt | Lithuanian |
| lg | Luganda |
| lb | Luxembourgish |
| mk | Macedonian |
| mai | Maithili |
| mg | Malagasy |
| ms | Malay |
| ml | Malayalam |
| mt | Maltese |
| mi | Maori |
| mr | Marathi |
| mni-Mtei | Meiteilon (Manipuri) |
| lus | Mizo |
| mn | Mongolian |
| my | Myanmar |
| ne | Nepali |
| no | Norwegian |
| or | Odia (Oriya) |
| om | Oromo |
| ps | Pashto |
| fa | Persian |
| pl | Polish |
| pt | Portuguese |
| pa | Punjabi |
| qu | Quechua |
| ro | Romanian |
| ru | Russian |
| sm | Samoan |
| sa | Sanskrit |
| gd | Scots Gaelic |
| nso | Sepedi |
| sr | Serbian |
| st | Sesotho |
| sn | Shona |
| sd | Sindhi |
| si | Sinhala |
| sk | Slovak |
| sl | Slovenian |
| so | Somali |
| es | Spanish |
| su | Sundanese |
| sw | Swahili |
| sv | Swedish |
| tg | Tajik |
| ta | Tamil |
| tt | Tatar |
| te | Telugu |
| th | Thai |
| ti | Tigrinya |
| ts | Tsonga |
| tr | Turkish |
| tk | Turkmen |
| ak | Twi |
| uk | Ukrainian |
| ur | Urdu |
| ug | Uyghur |
| uz | Uzbek |
| vi | Vietnamese |
| cy | Welsh |
| xh | Xhosa |
| yi | Yiddish |
| yo | Yoruba |
| zu | Zulu |

Coqui Engine

Coqui engine is the default engine if no other engine is specified with the -e parameter.

To use voices from Coqui:

#### Voices (-v parameter)

You may either use one of the predefined coqui voices or clone your own voice.

##### Predefined Voices

To use a predefined voice submit the name of one of the following voices:

'Claribel Dervla', 'Daisy Studious', 'Gracie Wise', 'Tammie Ema', 'Alison Dietlinde', 'Ana Florence', 'Annmarie Nele', 'Asya Anara', 'Brenda Stern', 'Gitta Nikolina', 'Henriette Usha', 'Sofia Hellen', 'Tammy Grit', 'Tanja Adelina', 'Vjollca Johnnie', 'Andrew Chipper', 'Badr Odhiambo', 'Dionisio Schuyler', 'Royston Min', 'Viktor Eka', 'Abrahan Mack', 'Adde Michal', 'Baldur Sanjin', 'Craig Gutsy', 'Damien Black', 'Gilberto Mathias', 'Ilkin Urbano', 'Kazuhiko Atallah', 'Ludvig Milivoj', 'Suad Qasim', 'Torcull Diarmuid', 'Viktor Menelaos', 'Zacharie Aimilios', 'Nova Hogarth', 'Maja Ruoho', 'Uta Obando', 'Lidiya Szekeres', 'Chandra MacFarland', 'Szofi Granger', 'Camilla Holmström', 'Lilya Stainthorpe', 'Zofija Kendrick', 'Narelle Moon', 'Barbora MacLean', 'Alexandra Hisakawa', 'Alma María', 'Rosemary Okafor', 'Ige Behringer', 'Filip Traverse', 'Damjan Chapman', 'Wulf Carlevaro', 'Aaron Dreschner', 'Kumar Dahl', 'Eugenio Mataracı', 'Ferran Simen', 'Xavier Hayasaka', 'Luis Moray', 'Marcos Rudaski'

*💡 Tip: simply write `-v gracie` as also parts of voice names are recognized and it's case-insensitive*

[Samples for every voice](https://github.com/KoljaB/RealtimeTTS/tree/master/tests/coqui_voices)

##### Cloned Voices

Submit path to one or more audiofiles containing 16 bit 24kHz mono source material as reference wavs.

Example:

```
turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c -e coqui -v female.wav
```

#### The Art of Choosing a Reference Wav

- A 24000, 44100 or 22050 Hz 16-bit mono wav file of 10-30 seconds is your golden ticket.
- 24k mono 16 is my default, but I also had voices where I found 44100 32-bit to yield best results
- I test voices [with this tool](https://github.com/KoljaB/RealtimeTTS/blob/master/tests/coqui_test.py) before rendering
- Audacity is your friend for adjusting sample rates. Experiment with frame rates for best results!

#### Fixed TTS Model Download Folder

Keep your models organized! Set `COQUI_MODEL_PATH` to your preferred folder.

Windows example:

```bash
setx COQUI_MODEL_PATH "C:\Downloads\CoquiModels"
```
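
The reference wav guidelines in this section (mono, 16-bit, 22050/24000/44100 Hz, 10-30 seconds) can be checked before rendering with Python's standard `wave` module. This is a minimal sketch with a hypothetical helper name, `check_reference_wav`. Treat its findings as hints rather than hard errors, since 44100 Hz 32-bit material is also reported above to work well for some voices. The `wave` module reads PCM files only, so other wav encodings may raise `wave.Error`.

```python
import wave
from typing import List

ACCEPTED_RATES = {22050, 24000, 44100}  # sample rates named in the guidelines


def check_reference_wav(path: str) -> List[str]:
    """Return deviations from the guidelines (empty list if there are none)."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
        seconds = wav.getnframes() / float(rate)

    issues = []
    if channels != 1:
        issues.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        issues.append(f"expected 16-bit, found {sample_width * 8}-bit")
    if rate not in ACCEPTED_RATES:
        issues.append(f"unusual sample rate: {rate} Hz")
    if not 10 <= seconds <= 30:
        issues.append(f"length {seconds:.1f} s is outside 10-30 s")
    return issues
```

Calling `check_reference_wav("female.wav")` on a file that follows the guidelines returns an empty list.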

Elevenlabs Engine

[!NOTE] To use Elevenlabs voices you need the API Key stored in env variable ELEVENLABS_API_KEY

All voices are synthesized with the multilingual-v1 model.

[!CAUTION] Elevenlabs is a pricy API. Focus on short videos. Don't let a work-in-progress script like this run unattended on a pay-per-use API. Bugs could be very annoying when occurring at the end of a pricy long rendering process.

To use voices from Elevenlabs:

#### Voices (-v parameter)

Submit name(s) of either a generated or predefined voice.

Example:

```
turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c -e elevenlabs -v Giovanni
```

[!TIP] Test rendering with a free engine like coqui first before using pricy ones.

OpenAI Engine

[!NOTE] To use OpenAI TTS voices you need the API Key stored in env variable OPENAI_API_KEY

To use voices from OpenAI:

#### Voice (-v parameter)

Submit the name of the voice. Currently only one voice is supported for OpenAI: alloy, echo, fable, onyx, nova or shimmer.

Example:

```
turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c -e openai -v shimmer
```

Azure Engine

[!NOTE] To use Azure voices you need the API Key for SpeechService resource in AZURE_SPEECH_KEY and the region identifier in AZURE_SPEECH_REGION

To use voices from Azure:

#### Voices (-v parameter)

Submit name(s) of either a generated or predefined voice.

Example:

```
turnvoice https://www.youtube.com/watch?v=BqnAeUoqFAM -e azure -v ChristopherNeural
```
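
The Elevenlabs, OpenAI and Azure engines each read their credentials from the environment variables named in the notes above. Before starting a long or paid render, a quick check like the following sketch can confirm that they are set. `REQUIRED_ENV` and `missing_env_vars` are hypothetical names; only the variable names themselves come from this README.

```python
import os
from typing import List, Mapping

# Environment variables per engine, as listed in the engine notes above.
REQUIRED_ENV = {
    "elevenlabs": ["ELEVENLABS_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "azure": ["AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION"],
}


def missing_env_vars(engine: str,
                     env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty for an engine."""
    return [name for name in REQUIRED_ENV.get(engine, []) if not env.get(name)]


if __name__ == "__main__":
    for engine in REQUIRED_ENV:
        missing = missing_env_vars(engine)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{engine}: {status}")
```

Engines without an entry in the table, such as coqui or system, are reported as needing nothing.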

System Engine

To use system voices:

#### Voices (-v parameter)

Submit name(s) of voices as string.

Example:

```
turnvoice https://www.youtube.com/watch?v=BqnAeUoqFAM -e system -v David
```

What to expect

Source Quality

Troubleshoot

If you run into "Could not locate cudnn_ops_infer64_8.dll", the cause is that faster_whisper (CTranslate2) does not support the combination of cuDNN 9 or later with PyTorch 2.4 or later.
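
To see whether an installation falls into that combination, the version strings can be compared with a few lines of Python. This is a sketch with hypothetical helper names. It uses the thresholds from the installation note (torch >= 2.4 together with cuDNN 9) and takes the cuDNN major version as a plain number that you supply yourself.

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Turn a version string such as '2.3.1+cu121' into (2, 3)."""
    core = version.split("+", 1)[0]  # drop a local suffix like '+cu121'
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def is_unsupported_combo(torch_version: str, cudnn_major: int) -> bool:
    """True for torch >= 2.4 combined with cuDNN major version >= 9."""
    return parse_major_minor(torch_version) >= (2, 4) and cudnn_major >= 9


# The installed torch version string is available as torch.__version__:
#   import torch
#   print(is_unsupported_combo(torch.__version__, cudnn_major=9))
print(is_unsupported_combo("2.3.1+cu121", cudnn_major=8))  # False
```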

To solve:

Pro Tips

How to exchange a single speaker

First perform a speaker analysis with the -a parameter:

turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM -a

Then select a speaker from the list with the -s parameter:

turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM -s 2

License

TurnVoice is proudly under the Coqui Public Model License 1.0.0.

Contact 🤝

Share your funniest or most creative TurnVoice creations with me!

And if you've got a cool feature idea or just want to say hi, drop me a line on

If you like the repo please leave a star
✨ 🌟 ✨

[^1]: State is work-in-progress (early pre-alpha). Please expect CLI API changes to come and sorry in advance if anything does not work as expected. Developed on Python 3.11.4 under Win 10.
[^2]: Rubberband is needed to time-stretch audio while preserving pitch, so that the synthesis fits into its time window.
[^3]: ffmpeg is needed to convert mp3 files into wav.
[^4]: A Huggingface access token is needed to download the speaker diarization model for identifying speakers with pyannote.audio.
[^5]: Speaker diarization is performed with the pyannote.audio default HF implementation on the vocals track split from the original audio.
[^6]: Generates costs. Uses the gpt-4-1106-preview model and needs an OpenAI API key stored in the env variable OPENAI_API_KEY.
[^7]: Generates costs. Elevenlabs is pricy; OpenAI TTS and Azure are affordable. Needs API keys stored in env variables, see engine information for details.