Open dubbing is an AI dubbing system that uses machine learning models to automatically translate and synchronize audio dialogue into different languages. It is designed as a command-line tool.
At the moment, it is purely experimental and an excuse to help me better understand STT, TTS and translation systems combined together.
Areas that we would like to explore:
This video was chosen on purpose to show the strengths and limitations of the system.
Original English video
https://github.com/user-attachments/assets/54c0d37f-0cc8-4ea2-8f8d-fd2d2f4eeccc
Automatic dubbed video in Catalan
https://github.com/user-attachments/assets/99936655-5851-4d0c-827b-f36f79f56190
The supported languages depend on the combination of speech-to-text, translation and text-to-speech systems used. With Coqui TTS, these are the languages supported (I have only tested a few of them):
Supported source languages: Afrikaans, Amharic, Armenian, Assamese, Bashkir, Basque, Belarusian, Bengali, Bosnian, Bulgarian, Burmese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Gujarati, Haitian, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Lingala, Lithuanian, Luxembourgish, Macedonian, Malayalam, Maltese, Maori, Marathi, Modern Greek (1453-), Norwegian Nynorsk, Occitan (post 1500), Panjabi, Polish, Portuguese, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Vietnamese, Welsh, Yoruba, Yue Chinese
Supported target languages: Achinese, Akan, Amharic, Assamese, Awadhi, Ayacucho Quechua, Balinese, Bambara, Bashkir, Basque, Bemba (Zambia), Bengali, Bulgarian, Burmese, Catalan, Cebuano, Central Aymara, Chhattisgarhi, Crimean Tatar, Dutch, Dyula, Dzongkha, English, Ewe, Faroese, Fijian, Finnish, Fon, French, Ganda, German, Guarani, Gujarati, Haitian, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Iloko, Indonesian, Javanese, Kabiyè, Kabyle, Kachin, Kannada, Kazakh, Khmer, Kikuyu, Kinyarwanda, Kirghiz, Korean, Lao, Magahi, Maithili, Malayalam, Marathi, Minangkabau, Modern Greek (1453-), Mossi, North Azerbaijani, Northern Kurdish, Nuer, Nyanja, Odia, Pangasinan, Panjabi, Papiamento, Polish, Portuguese, Romanian, Rundi, Russian, Samoan, Sango, Shan, Shona, Somali, South Azerbaijani, Southwestern Dinka, Spanish, Sundanese, Swahili (individual language), Swedish, Tagalog, Tajik, Tamasheq, Tamil, Tatar, Telugu, Thai, Tibetan, Tigrinya, Tok Pisin, Tsonga, Turkish, Turkmen, Uighur, Ukrainian, Urdu, Vietnamese, Waray (Philippines), Welsh, Yoruba
To install open_dubbing on all platforms:
pip install open_dubbing
If you also want to install Coqui TTS, run:
pip install open_dubbing[coqui]
On Linux you also need to install ffmpeg:
sudo apt install ffmpeg
If you are going to use Coqui TTS, you also need to install espeak-ng:
sudo apt install espeak-ng
On macOS you also need to install ffmpeg:
brew install ffmpeg
If you are going to use Coqui TTS, you also need to install espeak-ng:
brew install espeak-ng
Windows currently works but it has not been tested extensively.
You also need to install ffmpeg for Windows. Make sure that it is in the system path.
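As a quick sanity check before running the tool, the following sketch reports whether the external programs mentioned above can be found on the system path. It uses only the Python standard library; the helper name `find_missing_tools` is illustrative and is not part of open_dubbing. Note that espeak-ng only matters if you plan to use Coqui TTS.

```python
import shutil


def find_missing_tools(tools=("ffmpeg", "espeak-ng")):
    """Return the names in `tools` that cannot be found on the system path."""
    # shutil.which returns None when no executable with that name is found.
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on the system path:", ", ".join(missing))
    else:
        print("ffmpeg and espeak-ng are available.")
```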
Accept the pyannote/segmentation-3.0 user conditions
Accept the pyannote/speaker-diarization-3.1 user conditions
Create an access token at hf.co/settings/tokens.
Quick start
open-dubbing --input_file video.mp4 --target_language=cat --hugging_face_token=TOKEN
Where:
By default, the source language is predicted from the first 30 seconds of the video. If this does not work (e.g. there is only music at the beginning), use the --source_language parameter to specify the source language using ISO 639-3 language codes (e.g. 'eng' for English).
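If you call the tool from a script, a small helper can assemble the arguments and add the source language only when automatic detection is not wanted. This is a minimal sketch: `build_command` is a hypothetical helper, not part of open_dubbing, and it assumes the source language flag is spelled `--source_language`, by analogy with the other flags in the quick start.

```python
def build_command(input_file, target_language, token, source_language=None):
    """Build the open-dubbing argument list from the flags described above."""
    command = [
        "open-dubbing",
        "--input_file", input_file,
        f"--target_language={target_language}",
        f"--hugging_face_token={token}",
    ]
    if source_language is not None:
        # Assumed flag spelling; skips detection by naming the language (ISO 639-3).
        command.append(f"--source_language={source_language}")
    return command


if __name__ == "__main__":
    # "video.mp4" and "TOKEN" are placeholders, as in the quick start.
    cmd = build_command("video.mp4", "cat", "TOKEN", source_language="eng")
    print(" ".join(cmd))
    # To actually run the tool, pass `cmd` to subprocess.run(cmd, check=True).
```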
To get a list of available options:
open-dubbing --help
There are cases where you want to manually adjust the automatically generated dubbing text, the voice used or the timings.
After you have run open-dubbing, the intermediate files and the final dubbed file are in the selected output directory.
You can edit the file utterance_metadata_XXX.json (where XXX is the target language code), make manual adjustments, and generate the video again.
See an example JSON:
"utterances": [
{
"start": 7.607843750000001,
"end": 8.687843750000003,
"speaker_id": "SPEAKER_00",
"path": "short/chunk_7.607843750000001_8.687843750000003.mp3",
"text": "And I love this city.",
"for_dubbing": true,
"gender": "Male",
"translated_text": "I m'encanta aquesta ciutat.",
"assigned_voice": "ca-ES-EnricNeural",
"speed": 1.3,
"dubbed_path": "short/dubbed_chunk_7.607843750000001_8.687843750000003.mp3",
"hash": "b11d7f0e2aa5475e652937469d89ef0a178fecea726f076095942d552944089f"
},
Imagine that you have changed the translated_text. To generate the post-edited video:
open-dubbing --input_file video.mp4 --target_language=cat --hugging_face_token=TOKEN --update
The update parameter changes the behavior of open-dubbing: instead of producing a full dubbing, it rebuilds the existing dubbing, incorporating any changes made to the JSON file.
Fields that are useful to modify are: translated_text, gender (of the voice) and speed.
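The manual edit can also be scripted. The sketch below loads the metadata file, changes fields on one utterance identified by its start time, and writes the file back. It assumes the file is a JSON object with a top-level "utterances" list, as the example fragment suggests; `update_utterance` is a hypothetical helper, not part of open_dubbing.

```python
import json
from pathlib import Path


def update_utterance(metadata_path, start, **changes):
    """Apply `changes` to the utterance whose "start" time matches `start`.

    Returns True if an utterance was updated, False otherwise.
    """
    path = Path(metadata_path)
    data = json.loads(path.read_text(encoding="utf-8"))
    for utterance in data["utterances"]:
        # Start times are floats, so compare with a small tolerance.
        if abs(utterance["start"] - start) < 1e-6:
            utterance.update(changes)
            # ensure_ascii=False keeps accented characters readable in the file.
            path.write_text(
                json.dumps(data, ensure_ascii=False, indent=4),
                encoding="utf-8",
            )
            return True
    return False
```

For example, calling `update_utterance(path, 7.607843750000001, translated_text="M'encanta aquesta ciutat.", speed=1.2)`, where `path` points to the metadata file in your output directory, would change the utterance shown in the example; running open-dubbing again with the update parameter then rebuilds the video.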
For more detailed documentation on how the tool works and how to use it, see our documentation page.
Core libraries used:
And very special thanks to ariel, from which we leveraged parts of the code base.
See license
Email address: Jordi Mas: jmas@softcatala.org