A desktop application that transcribes audio from files, microphone input, or YouTube videos, with the option to translate the content and create subtitles.
Audiotext transcribes the audio from an audio file, video file, microphone input, directory, or YouTube video into any of the 99 different languages it supports. You can transcribe using the Google Speech-to-Text API, the Whisper API, or WhisperX. The last two methods can even translate the transcription or generate subtitles!
You can also choose the theme you like best. It can be dark, light, or the one configured in the system.
Install FFmpeg to run the program. Otherwise, it won't be able to process the audio files.

To check if you have it installed on your system, run `ffmpeg -version`. It should return something similar to this:
```
ffmpeg version 5.1.2-essentials_build-www.gyan.dev Copyright (c) 2000-2022 the FFmpeg developers
built with gcc 12.1.0 (Rev2, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-lzma --enable-zlib --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-sdl2 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libaom --enable-libopenjpeg --enable-libvpx --enable-libass --enable-libfreetype --enable-libfribidi --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-ffnvcodec --enable-nvdec --enable-nvenc --enable-d3d11va --enable-dxva2 --enable-libmfx --enable-libgme --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libtheora --enable-libvo-amrwbenc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-librubberband
libavutil      57. 28.100 / 57. 28.100
libavcodec     59. 37.100 / 59. 37.100
libavformat    59. 27.100 / 59. 27.100
libavdevice    59.  7.100 / 59.  7.100
libavfilter     8. 44.100 /  8. 44.100
libswscale      6.  7.100 /  6.  7.100
libswresample   4.  7.100 /  4.  7.100
```
If the output is an error, your system cannot find the `ffmpeg` system variable, which probably means FFmpeg isn't installed on your system. To install `ffmpeg`, open a command prompt and run one of the following commands, depending on your operating system:
```
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
```
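If you'd rather verify the installation programmatically, here is a minimal Python sketch (not part of Audiotext itself) that mirrors the `ffmpeg -version` check above:

```python
# Minimal sketch: check whether FFmpeg is available on the PATH.
# Illustrative only; not Audiotext's actual startup check.
import shutil
import subprocess

def ffmpeg_available() -> bool:
    """Return True if the `ffmpeg` executable can be found and runs."""
    if shutil.which("ffmpeg") is None:
        return False
    # Equivalent to running `ffmpeg -version` in a shell.
    result = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    print("FFmpeg found" if ffmpeg_available() else "FFmpeg is missing")
```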
To run the program, open the `audiotext` folder and double-click the `Audiotext` executable file.

To run it from the source code:

1. Clone the repository by running `git clone https://github.com/HenestrosaDev/audiotext.git`.
2. Change to the `audiotext` directory by running `cd audiotext`.
3. (Optional) Create a virtual environment. For example, with `virtualenv`, you would run `virtualenv venv`.
4. (Optional but recommended) Activate the virtual environment:

   ```
   # on Windows
   . venv/Scripts/activate
   # if you get the error `FullyQualifiedErrorId : UnauthorizedAccess`, run this:
   Set-ExecutionPolicy Unrestricted -Scope Process
   # and then
   . venv/Scripts/activate

   # on macOS and Linux
   source venv/bin/activate
   ```

5. Run `pip install -r requirements.txt` to install the dependencies.
6. Run `pip install -r requirements-dev.txt` to install the development dependencies.
7. Run `pre-commit install` to install the pre-commit hooks in your `.git/` directory.
8. Copy the `.env.example` file as `.env` to the root of the directory.
9. Run `python src/app.py` to start the program.

Note: You may get an error when installing the `pyaudio` package. Here is a StackOverflow post explaining how to solve this issue.

Note: The line `pprint(response_text, indent=4)` in the `recognize_google` function from the `__init__.py` file of the `SpeechRecognition` package has to be commented out to avoid opening a command line along with the GUI. Otherwise, the program would not be able to use the Google API transcription method, because `pprint` throws an error if it cannot print to the CLI, preventing the code from generating the transcription. The same applies to the lines using the `logger` package in the `moviepy/audio/io/ffmpeg_audiowriter` file from the `moviepy` package. There is also a change in line 169 that replaces `logger=logger` with `logger=None` to avoid more errors related to opening the console.

Once you open the `Audiotext` executable file (explained in the Getting Started section), you'll see something like this:
### Transcription Language

The target language for the transcription. If you use the Whisper API or WhisperX transcription methods, you can set this to a language other than the one spoken in the audio in order to translate it into the selected language.

For example, to translate an English audio into French, you would set `Transcription language` to French, as shown in the video below:
https://github.com/user-attachments/assets/e68d9b90-3978-4ffb-9b62-bd3d57a1a33d
This is an unofficial way to perform translations, so be sure to double-check the generated transcription for errors.
### Transcription Method

There are three transcription methods available in Audiotext:
Google Speech-To-Text API (hereafter referred to as Google API): Requires an Internet connection. It doesn't punctuate sentences (the punctuation is produced by Audiotext), and the resulting transcriptions often require manual adjustment because their quality is lower than that of the Whisper API or WhisperX. In its free tier, usage is limited to 60 minutes per month, but this limit can be extended by adding an API key.
Whisper API: Requires an Internet connection. This method is intended for people whose machines are not powerful enough to run WhisperX gracefully. It has fewer options than WhisperX, but the quality of the transcriptions is similar to those generated by the `large-v2` model of WhisperX. However, you need to set an OpenAI API key to use this method. See the Whisper API Key section for more information.
WhisperX: Selected by default. It doesn't require an Internet connection because the entire transcription process takes place locally on your computer. As a result, it's much more demanding of hardware resources than the other remote transcription methods. WhisperX can run on CPUs and CUDA GPUs, although it performs better on the latter. The quality of the transcription depends on the selected model size and computation type. In addition, WhisperX offers a wider range of features, including a more customizable subtitle generation process than the Whisper API and more output file types. It has no usage restrictions while remaining completely free.
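For reference, this is roughly what a local WhisperX transcription looks like in code. It is a minimal sketch using the `whisperx` package's documented API, not Audiotext's actual implementation; the file name and option values are placeholders:

```python
# Sketch of a local WhisperX transcription (assumes `pip install whisperx`).
# Illustrative only; Audiotext's internal code may differ.
import whisperx

device = "cuda"           # or "cpu" if no CUDA GPU is available
compute_type = "float16"  # "int8" is the usual choice on CPU

# Load the ASR model (model size and compute type are covered in Advanced Options).
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# Load the audio and transcribe it in batches.
audio = whisperx.load_audio("paranoid-android.mp3")  # placeholder file
result = model.transcribe(audio, batch_size=16)

for segment in result["segments"]:
    print(f'[{segment["start"]:.2f} -> {segment["end"]:.2f}] {segment["text"]}')
```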
### Audio Source

You can transcribe from four different audio sources:
File (see image above): Click the file explorer icon to select the file you want to transcribe, or manually enter the path to the file in the `Path` input field. You can transcribe audio from both audio and video files.

Note that the file explorer has the `All supported files` option selected by default. To select only audio files or video files, click the combo box in the lower right corner of the file explorer to change the file type, as marked in red in the following image:
Directory: Click the file explorer icon to select the directory containing the files you want to transcribe, or manually enter the path to the directory in the `Path` input field. Note that the `Autosave` option is checked and cannot be unchecked, because each file's transcription will automatically be saved in the same path as the source file.
For example, let's use the following directory as a reference:
```
└───files-to-transcribe
    │   paranoid-android.mp3
    │   the-past-recedes.flac
    │
    └───movies
            mulholland-dr-2001.avi
            seul-contre-tous-1998.mp4
```
After transcribing the `files-to-transcribe` directory using WhisperX, with the `Overwrite existing files` option unchecked and the output file types `.vtt` and `.txt` selected, the folder structure will look like this:
```
└───files-to-transcribe
    │   paranoid-android.mp3
    │   paranoid-android.txt
    │   paranoid-android.vtt
    │   the-past-recedes.flac
    │   the-past-recedes.txt
    │   the-past-recedes.vtt
    │
    └───movies
            mulholland-dr-2001.avi
            mulholland-dr-2001.txt
            mulholland-dr-2001.vtt
            seul-contre-tous-1998.mp4
            seul-contre-tous-1998.txt
            seul-contre-tous-1998.vtt
```
If we transcribe the directory again with the Google API and the `Overwrite existing files` option unchecked, Audiotext won't process any files, because there are already `.txt` files corresponding to all the files in the directory. However, if we added the file `endors-toi.wav` to the root of `files-to-transcribe`, it would be the only file processed, because it doesn't have a `.txt` file associated with it. The same would happen in the WhisperX scenario, since `endors-toi.wav` has no generated transcription files.
Note that if we check the `Overwrite existing files` option, all files will be processed again and the existing transcription files will be overwritten.
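The skip-or-overwrite behavior described above boils down to an existence check on the expected output files. Here is a hedged sketch of that logic; the function name and paths are hypothetical, not Audiotext's actual code:

```python
# Sketch of the autosave skip/overwrite decision (hypothetical names).
from pathlib import Path

def should_transcribe(source: Path, extensions: list[str], overwrite: bool) -> bool:
    """Process a file unless every selected output type already exists."""
    if overwrite:
        return True  # `Overwrite existing files` is checked: always process
    outputs = [source.with_suffix(ext) for ext in extensions]
    return not all(path.exists() for path in outputs)

# endors-toi.wav has no .txt next to it, so it would be processed:
print(should_transcribe(Path("files-to-transcribe/endors-toi.wav"), [".txt"], overwrite=False))
```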
Microphone: Click the `Start recording` button to begin recording. The button's text will change to `Stop recording` and its color will change to red. Click it again to stop recording and generate the transcription.
Here is a video demonstrating this feature:
https://github.com/user-attachments/assets/61f2173b-bcfb-4251-a910-0cf6b37598c6
Note that your operating system must recognize an input source; otherwise, an error message will appear in the text box indicating that no input source was detected.
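Under the hood, microphone capture in Python typically goes through the `SpeechRecognition` and `pyaudio` packages mentioned earlier. The following is a minimal sketch of that approach, not Audiotext's exact code:

```python
# Sketch of microphone capture with the SpeechRecognition package
# (requires `pip install SpeechRecognition pyaudio`). Illustrative only.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:           # raises OSError if no input device exists
    recognizer.adjust_for_ambient_noise(source)
    print("Recording... speak now.")
    audio = recognizer.listen(source)     # stops after a pause in speech

# Transcribe the recording with the free Google API tier.
print(recognizer.recognize_google(audio, language="en-US"))
```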
YouTube video: Requires an Internet connection to get the audio of the video. To generate the transcription, simply enter the URL of the video in the `YouTube video URL` field and click the `Generate transcription` button when you are finished adjusting the settings.
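Fetching a YouTube video's audio track can be done with a downloader such as `yt-dlp`; the sketch below is an illustrative assumption and may not match the library Audiotext actually uses:

```python
# Sketch: download only the audio of a YouTube video with yt-dlp
# (requires `pip install yt-dlp`). Audiotext's downloader may differ.
import yt_dlp

url = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder URL

options = {
    "format": "bestaudio/best",           # prefer the best audio-only stream
    "outtmpl": "downloaded_audio.%(ext)s",
}

with yt_dlp.YoutubeDL(options) as ydl:
    ydl.download([url])
```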
### Save Transcription

When you click the `Save transcription` button, a file explorer opens so you can name the transcription file and select the path where you want to save it. Please note that any text entered or modified in the text box WILL NOT be included in the saved transcription.
### Autosave

Unchecked by default. If checked, the transcription will automatically be saved in the root of the folder where the file to transcribe is stored. Existing files with the same name won't be overwritten; to do that, you'll need to check the `Overwrite existing files` option (see below).

Note that if you create a transcription using the `Microphone` or `YouTube` audio sources with the `Autosave` option enabled, the transcription files will be saved in the root of the `audiotext-vX.X.X` directory.
### Overwrite Existing Files

This option can only be checked if the `Autosave` option is checked. If `Overwrite existing files` is checked, existing transcriptions in the root directory of the file to be transcribed will be overwritten when saving.
For example, let's use this directory as a reference:
```
└───audios
        foo.mp3
        foo.srt
        foo.txt
```
If we transcribe the audio file `foo.mp3` with the output file types `.json`, `.txt`, and `.srt` and the `Autosave` and `Overwrite existing files` options checked, the files `foo.srt` and `foo.txt` will be overwritten and the file `foo.json` will be created.
On the other hand, if we transcribe the audio file `foo.mp3` with the same output file types and the `Autosave` option checked but `Overwrite existing files` unchecked, the file `foo.json` will still be created, but the files `foo.srt` and `foo.txt` will remain unchanged.
### Google API Options

The `Google API options` frame appears if the selected transcription method is **Google API**. See the [Transcription Method](#transcription-method) section to learn more about the **Google API**.
#### Google API Key

Since the program uses the free **Google API** tier by default, which allows you to transcribe up to 60 minutes of audio per month for free, you may need to add an API key if you want to make extensive use of this feature. To do so, click the `Set API key` button. You'll be presented with a dialog box where you can enter your API key, which will **only** be used to make requests to the API.
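For context, the `SpeechRecognition` package that backs this method accepts such a key directly. A minimal sketch, illustrative rather than Audiotext's actual code:

```python
# Sketch: pass a Google Speech-to-Text API key to recognize_google.
# Illustrative only; Audiotext reads the key from its GUI dialog.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.AudioFile("speech.wav") as source:  # placeholder file
    audio = recognizer.record(source)

# Without `key`, the free default tier is used instead.
text = recognizer.recognize_google(audio, key="YOUR_GOOGLE_API_KEY")
print(text)
```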
Remember that **WhisperX** provides fast, unlimited audio transcription that supports translation and subtitle generation for free, unlike the **Google API**. Also note that Google charges for the use of the API key, for which **Audiotext** is not responsible.

### Whisper API Options

The `Whisper API options` frame appears if the selected transcription method is **Whisper API**. See the [Transcription Method](#transcription-method) section to learn more about the **Whisper API**.
#### Whisper API Key

As noted in the [Transcription Method](#transcription-method) section, an [OpenAI API key](https://platform.openai.com/api-keys) is required to use this transcription method. Otherwise, you won't be able to use it. To add it, click the `Set OpenAI API key` button. You'll be presented with a dialog box where you can enter your API key, which will **only** be used to make requests to the API.
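To illustrate what such a request looks like, here is a hedged sketch using the official `openai` Python SDK; the parameters shown (`response_format`, `temperature`, `timestamp_granularities`) are described in the sections below:

```python
# Sketch: a Whisper API transcription request with the openai SDK
# (requires `pip install openai`). Illustrative only.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # placeholder key

with open("speech.mp3", "rb") as audio_file:    # placeholder file
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",  # required for timestamp granularities
        temperature=0,
        timestamp_granularities=["segment"],
    )

print(transcript)
```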
OpenAI charges for the use of the API key, for which **Audiotext** is not responsible. See the [Troubleshooting](#troubleshooting) section if you get error `429` on your first request with an API key.

#### Response Format

The format of the transcript output, in one of these options:

- `json`
- `srt` (subtitle file type)
- `text`
- `verbose_json`
- `vtt` (subtitle file type)

Defaults to `text`.

#### Temperature

The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit. Defaults to 0.

#### Timestamp Granularities

The timestamp granularities to populate for this transcription. `Response format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word` or `segment`.

**Note**: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency. Defaults to `segment`.

### WhisperX Options

The **WhisperX** options appear when the selected transcription method is **WhisperX**. You can select the output file types of the transcription and whether to translate the transcription into English.
#### Output File Types

You can select one or more of the following transcription output file types:

- `.aud`
- `.json`
- `.srt` (subtitle file type)
- `.tsv`
- `.txt`
- `.vtt` (subtitle file type)

If you select one of the two subtitle file types (`.vtt` and `.srt`), the `Subtitle options` frame will be displayed with more options (read more [here](#subtitle-options)).

#### Translate to English

To translate the transcription to English, simply check the `Translate to English` checkbox before generating the transcription, as shown in the video below.

https://github.com/user-attachments/assets/e614201c-25f2-4ec7-8478-3b63aade0c44

If you want to translate the audio to another language, check the [Transcription Language](#transcription-language) section.

### Subtitle Options

When you select the `.srt` and/or the `.vtt` output file type(s), the `Subtitle options` frame will be displayed. Note that the input options only apply to the `.srt` and `.vtt` files:
To get the subtitle file(s) after the audio is transcribed, you can either check the `Autosave` option before generating the transcription or click `Save transcription` and select the path where you want to save them, as explained in the [Save Transcription](#save-transcription) section.

#### Highlight Words

Underline each word as it's spoken in `.srt` and `.vtt` subtitle files. Not checked by default.

#### Max. Line Count

The maximum number of lines in a segment. `2` by default.

#### Max. Line Width

The maximum number of characters in a line before breaking the line. `42` by default.

### Advanced Options

When you click the `Show advanced options` button in the `WhisperX options` frame, the `Advanced options` frame appears, as shown in the figure below.
It's highly recommended that you don't change the default configuration unless you're having problems with **WhisperX** or you know exactly what you're doing, especially the `Compute type` and `Batch size` options. Change them at your own risk and be aware that you may experience problems, such as having to reboot your system if the GPU runs out of VRAM.

#### Model Size

There are five main ASR (Automatic Speech Recognition) model sizes that offer tradeoffs between speed and accuracy. The larger the model size, the more VRAM it uses and the longer it takes to transcribe. Unfortunately, **WhisperX** hasn't provided specific performance data for each model, so the table below is based on the one detailed in [OpenAI's Whisper README](https://github.com/openai/whisper). According to **WhisperX**, the `large-v2` model requires <8 GB of GPU memory and batches inference for 70x real-time transcription (taken from the project's [README](https://github.com/m-bain/whisperX)).

| Model    | Parameters | Required VRAM |
|:--------:|:----------:|:-------------:|
| `tiny`   | 39 M       | ~1 GB         |
| `base`   | 74 M       | ~1 GB         |
| `small`  | 244 M      | ~2 GB         |
| `medium` | 769 M      | ~5 GB         |
| `large`  | 1550 M     | <8 GB         |

> [!NOTE]
> `large` is divided into three versions: `large-v1`, `large-v2`, and `large-v3`.

The default model size is `large-v2`, since `large-v3` has some bugs that weren't as common in `large-v2`, such as hallucination and repetition, especially for certain languages like Japanese. There are also more prevalent problems with missing punctuation and capitalization. See the announcements for the [`large-v2`](https://github.com/openai/whisper/discussions/661) and [`large-v3`](https://github.com/openai/whisper/discussions/1762) models for more insight into their differences and the issues encountered with each.

The larger the model size, the lower the WER (Word Error Rate, in %). The table below is taken from [this Medium article](https://blog.ml6.eu/fine-tuning-whisper-for-dutch-language-the-crucial-role-of-size-dd5a7012d45f), which analyzes the performance of pre-trained Whisper models on common Dutch speech.

| Model    | WER   |
|:--------:|:-----:|
| tiny     | 50.98 |
| small    | 17.90 |
| large-v2 | 7.81  |

#### Compute Type

This term refers to different data types used in computing, particularly in the context of numerical representation. It determines how numbers are stored and represented in a computer's memory. The higher the precision, the more resources will be needed and the better the transcription will be.

There are three possible values for **Audiotext**:

- `int8`: Default if using a CPU. It represents whole numbers without any fractional part. Its size is 8 bits (1 byte) and it can represent integer values from -128 to 127 (signed) or 0 to 255 (unsigned). It is used in scenarios where memory efficiency is critical, such as in quantized neural networks or edge devices with limited computational resources.
- `float16`: Default if using a CUDA GPU. It's a half-precision type representing 16-bit floating point numbers. Its size is 16 bits (2 bytes). It has a smaller range and precision compared to `float32`. It's often used in applications where memory is a critical resource, such as in deep learning models running on GPUs or TPUs.
- `float32`: Recommended for CUDA GPUs with more than 8 GB of VRAM. It's a single-precision type representing 32-bit floating point numbers, which is a standard for representing real numbers in computers. Its size is 32 bits (4 bytes). It can represent a wide range of real numbers with a reasonable level of precision.

#### Batch Size

This option determines how many samples are processed together before the model parameters are updated. It doesn't affect the quality of the transcription, only the generation speed (the smaller, the slower). For simplicity, let's divide the possible batch size values into two groups:

- **Small batch size (0