Speech Dataset Generator by David Martin Rius

This repository is dedicated to creating datasets suitable for training text-to-speech or speech-to-text models. The primary functionality involves transcribing audio files, enhancing audio quality when necessary, and generating datasets.

Here are the key functionalities of the project:

Dataset Generation: Creation of multilingual datasets with Mean Opinion Score (MOS).
Silence Removal: It includes a feature to remove silences from audio files, enhancing the overall quality.
Sound Quality Improvement: It improves the quality of the audio when needed.
Audio Segmentation: It can segment audio files within specified second ranges.
Transcription: The project transcribes the segmented audio, providing a textual representation.
Gender Identification: It identifies the gender of each speaker in the audio.
Pyannote Embeddings: Utilizes pyannote embeddings for speaker detection across multiple audio files.
Automatic Speaker Naming: Automatically assigns names to speakers detected in multiple audios.
Multiple Speaker Detection: Capable of detecting multiple speakers within each audio file.
Speaker Embedding Storage: Speaker embeddings are stored in a Chroma database, so you do not need to assign speaker names manually.
Speech Rate Metrics: Words per minute (wpm) and syllables per minute (spm) are computed for each sentence.
Multiple input sources: You can either use your own files or download content by pasting URLs from sources such as YouTube, LibriVox and TED Talks.
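
To illustrate how stored embeddings can let speaker names carry over between audio files, here is a minimal sketch of nearest-neighbour matching by cosine similarity. It does not call Chroma or pyannote; the function names, the 0.7 threshold, and the toy 3-dimensional vectors are assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_speaker(embedding, known_speakers, threshold=0.7):
    """Return the closest known speaker name, or None if none is similar enough.

    known_speakers maps a speaker name to its stored embedding.
    The threshold is a hypothetical value, not one taken from the project.
    """
    best_name, best_score = None, threshold
    for name, stored in known_speakers.items():
        score = cosine_similarity(embedding, stored)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy embeddings; real speaker embeddings have far more dimensions.
known = {"Speaker12": [0.9, 0.1, 0.0], "Speaker45": [0.0, 0.2, 0.9]}
print(match_speaker([0.8, 0.2, 0.1], known))  # closest to Speaker12
print(match_speaker([0.0, 1.0, 0.0], known))  # nothing stored is close enough
```

A vector database performs the same kind of lookup at scale, which is why a speaker seen in one file can be recognized again in another.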

Example of the output folder:

outputs
|-- main_data.csv
|
|-- chroma_database
|
|-- enhanced_audios
|
|-- ljspeech
|   |-- wavs
|   |   |-- 1272-128104-0000.wav
|   |   |-- 1272-128104-0001.wav
|   |   |-- ...
|   |   |-- 1272-128104-0225.wav
|   |-- metadata.csv
|
|-- librispeech
|   |-- speaker_id1
|   |   |-- book_id1
|   |   |   |-- transcription.txt
|   |   |   |-- file1.wav
|   |   |   |-- file2.wav
|   |   |   |-- ...
|   |-- speaker_id2
|   |   |-- book_id1
|   |   |   |-- transcription.txt
|   |   |   |-- file1.wav
|   |   |   |-- file2.wav
|   |   |   |-- ...
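
As a quick sanity check after a run, the layout above can be inspected from Python. This is a minimal sketch that assumes the folder names shown in the tree (ljspeech, librispeech); the function name is hypothetical and not part of the project.

```python
from pathlib import Path


def summarize_outputs(output_dir):
    """Count the .wav files found under each dataset folder of an output directory."""
    root = Path(output_dir)
    summary = {}
    for dataset in ("ljspeech", "librispeech"):
        folder = root / dataset
        if folder.is_dir():
            # rglob also descends into speaker/book subfolders
            summary[dataset] = sum(1 for _ in folder.rglob("*.wav"))
    return summary
```

Calling summarize_outputs("outputs") returns a dictionary such as {"ljspeech": ..., "librispeech": ...}, containing only the dataset folders that actually exist.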

Example of the main_data.csv content:

Note that the values below are purely fictitious and serve only to illustrate the file layout.


| text                    | audio_filename               | speaker_id     | gender     | duration    | language    | words_per_minute   | syllables_per_minute |
|-------------------------|------------------------------|----------------|------------|-------------|-------------|--------------------|----------------------|
| Hello, how are you?     | wavs/1272-128104-0000.wav    | Speaker12      | male       | 4.5         | en          | 22.22              | 1.11                 |
| Hola, ¿cómo estás?      | wavs/1272-128104-0001.wav    | Speaker45      | female     | 6.2         | es          | 20.97              | 0.81                 |
| This is a test.         | wavs/1272-128104-0002.wav    | Speaker23      | male       | 3.8         | en          | 26.32              | 1.32                 |
| ¡Adiós!                 | wavs/1272-128104-0003.wav    | Speaker67      | female     | 7.0         | es          | 16.43              | 0.57                 |
| ...                     | ...                          | ...            | ...        | ...         | ...         | ...                | ...                  |
| Goodbye!                | wavs/1272-128104-0225.wav    | Speaker78      | male       | 5.1         | en          | 1.41               | 1.18                 |
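
A file with these columns can be filtered with the standard library. The sketch below assumes main_data.csv is comma-separated with a header row and quoted text fields, which this README does not confirm; the sample rows reuse the fictitious values from the table.

```python
import csv
import io

# Stand-in for outputs/main_data.csv (assumed comma-separated with a header row).
SAMPLE = '''text,audio_filename,speaker_id,gender,duration,language,words_per_minute,syllables_per_minute
"Hello, how are you?",wavs/1272-128104-0000.wav,Speaker12,male,4.5,en,22.22,1.11
"Hola, ¿cómo estás?",wavs/1272-128104-0001.wav,Speaker45,female,6.2,es,20.97,0.81
This is a test.,wavs/1272-128104-0002.wav,Speaker23,male,3.8,en,26.32,1.32
'''


def select_rows(csv_file, language, min_duration=0.0):
    """Return rows in the given language lasting at least min_duration seconds."""
    reader = csv.DictReader(csv_file)
    return [
        row for row in reader
        if row["language"] == language and float(row["duration"]) >= min_duration
    ]


rows = select_rows(io.StringIO(SAMPLE), language="en", min_duration=4.0)
for row in rows:
    print(row["audio_filename"], row["speaker_id"], row["duration"])
```

For a real file, pass open(path, newline="", encoding="utf-8") instead of the StringIO object.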

Installation

Please note that this project has been tested and verified to work on Ubuntu 22. It has not been tested on macOS, Windows, or other Unix distributions.


python3.10 -m venv venv 

source venv/bin/activate

pip install -r requirements.txt

or

pip install -e .

# If you are going to use this program outside of this project folder, set PYTHONPATH:
export PYTHONPATH=/path/to/your/speech-dataset-generator:$PYTHONPATH

Required agreements to run the code

Important: You must agree to share your contact information to access the pyannote embedding model. The pyannote speaker diarization model may require a similar agreement.

HuggingFace

You need to provide a HuggingFace token in a .env file:

HF_TOKEN=yourtoken
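
Before a long run it can help to confirm that the token is actually readable. The project lists python-dotenv among its packages, and its load_dotenv function would normally handle this; the sketch below is a stdlib-only stand-in, and both function names are hypothetical.

```python
import os


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file into a dict.

    Blank lines, comment lines starting with '#', and lines without '=' are skipped.
    """
    values = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_hf_token(path=".env"):
    """Return HF_TOKEN from the environment or the .env file, or raise if missing."""
    token = os.environ.get("HF_TOKEN")
    if not token and os.path.exists(path):
        token = load_env_file(path).get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to the .env file")
    return token
```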

Usage

The main script speech_dataset_generator/main.py accepts command-line arguments for specifying the input file, output directory, time range, and types of enhancers. You can process a single file or an entire folder of audio files. You can also use a YouTube video or playlist as input.


python speech_dataset_generator/main.py --input_file_path <path_to_audio_file> --output_directory <output_directory> --range_times <start-end> --enhancers <enhancer_types>

--input_file_path: (source) Path to the input audio file. Cannot be used with --input_folder.
--input_folder: (source) Path to the input folder containing audio files. Cannot be used with --input_file_path.
--youtube_download: (source) One or more links to YouTube videos or playlists, separated by spaces.
--librivox_download: (source) One or more links to LibriVox audiobooks, separated by spaces.
--tedtalks_download: (source) One or more TED Talks audio or video links, separated by spaces. Copy these links from the "Download" section behind the Share button, which offers MP4 and Audio.
--output_directory: Output directory for audio files.
--range_times: A range of two integers in the format "start-end". Default is 4-10. Note that the segment boundaries are first determined by WhisperX and cannot be modified; this parameter only narrows down and filters the segments WhisperX produces.
--enhancers: Audio enhancers to apply, for example --enhancers deepfilternet resembleai mayavoz. They are executed in the order given. By default no enhancer is used. Currently deepfilternet gives the best results when enhancing and denoising audio.
--datasets: Extra dataset types to generate: metavoice and librispeech (librispeech is in beta). Example: --datasets metavoice librispeech
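
The options above could be modelled with argparse roughly as follows. This is an illustrative sketch based only on the descriptions in this list, not the parser used in main.py; in particular, making --output_directory required and validating that start is less than end are assumptions.

```python
import argparse


def parse_range(value):
    """Turn a 'start-end' string such as '4-10' into a (start, end) tuple of ints."""
    start_text, sep, end_text = value.partition("-")
    if not sep:
        raise argparse.ArgumentTypeError("expected the format start-end, e.g. 4-10")
    try:
        start, end = int(start_text), int(end_text)
    except ValueError:
        raise argparse.ArgumentTypeError("start and end must be integers") from None
    if start < 0 or end <= start:
        raise argparse.ArgumentTypeError("expected 0 <= start < end")
    return start, end


def build_parser():
    parser = argparse.ArgumentParser(description="Illustrative parser, not the project's own")
    # A single file and a folder cannot be combined, as documented above.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--input_file_path")
    source.add_argument("--input_folder")
    for flag in ("--youtube_download", "--librivox_download", "--tedtalks_download"):
        parser.add_argument(flag, nargs="+", default=[])
    parser.add_argument("--output_directory", required=True)
    parser.add_argument("--range_times", type=parse_range, default=(4, 10))
    parser.add_argument("--enhancers", nargs="+", default=[],
                        choices=["deepfilternet", "resembleai", "mayavoz"])
    parser.add_argument("--datasets", nargs="+", default=[],
                        choices=["metavoice", "librispeech"])
    return parser


args = build_parser().parse_args([
    "--input_file_path", "file.mp3",
    "--output_directory", "out",
    "--range_times", "5-15",
    "--enhancers", "deepfilternet", "resembleai",
])
print(args.range_times, args.enhancers)
```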

Examples:

Input from a file:

#No enhancer is used
python speech_dataset_generator/main.py --input_file_path /path/to/audio/file.mp3 --output_directory /output/directory --range_times 5-10 --datasets metavoice

#Using deepfilternet enhancer
python speech_dataset_generator/main.py --input_file_path /path/to/audio/file.mp3 --output_directory /output/directory --range_times 4-10 --enhancers deepfilternet

#Using resembleai enhancer
python speech_dataset_generator/main.py --input_file_path /path/to/audio/file.mp3 --output_directory /output/directory --range_times 4-10 --enhancers resembleai

# Combining enhancers
python speech_dataset_generator/main.py --input_file_path /path/to/audio/file.mp3 --output_directory /output/directory --range_times 4-10 --enhancers deepfilternet resembleai

Input from a folder:

python speech_dataset_generator/main.py --input_folder /path/to/folder/of/audios --output_directory /output/directory --range_times 4-10 --enhancers deepfilternet

Input from YouTube (single video or playlists):

# YouTube single video
python speech_dataset_generator/main.py --youtube_download https://www.youtube.com/watch\?v\=ID --output_directory /output/directory --range_times 5-15 --enhancers deepfilternet resembleai

# Combining a YouTube video + input file
python speech_dataset_generator/main.py --youtube_download https://www.youtube.com/watch\?v\=ID  --input_file_path /path/to/audio/file.mp3 --output_directory /output/directory --range_times 5-15 --enhancers deepfilternet resembleai

# Combining a YouTube video + input folder
python speech_dataset_generator/main.py --youtube_download https://www.youtube.com/watch\?v\=ID  --input_folder /path/to/folder/of/audios --output_directory /output/directory --range_times 5-15 --enhancers deepfilternet resembleai

Input from LibriVox (one or multiple audiobooks):

# LibriVox single audiobook
python speech_dataset_generator/main.py --librivox_download https://librivox.org/audio-book-url/ --output_directory /output/directory --range_times 5-15 --enhancers deepfilternet resembleai

# Multiple LibriVox audiobooks at once; this example uses two, but you can pass any number of URLs
python speech_dataset_generator/main.py --librivox_download https://librivox.org/audio-book-url/ https://librivox.org/another-audio-book-url/ --output_directory /output/directory --range_times 5-15 --enhancers deepfilternet resembleai

#Combining a LibriVox audiobook + input file
python speech_dataset_generator/main.py --librivox_download https://librivox.org/audio-book-url/  --input_file_path /path/to/audio/file.mp3 --output_directory /output/directory --range_times 5-15 --enhancers deepfilternet resembleai

#Combining LibriVox audiobook + input folder
python speech_dataset_generator/main.py --librivox_download https://librivox.org/audio-book-url/  --input_folder /path/to/folder/of/audios --output_directory /output/directory --range_times 5-15 --enhancers deepfilternet resembleai

# You can also combine YouTube downloads with LibriVox
python speech_dataset_generator/main.py --librivox_download https://librivox.org/audio-book-url/ --youtube_download https://www.youtube.com/watch\?v\=ID --output_directory /output/directory --range_times 5-15 --enhancers deepfilternet resembleai

Input from Ted Talks (one or multiple Ted Talks):

# Ted Talks single video
python speech_dataset_generator/main.py --tedtalks_download https://download.ted.com/talks/video-talk.mp3 --output_directory /output/directory --range_times 5-15 --enhancers deepfilternet resembleai

# Multiple Ted Talks videos at once; this example uses two, but you can pass any number of URLs
python speech_dataset_generator/main.py --tedtalks_download https://download.ted.com/talks/video-talk.mp3 https://download.ted.com/talks/another-video-talk.mp3 --output_directory /output/directory --range_times 5-15 --enhancers deepfilternet resembleai

#Combining a Ted Talks video + input file
python speech_dataset_generator/main.py --tedtalks_download https://download.ted.com/talks/video-talk.mp3  --input_file_path /path/to/audio/file.mp3 --output_directory /output/directory --range_times 5-15 --enhancers deepfilternet resembleai

#Combining Ted Talks video + input folder
python speech_dataset_generator/main.py --tedtalks_download https://download.ted.com/talks/video-talk.mp3  --input_folder /path/to/folder/of/audios --output_directory /output/directory --range_times 5-15 --enhancers deepfilternet resembleai

Notes

Multilingual:

This project uses Whisper, which makes it multilingual. See the Whisper documentation for the list of currently supported languages.

Audio enhancer argument

You can combine --enhancers. The available enhancers are "deepfilternet", "resembleai" and "mayavoz".

If you pass more than one, they are executed in the order in which they are passed. You can combine all of them in a single run.

By default, no enhancer is used.
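
The ordering rule can be pictured as a simple chain in which each enhancer receives the output of the previous one. The sketch below uses placeholder callables that only record that they ran; the function name and the registry are hypothetical and do not reflect how the project wires the real enhancers.

```python
def apply_enhancers(audio, enhancer_names, registry):
    """Run the named enhancers over `audio` in the order given.

    registry maps an enhancer name to a callable that takes and returns audio.
    An empty list of names leaves the audio untouched.
    """
    for name in enhancer_names:
        if name not in registry:
            raise ValueError(f"unknown enhancer: {name}")
        audio = registry[name](audio)
    return audio


# Placeholder enhancers: each one appends its name instead of processing samples.
registry = {
    "deepfilternet": lambda audio: audio + ["deepfilternet"],
    "resembleai": lambda audio: audio + ["resembleai"],
    "mayavoz": lambda audio: audio + ["mayavoz"],
}

print(apply_enhancers([], ["deepfilternet", "resembleai"], registry))
print(apply_enhancers([], [], registry))
```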

Deepfilternet

I suggest using deepfilternet for noisy audio. It gives the best results when denoising.

Resembleai

The output of resembleai can sometimes be slightly distorted, so it is not always a good choice. It can both denoise and enhance. If you are combining deepfilternet and resembleai, you can disable resembleai denoising.

You can adjust the resembleai parameters in audio_manager.py:

solver = "midpoint" # available solvers: "rk4", "euler" and "midpoint" (the default)

denoising = True

nfe = 128 # range from 1 to 128; if the output sounds like a cassette, reduce this value

tau = 0 # range from 0 to 1; better if disabled (0)
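
If you experiment with these values, a small validation step can catch out-of-range settings early. This is a hypothetical helper built only from the ranges stated above; the class name is an assumption and it is not part of audio_manager.py.

```python
from dataclasses import dataclass

VALID_SOLVERS = ("midpoint", "rk4", "euler")


@dataclass
class ResembleSettings:
    """Hypothetical container for the parameters listed above, with range checks."""
    solver: str = "midpoint"
    denoising: bool = True
    nfe: int = 128
    tau: float = 0.0

    def __post_init__(self):
        if self.solver not in VALID_SOLVERS:
            raise ValueError(f"solver must be one of {VALID_SOLVERS}")
        if not 1 <= self.nfe <= 128:
            raise ValueError("nfe must be between 1 and 128")
        if not 0 <= self.tau <= 1:
            raise ValueError("tau must be between 0 and 1")


# Example: fewer function evaluations, denoising left to deepfilternet.
settings = ResembleSettings(nfe=64, denoising=False)
print(settings)
```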

Mayavoz

The pretrained mayavoz model only works with a sampling rate of 16000 Hz. It is only recommended if the input source is also sampled at 16000 Hz.
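
To check whether a file meets this requirement before choosing mayavoz, the sample rate can be read from the WAV header. The sketch below uses the standard library wave module, so it only covers uncompressed PCM WAV files (not MP3); the function names are hypothetical.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_mayavoz_compatible(path, required_rate=16000):
    """True if the WAV file is sampled at the rate the pretrained model expects."""
    return wav_sample_rate(path) == required_rate
```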

The audio is not always 100% split into sub-files

An input audio file may not be used completely. Here are some reasons:

The range_times do not fit a transcribed segment.
The segment contains music or its quality is too low (MOS under 3), even when enhanced.
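
The two reasons above amount to a duration check and a quality check. Here is a minimal sketch of that filter, assuming the range is inclusive and that a MOS of exactly 3 is accepted; the function name and the segment dictionaries are hypothetical, and music detection is left out.

```python
def keep_segment(duration, mos, range_times=(4, 10), min_mos=3.0):
    """Keep a segment only if its duration fits the range and its MOS is high enough."""
    shortest, longest = range_times
    return shortest <= duration <= longest and mos >= min_mos


segments = [
    {"text": "too short", "duration": 2.1, "mos": 4.2},
    {"text": "kept", "duration": 6.0, "mos": 3.8},
    {"text": "low quality", "duration": 7.5, "mos": 2.4},
]
kept = [s for s in segments if keep_segment(s["duration"], s["mos"])]
print([s["text"] for s in kept])
```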

If you are not using enhancers and segments are being discarded because of bad quality, try the --enhancers argument with deepfilternet, resembleai, mayavoz, or a combination of them. See the Examples section to learn how to use it.

Gender detection

You can use an input audio with multiple speakers and multiple genders. Each speaker will be separated into a fragment and from that fragment the gender will be identified.

This project includes an example audio file for this case at ./assets/example_audio_1.mp3. You can try it without coding in speech_dataset_generator_example.ipynb.

Next Steps

External input sources

[X] Youtube
[X] Librivox
[X] Ted talks

Vector database

[X] Store speaker embeddings in Chroma vector database

Refactor code

[X] Everything is inside main.py. The code needs to be reorganized.

Speech rate

[X] Detect the speech rate of each sentence and add it to the CSV output file. The metrics are words per minute (wpm) and syllables per minute (spm).
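
Both metrics are counts divided by the duration in minutes. The sketch below shows the arithmetic with a rough vowel-group heuristic for syllables; this heuristic is an assumption and not the project's method, and it overcounts some words (for example, "are" is counted as two syllables).

```python
import re


def count_syllables(word):
    """Rough syllable count: number of vowel groups, with a minimum of one."""
    groups = re.findall(r"[aeiouyáéíóúü]+", word.lower())
    return max(1, len(groups))


def speech_rate(text, duration_seconds):
    """Return (words per minute, syllables per minute), rounded to two decimals."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    # Words are runs of letters, optionally joined by one apostrophe (e.g. "don't").
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text)
    minutes = duration_seconds / 60
    wpm = len(words) / minutes
    spm = sum(count_syllables(word) for word in words) / minutes
    return round(wpm, 2), round(spm, 2)


wpm, spm = speech_rate("Hello, how are you?", duration_seconds=4.0)
print(wpm, spm)
```

With four words in four seconds the sentence comes out at one word per second, that is 60 words per minute.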

Audio enhancers

[X] deepfilternet
[X] resembleai
[X] mayavoz
[ ] espnet speech enhancement

Docker image

[ ] Create a docker image for ease of use.

Example of docker usage (image not available yet)

docker run -p 4000:80 -e HF_TOKEN=your_hf_token \
  -v /your/local/output/folder:/app/output \
  --gpus all \
  davidmartinrius/speech-dataset-generator \
  --input_file_path /app/assets/example_audio_1.wav \
  --output_directory /app/output \
  --range_times 4-10 \
  --enhancers deepfilternet resembleai

docker run -p 4000:80 -e HF_TOKEN=your_hf_token \
  -v /your/local/output/folder:/app/output \
  -v /your/audio/file.mp3:/app/file.wav \
  --gpus all \
  davidmartinrius/speech-dataset-generator \
  --input_file_path /app/file.wav \
  --output_directory /app/output \
  --range_times 4-10 \
  --enhancers deepfilternet resembleai

Google colab

[ ] Add a speech_dataset_generator_example.ipynb file with all available options applied to some noisy audios and good quality audios.

Add age classification and new gender classification

[X] https://github.com/Anvarjon/Age-Gender-Classification In the end, this will not be integrated because it is too imprecise: it only works well on the dataset it was trained on, not on unseen samples.

Emotion recognition

[ ] https://github.com/ddlBoJack/emotion2vec

Upload to PyPi

[ ] Still pending. There is an uploaded PyPI package, but it does not work yet. There are issues with setup.py because some of the required packages are not available on PyPI. I am still looking for a way to install those packages. For now, install the package from requirements.txt or setup.py.

Support multiple datasets

Generator of multiple types of datasets:

[X] LJSpeech: This is the default. Every newly generated dataset is written in LJSpeech format. It does not yet split into train/dev/test, but it creates a metadata.csv.
[X] Metavoice-src Example of the dataset: https://github.com/metavoiceio/metavoice-src/blob/main/datasets/sample_dataset.csv
[ ] LibriSpeech Currently in development. Work in progress
[ ] Common Voice 11
[ ] VoxPopuli
[ ] TED-LIUM
[ ] GigaSpeech
[ ] SPGISpeech
[ ] Earnings-22
[ ] AMI
[ ] VCTK

Dataset converter.

For example, from LibriSpeech to Common Voice and vice versa, etc.

I have to look for a way to extract all the needed features for each dataset type. Also find the best way to divide the dataset into train, dev and test taking into account the input data provided by the user.
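
One possible direction for the split, shown here only as a sketch and not as a decision for this project, is to assign whole speakers to a set using a stable hash of the speaker id, so that the same speaker never appears in both train and test. The function names and the 10/10 percentages are assumptions.

```python
import hashlib


def assign_split(speaker_id, dev_percent=10, test_percent=10):
    """Map a speaker to 'train', 'dev' or 'test' using a stable hash of its id."""
    digest = hashlib.md5(speaker_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # a number from 0 to 99, fixed per speaker
    if bucket < test_percent:
        return "test"
    if bucket < test_percent + dev_percent:
        return "dev"
    return "train"


def split_rows(rows, **kwargs):
    """Group rows (dicts with a 'speaker_id' key) by the split of their speaker."""
    splits = {"train": [], "dev": [], "test": []}
    for row in rows:
        splits[assign_split(row["speaker_id"], **kwargs)].append(row)
    return splits
```

Because the assignment depends only on the speaker id, re-running the generator on more data keeps earlier speakers in the same set. With few speakers the actual proportions can differ considerably from the requested percentages.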

Gradio interface

[ ] Generate datasets
[ ] Dataset converter

Runpod serverless instance

For cases where you do not have a GPU or you want to distribute this as a service.

RunPod is an on-demand cloud GPU provider. It integrates well with Python and Docker, and its pricing is affordable.

[ ] Explain how to create a storage in runpod
[ ] Create a base install to the storage with a Pod
[ ] Launch a serverless instance with a Docker instance of this project
[ ] Call the serverless custom API endpoints to upload files, download generated datasets, convert datasets to other types of datasets, etc

Packages used in this project

This project uses several open-source libraries and tools for audio processing. Special thanks to the contributors of these projects.

Python 3.10
whisperx (v3.1.1)
faster-whisper (1.0.0)
pydub (v0.25.1)
python-dotenv (v1.0.1)
inaSpeechSegmenter (v0.7.7)
unsilence (v1.0.9)
deepfilternet
resemble-enhance (v0.0.1)
speechmetrics
pyannote (embedding model and speaker diarization model)
yt-dlp
Chroma
mayavoz

License

If you plan to use this project in yours: whisperX is currently under the BSD-4-Clause license, yt-dlp is released under the Unlicense, and all others are under the MIT or Apache 2.0 license.

This project is licensed under the MIT License.

Give it a star ⭐️

Did you find this project useful? If so, please consider giving it a star! Your support is greatly appreciated and helps to increase the visibility of the project. Thank you! 😊
