Segment audio with Silero VAD and pipeline with Transcriber

Implements Silero VAD segmentation of long audio in the SileroVADSegmenter class and pipelines that class with the Transcriber. When creating a Transcriber object, the user can specify the segment chunk size and the pause length. By default, the segment chunk size is 20 seconds, and audio longer than the chunk size is segmented automatically. Silero's built-in segmentation functions cannot guarantee a maximum segment length, so this implementation instead uses a probabilistic divide-and-conquer algorithm that guarantees no segment exceeds the max segment length. A threshold value is also used to filter out windows of audio that have a low probability of containing speech.
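To illustrate the idea, here is a minimal sketch of threshold filtering followed by divide-and-conquer splitting over per-window speech probabilities. It is not the code in SileroVADSegmenter: the function names (`speech_regions`, `split_region`, `segment`), the window-index representation, and the choice to cut at the least speech-like window are all assumptions made for illustration, and pause length is not modeled.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open range of window indices [start, end)


def speech_regions(probs: List[float], threshold: float) -> List[Segment]:
    """Return maximal runs of consecutive windows with probability >= threshold."""
    regions: List[Segment] = []
    run_start = None
    for i, p in enumerate(probs):
        if p >= threshold and run_start is None:
            run_start = i  # a speech run begins here
        elif p < threshold and run_start is not None:
            regions.append((run_start, i))  # the run ended at the previous window
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(probs)))
    return regions


def split_region(probs: List[float], start: int, end: int, max_windows: int) -> List[Segment]:
    """Recursively split [start, end) until every piece has at most max_windows windows."""
    if end - start <= max_windows:
        return [(start, end)]
    # Cut at the interior window least likely to contain speech.
    # The cut is strictly inside the range, so both halves are non-empty and shorter.
    cut = min(range(start + 1, end), key=lambda i: probs[i])
    return split_region(probs, start, cut, max_windows) + split_region(probs, cut, end, max_windows)


def segment(probs: List[float], threshold: float = 0.5, max_windows: int = 4) -> List[Segment]:
    """Hypothetical entry point: filter by threshold, then enforce the length limit."""
    if max_windows < 1:
        raise ValueError("max_windows must be at least 1")
    segments: List[Segment] = []
    for start, end in speech_regions(probs, threshold):
        segments.extend(split_region(probs, start, end, max_windows))
    return segments


# Made-up probabilities, one per fixed-size window of audio.
probs = [0.1, 0.9, 0.8, 0.95, 0.6, 0.9, 0.85, 0.2, 0.7, 0.9]
segments = segment(probs, threshold=0.5, max_windows=4)
print(segments)  # [(1, 4), (4, 7), (8, 10)]
```

Because each recursive call works on a strictly shorter range, the recursion always terminates and every returned segment respects the limit. For very long inputs an iterative version would avoid Python's recursion limit.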

Segmentation performance was evaluated for Silero VAD and Pydub, each used with Seamless for transcription. Silero performed slightly better, so it was selected for implementation. The evaluation was performed on a randomly sampled subset of GigaSpeech under different conditions.

About the dataset:

Total hours of audio: 3.6
Mean audio duration: 0.3 hours
Audio files were selected randomly from the first 150 GB of GigaSpeech

Evaluation of Silero using probabilistic divide and conquer algorithm WITHOUT probability threshold filtering, 1 second pause length and 10 second max segment length:

Average CER: 0.69
Average edit distance: 12603.0

Segmentation statistics:
Mean segment duration: 5.9 sec
Max segment duration: 7.9 sec
Mean number of segments per audio sample: 182.6

Evaluation of Silero using probabilistic divide and conquer algorithm WITH probability threshold filtering, 1 second pause length and 10 second max segment length:

Average CER: 0.69
Average edit distance: 11894.5

Segmentation statistics:
Mean segment duration: 6.7 sec
Max segment duration: 9.9 sec
Mean number of segments per audio sample: 141.5

Evaluation of Silero using Silero's segmentation algorithm (previous commits), 1 second pause length and 10 second max segment length:

Average CER: 0.67
Average edit distance: 11370.3

Segmentation statistics:
Mean segment duration: 7.7 sec
Max segment duration: 107.6 sec
Mean number of segments per audio sample: 124.3
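
The 107.6 sec maximum here is far above the 10 second limit, which is the gap the divide-and-conquer approach is meant to close. A small sketch of how statistics like these could be computed and checked against a limit is shown here. The helper name `segment_stats` and the `(start, end)` tuples in seconds are assumptions for illustration, not the evaluation notebook's code, and the sample values are made up.

```python
from statistics import mean
from typing import Dict, List, Tuple


def segment_stats(segments: List[Tuple[float, float]], max_len_sec: float) -> Dict[str, float]:
    """Summarize segment durations (in seconds) and count those over the limit."""
    durations = [end - start for start, end in segments]
    return {
        "mean": mean(durations),
        "max": max(durations),
        "count": len(durations),
        "over_limit": sum(1 for d in durations if d > max_len_sec),
    }


# Made-up segments: the last one is longer than the 10 second limit.
segments = [(0.0, 7.5), (8.5, 17.0), (18.0, 42.0)]
stats = segment_stats(segments, max_len_sec=10.0)
print(stats)
```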

Evaluation of Pydub segmentation with 1 second pause length and 10 second max segment length:

Average CER: 0.69
Average edit distance: 11642.9

Segmentation statistics:
Mean segment duration: 9.4 sec
Max segment duration: 10.0 sec
Mean number of segments per audio sample: 114.1
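
For reference, the CER and edit distance metrics reported above can be computed roughly as follows. This is a sketch using the standard character-level Levenshtein distance. The exact text normalization and averaging used in the evaluation are not stated in this description, so treat the function names and details here as assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between a reference and a hypothesis."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(edit_distance("kitten", "sitting"))  # 3
print(cer("abcd", "abxd"))                 # 0.25
```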

The Colab notebook I created for the evaluation is available here: https://colab.research.google.com/drive/1KCRlGnUfKU5_T8bl_YFo_bf7ZawoMb6n?usp=sharing

Use this example for manual testing:

import torch

from seamless_communication.inference import Transcriber

model_name = "seamlessM4T_v2_large"

# Build the transcriber on CPU with full-precision weights.
transcriber = Transcriber(
    model_name,
    device=torch.device("cpu"),
    dtype=torch.float32,
)

input_audio = "example.wav"

# Audio longer than the segment chunk size is segmented automatically.
txt = transcriber.transcribe(audio=input_audio, src_lang="eng")

print("Transcribed text: ", txt)

Segment audio with Silero VAD and pipeline with Transcriber #406