facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

Denoise audio with Demucs and pipeline with Transcriber #441

Closed am831 closed 1 month ago

am831 commented 2 months ago

Adds a Demucs class that denoises audio with Demucs and pipelines it with the Transcriber. When creating a Transcriber object, the user can choose whether to apply denoising and can pass further parameters to customize it. Demucs separates the original file into vocals and other sounds. Currently the user only has access to the separated vocals; the other WAV files are discarded.

Currently, you need to manually install demucs to use this: pip install git+https://github.com/facebookresearch/demucs#egg=demucs
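To make the separation step concrete, here is a minimal sketch of calling Demucs from Python to obtain only the vocals stem. It is not the Demucs class from this PR. It assumes the `demucs.separate` command-line entry point with its `--two-stems`, `-n`, and `-o` options, the `htdemucs` model name, and the default `<out>/<model>/<track>/<stem>.wav` output layout. The helper names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_demucs_command(audio_path, out_dir="separated", model="htdemucs"):
    """Build a Demucs CLI call that splits a file into vocals and everything else."""
    return [
        sys.executable, "-m", "demucs.separate",
        "--two-stems", "vocals",  # write vocals.wav and no_vocals.wav only
        "-n", model,              # pretrained model name (assumed: htdemucs)
        "-o", str(out_dir),       # root folder for the separated stems
        str(audio_path),
    ]


def expected_vocals_path(audio_path, out_dir="separated", model="htdemucs"):
    """Path where the Demucs CLI is assumed to write the vocals stem."""
    return Path(out_dir) / model / Path(audio_path).stem / "vocals.wav"


def denoise(audio_path, out_dir="separated", model="htdemucs"):
    """Run Demucs (must be installed) and return the path of the vocals stem."""
    subprocess.run(build_demucs_command(audio_path, out_dir, model), check=True)
    return expected_vocals_path(audio_path, out_dir, model)
```

Calling `denoise("example.wav")` would run the separation and return the vocals path. The `no_vocals.wav` file is left on disk here, whereas the PR discards it.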

Transcription performance was evaluated on both the noisy audio and the denoised audio. Results are slightly better after denoising: average CER drops from 0.18 to 0.16 and average edit distance from 44.1 to 38.6.

About the dataset:

Total hours of audio: 20.96
Mean audio file duration: 15.72 seconds

I used a subset of the VOiCES dataset, called VOiCES_devkit: https://iqtlabs.github.io/voices/

Evaluation of Seamless transcription performance on the noisy audio:

Average CER: 0.18
Average Edit Distance: 44.1

Evaluation of Seamless transcription performance on the dataset after denoising:

Average CER: 0.16
Average Edit Distance: 38.6
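For reference, the two metrics above can be computed roughly as follows. This is a sketch under assumptions, not the evaluation script used for these numbers: it takes edit distance to be character-level Levenshtein distance and CER to be that distance divided by the reference length, with no text normalization. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters yields the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def average_metrics(pairs):
    """Mean CER and mean edit distance over (reference, hypothesis) pairs."""
    pairs = list(pairs)
    mean_cer = sum(cer(r, h) for r, h in pairs) / len(pairs)
    mean_dist = sum(edit_distance(r, h) for r, h in pairs) / len(pairs)
    return mean_cer, mean_dist
```

Running `average_metrics` once on the noisy transcripts and once on the denoised ones gives a pair of numbers comparable to those reported above, provided the same normalization is applied to both runs.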

Use this example for manual testing:

import torch

from seamless_communication.inference import Transcriber

model_name = "seamlessM4T_v2_large"

# Load the model on CPU in full precision.
transcriber = Transcriber(
    model_name,
    device=torch.device("cpu"),
    dtype=torch.float32,
)

# denoise=True runs Demucs on the input before transcription.
txt = transcriber.transcribe(audio="example.wav", src_lang="eng", denoise=True)

print("Transcribed text:", txt)