Ways to transcribe real-time/detect the end of speaking in indefinite file?

kingcharlezz commented 1 year ago

Hello all. I am working on a project pertaining to ASR in phone calls. After being dissatisfied with some of the commercial options, I wanted to try this. Is there a built-in way to know when the other party is not talking? Or something like whisper.cpp's stream function? I have seen mention of VAD in the docs, but I am not sure how to elegantly apply it to my problem. Any comments are appreciated.

Thanks.

EtienneAb3d commented 1 year ago

@kingcharlezz, You may have a look at this project: https://github.com/mallorbc/whisper_mic

JonathanFly commented 1 year ago

I use faster-whisper on a real-time livestream (so infinite duration) and it works great. (More than great, actually: I can run two large faster-whisper models simultaneously and get both transcription and translation, it's so fast!)

For the VAD, you can pass in vad_filter=True, and by default it will look for 2-second silences to break on (min_silence_duration_ms = 2000).

Also check out the non-VAD no_speech_threshold and log_prob_threshold options.
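
A minimal sketch of how these options might be wired together, not a definitive setup. The helper names build_transcribe_options and transcribe_file, the "small" model size, and the CPU/int8 settings are assumptions for illustration; the threshold values shown are the commonly documented defaults, so check them against your installed version. VAD tuning values go inside the vad_parameters dict:

```python
from typing import Any, Dict


def build_transcribe_options(min_silence_ms: int = 2000) -> Dict[str, Any]:
    """Keyword arguments for WhisperModel.transcribe() with the VAD filter on."""
    return {
        "vad_filter": True,
        # VAD tuning values are passed through this dict.
        "vad_parameters": {"min_silence_duration_ms": min_silence_ms},
        # Non-VAD filters (assumed defaults; verify for your version).
        "no_speech_threshold": 0.6,
        "log_prob_threshold": -1.0,
    }


def transcribe_file(path: str, model_size: str = "small") -> None:
    """Print timestamped segments for one audio file (hypothetical helper)."""
    # Imported lazily so the options helper works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, **build_transcribe_options(500))
    for segment in segments:  # segments is a lazy generator
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

For phone calls, a shorter silence such as 500 ms may feel more responsive than the 2-second default, at the cost of splitting on brief pauses.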

More specific VAD options, from vad.py: pass these under exactly the same names in faster-whisper's vad_parameters argument, and the values get passed through to the VAD:


def get_speech_timestamps(
    audio: np.ndarray,
    *,
    threshold: float = 0.5,
    min_speech_duration_ms: int = 250,
    max_speech_duration_s: float = float("inf"),
    min_silence_duration_ms: int = 2000,
    window_size_samples: int = 1024,
    speech_pad_ms: int = 200,
) -> List[dict]:
    """This method is used for splitting long audios into speech chunks using silero VAD.
    Args:
      audio: One dimensional float array.
      threshold: Speech threshold. Silero VAD outputs speech probabilities for each audio chunk,
        probabilities ABOVE this value are considered as SPEECH. It is better to tune this
        parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
      min_speech_duration_ms: Final speech chunks shorter than min_speech_duration_ms are thrown out.
      max_speech_duration_s: Maximum duration of speech chunks in seconds. Chunks longer
        than max_speech_duration_s will be split at the timestamp of the last silence that
        lasts more than 100ms (if any), to prevent aggressive cutting. Otherwise, they will be
        split aggressively just before max_speech_duration_s.
      min_silence_duration_ms: In the end of each speech chunk wait for min_silence_duration_ms
        before separating it
      window_size_samples: Audio chunks of window_size_samples size are fed to the silero VAD model.
        WARNING! Silero VAD models were trained using 512, 1024, 1536 samples for 16000 sample rate.
        Values other than these may affect model performance!!
      speech_pad_ms: Final speech chunks are padded by speech_pad_ms each side
    Returns:
      List of dicts containing begin and end samples of each speech chunk.
    """

For livestreams, the biggest bottleneck in my opinion, after the VAD, is noise reduction. I pipe the live audio through OBS with the NVIDIA noise reduction filter before sending it to faster-whisper. It makes a night-and-day difference in Whisper performance on audio with lots of background music or noise. For phone calls you can probably get away without doing that, though.

kingcharlezz commented 1 year ago

Appreciate this! It seems to accomplish what I need it to do. Thanks for the in-depth responses.

lpy-ET commented 1 year ago

Hi @JonathanFly, could you please give more info on how you proceed to use this "real time livestream" with infinite duration, please?

JonathanFly commented 1 year ago

I threw it up here: https://github.com/JonathanFly/faster-whisper-livestream-translator

I left it in a not-great state, but you can get the idea. It's a messy fork of https://github.com/fortypercnt/stream-translator
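
For readers who just want the general shape of an "infinite duration" loop, here is a rough sketch under stated assumptions, not the code from either repository. It assumes the stream delivers raw 16 kHz, 16-bit mono PCM (for example from an audio decoder's stdout), and the names stream_windows and transcribe_stream, the 5-second interval, and the 10-second history are all hypothetical:

```python
import io
from typing import BinaryIO, Callable, Iterator

SAMPLE_RATE = 16000   # assumed input format: 16 kHz
BYTES_PER_SAMPLE = 2  # assumed input format: 16-bit mono PCM


def stream_windows(
    stream: BinaryIO, interval_s: float = 5.0, history_s: float = 10.0
) -> Iterator[bytes]:
    """Yield a rolling window of recent audio each time a new chunk arrives."""
    chunk_bytes = int(SAMPLE_RATE * interval_s) * BYTES_PER_SAMPLE
    max_bytes = int(SAMPLE_RATE * history_s) * BYTES_PER_SAMPLE
    window = b""
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:  # end of stream; a live source would simply keep going
            break
        # Keep only the most recent history_s seconds of audio.
        window = (window + chunk)[-max_bytes:]
        yield window


def transcribe_stream(
    stream: BinaryIO, transcribe: Callable[[bytes], str]
) -> Iterator[str]:
    """Run a caller-supplied transcribe function on each rolling window."""
    for window in stream_windows(stream):
        yield transcribe(window)
```

The transcribe callable is where a model call with vad_filter=True would go; keeping a few seconds of history gives the model context across chunk boundaries, at the cost of re-transcribing overlapping audio.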

SYSTRAN / faster-whisper

Ways to transcribe real-time/detect the end of speaking in indefinite file? #151