freddyaboulton / gradio-webrtc

MIT License

how to determine_pause in ReplyOnPause #21

Closed liaoweiguo closed 3 days ago

liaoweiguo commented 5 days ago

I cannot find the logic that detects a user pause.

BTW, I want to control how long a user pause must be before it triggers a reply; the default setting seems too sensitive.

    def determine_pause(
        self, audio: np.ndarray, sampling_rate: int, state: AppState
    ) -> bool:
        """Take in the stream, determine if a pause happened"""
        duration = len(audio) / sampling_rate

        if duration >= self.algo_options.audio_chunk_duration:
            dur_vad = self.model.vad((sampling_rate, audio), self.model_options)
            logger.debug("VAD duration: %s", dur_vad)
            if (
                dur_vad > self.algo_options.started_talking_threshold
                and not state.started_talking
            ):
                state.started_talking = True
                logger.debug("Started talking")
            if state.started_talking:
                if state.stream is None:
                    state.stream = audio
                else:
                    state.stream = np.concatenate((state.stream, audio))
            state.buffer = None
            if dur_vad < self.algo_options.speech_threshold and state.started_talking:
                return True
        return False

liaoweiguo commented 5 days ago

It seems that algo_options.audio_chunk_duration is the duration of the pause, not of the total audio, but I'm not sure.

liaoweiguo commented 4 days ago

How can I make speech detection less sensitive? I set speech_threshold=0.2 but still get a lot of empty input.

freddyaboulton commented 4 days ago

Hi @liaoweiguo - I prepared some docs on this: https://freddyaboulton.github.io/gradio-webrtc/advanced-configuration/#reply-on-pause-voice-activity-detection

audio_chunk_duration is the chunk size used to run VAD. By default it's 0.6 seconds. So if you set speech_threshold=0.2, a chunk with less than 33.33% voice activity (0.2/0.6) counts as a pause. I would set it lower for your case. You can also try setting started_talking_threshold higher.
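
To make the arithmetic concrete, here is a minimal sketch of that threshold check. `voiced_fraction` and `is_pause` are hypothetical helper names, not part of gradio-webrtc; the comparison is assumed to be the `dur_vad < speech_threshold` test from the determine_pause snippet earlier in this thread.

```python
def voiced_fraction(threshold_s: float, chunk_s: float) -> float:
    """Fraction of a chunk that `threshold_s` seconds of speech represents."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    return threshold_s / chunk_s


def is_pause(dur_vad: float, speech_threshold: float, started_talking: bool) -> bool:
    """Hypothetical helper mirroring the final check in determine_pause.

    dur_vad is the number of seconds of speech the VAD found in one chunk.
    """
    return started_talking and dur_vad < speech_threshold


# 0.2 s of speech in a 0.6 s chunk is one third of the chunk.
print(f"{voiced_fraction(0.2, 0.6):.2%}")  # 33.33%

# Illustrative (made-up) VAD durations checked against speech_threshold=0.2.
print(is_pause(0.15, 0.2, started_talking=True))   # True: below the threshold
print(is_pause(0.25, 0.2, started_talking=True))   # False: still speaking
print(is_pause(0.0, 0.2, started_talking=False))   # False: user never started
```

Lowering speech_threshold means a chunk must contain less speech before it is treated as a pause, so replies fire later and less often.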

liaoweiguo commented 4 days ago

this seems better:

            fn=ReplyOnPause(
                response,
                output_sample_rate=OUT_RATE,
                output_frame_size=480,
                algo_options=AlgoOptions(
                    audio_chunk_duration=0.6,
                    started_talking_threshold=0.3,
                    speech_threshold=0.2,
                ),
            ),
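
A rough way to see why a higher started_talking_threshold helps with empty input is to replay a few chunks through a simplified version of the determine_pause logic. This is only a sketch: `Options` is a stand-in for AlgoOptions, `first_pause_index` is a hypothetical helper, and the per-chunk VAD durations are made up rather than produced by a real model.

```python
from dataclasses import dataclass


@dataclass
class Options:
    """Stand-in for AlgoOptions, holding the three values used above."""
    audio_chunk_duration: float = 0.6
    started_talking_threshold: float = 0.3
    speech_threshold: float = 0.2


def first_pause_index(vad_durations, opts):
    """Return the index of the first chunk that counts as a pause, or None.

    vad_durations holds seconds of detected speech per chunk (made-up data).
    """
    started_talking = False
    for i, dur_vad in enumerate(vad_durations):
        # The user only "starts talking" once a chunk has enough speech.
        if dur_vad > opts.started_talking_threshold:
            started_talking = True
        # After that, a chunk with too little speech is treated as a pause.
        if started_talking and dur_vad < opts.speech_threshold:
            return i
    return None


# A short noise burst (0.25 s), near silence, real speech, then silence.
chunks = [0.25, 0.1, 0.5, 0.6, 0.1]

print(first_pause_index(chunks, Options()))  # 4
print(first_pause_index(chunks, Options(started_talking_threshold=0.2)))  # 1
```

With the start threshold at 0.3 the noise burst is ignored and the pause fires after the real speech; at 0.2 the burst counts as talking and the next quiet chunk triggers a reply on almost no audio.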

freddyaboulton commented 3 days ago

Nice!