KoljaB / RealtimeSTT

A robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation and instant transcription.
MIT License

Why setting up the audio input stream for recording at the highest possible sample rate? #123


sangheonEN commented 1 month ago
    # These helpers are nested inside _audio_data_worker; they rely on the
    # module-level imports: logging, numpy as np, pyaudio, and scipy.signal.
    def get_highest_sample_rate(audio_interface, device_index):
        """Get the highest supported sample rate for the specified device."""
        try:
            device_info = audio_interface.get_device_info_by_index(device_index)
            max_rate = int(device_info['defaultSampleRate'])

            if 'supportedSampleRates' in device_info:
                supported_rates = [int(rate) for rate in device_info['supportedSampleRates']]
                if supported_rates:
                    max_rate = max(supported_rates)

            return max_rate
        except Exception as e:
            logging.warning(f"Failed to get highest sample rate: {e}")
            return 48000  # Fallback to a common high sample rate
    def initialize_audio_stream(audio_interface, device_index, sample_rate, chunk_size):
        """Initialize the audio stream with error handling."""
        try:
            stream = audio_interface.open(
                format=pyaudio.paInt16,
                channels=1,
                rate=sample_rate,
                input=True,
                frames_per_buffer=chunk_size,
                input_device_index=device_index,
            )
            return stream
        except Exception as e:
            logging.error(f"Error initializing audio stream: {e}")
            raise
    def preprocess_audio(chunk, original_sample_rate, target_sample_rate):
        """Preprocess audio chunk similar to feed_audio method."""
        if isinstance(chunk, np.ndarray):
            # Handle stereo to mono conversion if necessary
            if chunk.ndim == 2:
                chunk = np.mean(chunk, axis=1)
            # Resample to target_sample_rate if necessary
            if original_sample_rate != target_sample_rate:
                num_samples = int(len(chunk) * target_sample_rate / original_sample_rate)
                chunk = signal.resample(chunk, num_samples)
            # Ensure data type is int16
            chunk = chunk.astype(np.int16)
        else:
            # If chunk is bytes, convert to numpy array
            chunk = np.frombuffer(chunk, dtype=np.int16)
            # Resample if necessary
            if original_sample_rate != target_sample_rate:
                num_samples = int(len(chunk) * target_sample_rate / original_sample_rate)
                chunk = signal.resample(chunk, num_samples)
                chunk = chunk.astype(np.int16)
        return chunk.tobytes()
    audio_interface = None
    stream = None
    device_sample_rate = None
    chunk_size = 1024  # Increased chunk size for better performance

I'm curious about these helper functions that were added to _audio_data_worker. Why do we need to record at the highest possible sample rate?
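For reference, the resampling step in preprocess_audio above can be checked in isolation with a synthetic buffer (the 48000 -> 16000 rates here are just example values, not something the library mandates):

```python
import numpy as np
from scipy import signal

# Synthetic one-second stereo int16 buffer at 48 kHz (a 440 Hz sine on both channels).
rate_in, rate_out = 48000, 16000
t = np.arange(rate_in) / rate_in
stereo = np.stack([np.sin(2 * np.pi * 440 * t)] * 2, axis=1)
stereo = (stereo * 32767).astype(np.int16)

# Stereo -> mono, then resample to the target rate, mirroring preprocess_audio.
mono = np.mean(stereo, axis=1)
num_samples = int(len(mono) * rate_out / rate_in)
resampled = signal.resample(mono, num_samples).astype(np.int16)

print(len(resampled))   # 16000 samples, i.e. one second at 16 kHz
print(resampled.dtype)  # int16, ready for .tobytes()
```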

KoljaB commented 1 month ago

This has already been changed here to try 16000 Hz first. Only after that does it fall back to the highest possible sample rate: since we need to downsample to 16000 Hz anyway, I assume the downsampling quality is best when the original material was sampled at the highest rate.

sangheonEN commented 1 month ago

> Already changed here to start with 16000. After that highest possible sample rate because we need to downsample to 16000 and I assume quality for that to be best with highest sampled original material.

Oh, so you're saying that the sample rate starts at 16,000, but you're also considering higher sample rate values?

KoljaB commented 1 month ago

Yes. The reason is the following: we very much want to record from the microphone at 16000 Hz, because that is the sample rate that all the VAD detection algorithms and the Whisper model work at. So, if possible, we record at 16000 Hz and avoid any losses from downsampling. If that doesn't work because the sound card does not support recording at 16000 Hz, then we want to record at the highest possible quality, because downsampling from, say, 22000 Hz to 16000 Hz might result in chunks of poor quality that can't be processed well enough, especially by Silero VAD, which is sensitive to imperfections in the chunks.
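A minimal sketch of that downsampling step (not RealtimeSTT's actual code; scipy's polyphase resampler is used here purely as an illustration). When the capture rate is an integer multiple of 16000 Hz, e.g. 48000 Hz, the conversion is a clean decimation by 3; an awkward ratio such as 22050 -> 16000 needs a 320/441 polyphase filter, which is where quality can degrade:

```python
from math import gcd

import numpy as np
from scipy import signal

def downsample_to_16k(chunk: np.ndarray, capture_rate: int) -> np.ndarray:
    """Polyphase-resample a mono int16 chunk down to 16 kHz."""
    g = gcd(16000, capture_rate)
    up, down = 16000 // g, capture_rate // g  # e.g. 1/3 for 48 kHz, 320/441 for 22.05 kHz
    out = signal.resample_poly(chunk, up, down)
    return out.astype(np.int16)

# One second of a 440 Hz sine at each capture rate.
sec_48k = (np.sin(2 * np.pi * 440 * np.arange(48000) / 48000) * 32767).astype(np.int16)
sec_22k = (np.sin(2 * np.pi * 440 * np.arange(22050) / 22050) * 32767).astype(np.int16)

print(len(downsample_to_16k(sec_48k, 48000)))  # 16000 (simple decimation by 3)
print(len(downsample_to_16k(sec_22k, 22050)))  # 16000 (320/441 polyphase filtering)
```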