Open sangheonEN opened 1 month ago
Already changed here to start with 16000. After that highest possible sample rate because we need to downsample to 16000 and I assume quality for that to be best with highest sampled original material.
Already changed here to start with 16000. After that highest possible sample rate because we need to downsample to 16000 and I assume quality for that to be best with highest sampled original material.
Oh, so you're saying that the sample rate starts at 16,000, but you're also considering higher sample rate values?
Yes. The reason is the following: we want to record from the microphone with 16000 Hz very much because this is the sample rate all VAD detection algorithms and the whisper model is working on. So if possible we record in 16000 so we don't run into losses due to downsampling. If this does not work because the sound card is not supporting recording in 16000 Hz, then we want to record in the highest possible quality because downsampling from like 22000 Hz to 16000 Hz might result in chunks that are bad quality and thus can't be processed well enough especially for Silero VAD, which is sensible towards imperfections in chunks.
I'm curious about the need for the member functions of the added _audio_data_worker function. Why do we need as high a sample rate as possible?