majenkotech closed this issue 8 months ago
If you don't use wake words, just change this line in the source of audio_recorder.py:

```python
SAMPLE_RATE = 16000
```

That should be sufficient in this case.
If you DO use wake words, things get a bit more complicated, since the sample rate gets overridden by pvporcupine, which handles the wake words. In that case you'd want to change the sample rate for _audio_data_worker in the source to your desired PipeWire sample rate: https://github.com/KoljaB/RealtimeSTT/blob/master/RealtimeSTT/audio_recorder.py#L641-L649. Then resample the recorded chunk back to 16000 Hz before writing it into the audio_queue in line 683, because I guess pvporcupine will complain otherwise. So maybe something like this (untested):
```python
import librosa
import numpy as np

# Convert the raw int16 buffer to float32 in [-1.0, 1.0)
audio_chunk = np.frombuffer(
    data,
    dtype=np.int16
).astype(np.float32) / 32768.0

# Resample from the PipeWire rate back down to the 16000 Hz the pipeline expects
audio_chunk = librosa.resample(
    audio_chunk,
    orig_sr=pipewire_samplerate,
    target_sr=16000
)

# Convert back to int16, clipping to the valid range
scaled_audio = np.clip(audio_chunk * 32768, -32768, 32767)
data = scaled_audio.astype(np.int16)
```
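If pulling in librosa feels heavy just for this, a crude alternative (my own sketch, not part of RealtimeSTT) is linear interpolation with plain numpy. The quality is worse than a proper polyphase resampler, but it shows the same int16 → float32 → resample → int16 round trip end to end:

```python
import numpy as np

def resample_linear(chunk: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Naive linear-interpolation resampler for mono float32 audio."""
    n_out = int(round(len(chunk) * target_sr / orig_sr))
    x_old = np.linspace(0.0, 1.0, num=len(chunk), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, chunk).astype(np.float32)

# Simulate a 1024-frame int16 buffer captured at 48000 Hz
data = (np.sin(np.linspace(0, 20 * np.pi, 1024)) * 20000).astype(np.int16).tobytes()

audio_chunk = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0
audio_chunk = resample_linear(audio_chunk, orig_sr=48000, target_sr=16000)
scaled = np.clip(audio_chunk * 32768, -32768, 32767).astype(np.int16)

print(len(scaled))  # 1024 * 16000 / 48000 ≈ 341 samples
```

Note how much shorter the resampled chunk is than the captured buffer; that matters for the VAD error discussed below.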
All changing the sample rate does is change the error. Now I get:
```
Exception in thread Thread-4 (_is_silero_speech):
Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/matt/.local/lib/python3.11/site-packages/RealtimeSTT/audio_recorder.py", line 1309, in _is_silero_speech
    vad_prob = self.silero_vad_model(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.jit.Error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/vad/model/vad_annotator.py", line 98, in forward
    _16 = torch.gt(torch.div(sr1, (torch.size(x2))[1]), 31.25)
    if _16:
      ops.prim.RaiseException("Input audio chunk is too short", "builtins.ValueError")
      ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    else:
      pass
Traceback of TorchScript, original code (most recent call last):
  File "/home/keras/notebook/nvme_raid/adamnsandle/silero-models-research/vad/model/vad_annotator.py", line 364, in forward
    if sr / x.shape[1] > 31.25:
        raise ValueError("Input audio chunk is too short")
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return x, sr
builtins.ValueError: Input audio chunk is too short
```
(This happens the moment I make some sound into the microphone btw)
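For what it's worth, the TorchScript check in the traceback (`sr / x.shape[1] > 31.25`) means Silero VAD rejects any chunk shorter than sr / 31.25 samples, i.e. 512 samples at 16000 Hz. Assuming a typical 48000 Hz PipeWire capture rate (an assumption, not stated in the thread), a 1024-frame buffer resampled down to 16 kHz leaves only ~341 samples, which trips exactly this check. A quick back-of-envelope sketch:

```python
# Silero VAD raises "Input audio chunk is too short" when sr / chunk_len > 31.25,
# i.e. the chunk must be at least sr / 31.25 samples long.
def min_chunk_len(sr: int) -> int:
    return int(sr / 31.25)

# Number of samples left after resampling a captured buffer down to target_sr
def resampled_len(frames: int, orig_sr: int, target_sr: int) -> int:
    return frames * target_sr // orig_sr

print(min_chunk_len(16000))               # 512 samples minimum at 16 kHz
print(resampled_len(1024, 48000, 16000))  # 341 -> too short, raises
print(resampled_len(2048, 48000, 16000))  # 682 -> long enough, passes
```

This is consistent with the update below: bumping the buffer to 2048 frames makes the error go away.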
Update:
I increased the buffer size to 2048 and now the error is gone. It doesn't "work" though. I talk into the mic and sometimes I may get a single space character returned, and once in a blue moon it may print a few random unrelated phrases. Once it even started going on about cucumbers, then thanked everyone for watching the video...?!?!
For example, here's the first few paragraphs of the Declaration of Independence:
> Yum gosh, oh shikko boshin' super true For kimmy and umm ahh, kakko, uh, uh, uh, uh, uh, uh, uh, uh, shikko, yeah shikko, uh, uh, uh, uh, uh, uh, shikko, uh, uh, shikko, uh, uh, uh, uh, shikko. you. I'm not but trying to get this high, cause I'm very bad at this. But she's so cold, I'm not too much as a bitch, I'm so cold, I'm so cold, I'm so cold, I'm so cold, I'm so cold, I'm so cold. I'm so cold, I'm so cold, I'm so cold. Shh. Oh, I'm sorry, but you don't talk. One like mmm Mmm That's smell Once it's popularity drugs Before DE un complete un un complete Bourj chuk Oh P hes Shin Mmm Ak Shin Shin Shin I forgive. No, I'm not sure. I'm not so much about the sound of the text. That's how it shows you can see how the clock happens to me. To push the clock to push the clock to me. I'm going to take this rock. I'm going to take this rock. I'm going to take this rock. Hahaha What's that? What does that mean? Ah. Okay, okay, okay, okay, okay, okay. I think I should... I can see a few... I should be... I should be... I can't... I can't...
Does not seem like faster-whisper gets clean data chunks from the recording. Also I'm not sure whether the underlying VAD models (Silero VAD and WebRTC) can handle chunks of other sizes or formats. Maybe it really does not work with sample rates other than 16000 Hz. I'm currently too involved with other projects to find time to look deeper into this.
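On the WebRTC side the constraint is in fact strict: the py-webrtcvad wrapper only accepts 16-bit mono PCM at 8000, 16000, 32000 or 48000 Hz, in frames of exactly 10, 20 or 30 ms; anything else is rejected. A small sketch of the valid frame lengths (plain arithmetic, no webrtcvad dependency; the helper is illustrative, not part of any library):

```python
# Valid inputs for WebRTC VAD: 10/20/30 ms frames at 8/16/32/48 kHz, mono int16 PCM.
VALID_RATES = (8000, 16000, 32000, 48000)
VALID_FRAME_MS = (10, 20, 30)

def frame_samples(rate: int, ms: int) -> int:
    """Samples per frame for a rate/duration combination WebRTC VAD accepts."""
    if rate not in VALID_RATES or ms not in VALID_FRAME_MS:
        raise ValueError("WebRTC VAD rejects this rate/frame combination")
    return rate * ms // 1000

for rate in VALID_RATES:
    print(rate, [frame_samples(rate, ms) for ms in VALID_FRAME_MS])
# e.g. 16000 Hz -> frames of 160, 320 or 480 samples; 44100 Hz would raise
```

So arbitrary chunk sizes really cannot simply be fed through to the VAD stage.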
I'll have to put the project on hold for now then - I know absolutely nothing about Python (and have no desire to learn it...) so can't dig into it myself. NM, it's not important, just an idea I had for my livestreams.
I have a result!
If I set the input to the virtual PIPEWIRE device (index 22 on my system), then it can use ANY sample rate, not just the one the hardware is configured for! AND IT WORKS!
To find the ID to use, I use this little script:

```python
#!/usr/bin/python
import pyaudio

p = pyaudio.PyAudio()

# List all devices
for i in range(p.get_device_count()):
    device_info = p.get_device_info_by_index(i)
    print(device_info)

# Assuming you've identified the correct device index for input, e.g., 22
device_index = 22  # Replace with the correct index for your device

# Open the selected audio input device with the correct number of channels
stream = p.open(
    format=pyaudio.paInt16,
    channels=1,  # Adjust based on the device's supported input channels
    rate=16000,
    input=True,
    input_device_index=device_index,
    frames_per_buffer=1024)
```
Then look in the output for the pipewire entry. I should script that to get the ID automatically.
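Automating that lookup could look something like this (an untested sketch; the helper function is mine, not part of pyaudio, and it matches on the device name, which shows up as "pipewire" in the listing above):

```python
def find_input_device(infos, name_substring="pipewire"):
    """Return the index of the first input-capable device whose name matches.

    `infos` is a list of dicts shaped like pyaudio's get_device_info_by_index()
    output (keys include 'index', 'name', 'maxInputChannels').
    """
    for info in infos:
        if (name_substring.lower() in info["name"].lower()
                and info["maxInputChannels"] > 0):
            return info["index"]
    return None

# With pyaudio it would be fed like this (needs a live sound system, so not run here):
#   p = pyaudio.PyAudio()
#   infos = [p.get_device_info_by_index(i) for i in range(p.get_device_count())]
#   device_index = find_input_device(infos)
```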
Thanks a lot for letting us know, very much appreciate this!
I do it for purely selfish reasons... I just know that in 6 months time I'll be googling the exact same problem again... ;) The number of times when I've gone to google something that I'm having a problem with and find an answer that I wrote ages ago is quite worrying...
OS: Arch Linux. Audio system: PipeWire (with ALSA, PulseAudio, etc. plugins).
It seems PyAudio can only open an audio device through PipeWire at PipeWire's default sample rate, not at 16000 Hz.
Would it be possible to run RealtimeSTT at a sample rate higher than 16000 Hz?