Uneducated guess:
chunk_frames=np.frombuffer(b''.join(frames), dtype=np.float32) / 32768.0
No, I would have been surprised if that worked ... did you try it yourself? Both code samples should work out of the box, with the second one producing OK results, but the JavaScript/Python-backend combination only ever recognizes "Bye", although that is never said. It seems like that ...
Try to save audio to a file and look at the differences.
...did you try it on your own?
Of course not. Uneducated guess - An arbitrary guess with no particular reasoning behind it.
Try to save audio to a file and look at the differences.
That's tricky, as I use the microphone and all I get is a byte stream of float32 data. It is apparently hard to find a solution that streams audio from a web page to a Python backend via websockets (or anything else like SignalR, etc.). Either I'm blind/too stupid for Google searches, or there is literally not a single example (based on non-deprecated libs/functions) on the internet. I even tried ChatGPT, but also - nothing ...
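For reference, one way to follow the "save it to a file" suggestion, as a minimal sketch: it assumes the backend collects the incoming websocket messages as raw bytes in a list called frames, and the file name and 16000 Hz rate are placeholders to be adjusted.

import numpy as np
import soundfile as sf

# frames: list of raw byte chunks received over the websocket (assumed)
sig = np.frombuffer(b''.join(frames), dtype=np.float32)

# Write a WAV file for manual inspection and compare it with a pyaudio recording.
# 16000 Hz is an assumption - use whatever rate the AudioContext actually delivers.
sf.write("debug_from_browser.wav", sig, 16000, format="wav")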
Please let me know if you find a solution. I have the same problem.
@ewagner70 the reason behind 32768.0 is to normalize the array values between -1 and 1; it's part of the specification of how audio is represented as a numerical array.
The values in the Float32Array are already between -1 and +1, and they also arrive at the Python backend as such ... Is it a different encoding (big/little endian, a PCM codec, ...)?
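As a side note, the distinction under discussion in a small, purely illustrative sketch (frames stands for the raw bytes received from the browser or from pyaudio, respectively):

import numpy as np

# Case 1: the browser sends Float32Array data - the samples are already in [-1, 1]
float_samples = np.frombuffer(b''.join(frames), dtype=np.float32)

# Case 2: pyaudio typically captures 16-bit PCM (paInt16) - only then is /32768.0 needed
int_samples = np.frombuffer(b''.join(frames), dtype=np.int16).astype(np.float32) / 32768.0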
Hi @ewagner70, do you have updates?
@anbzerc: unfortunately no update, as I am at my wit's end ... even the faster_whisper folks obviously don't know what the difference between the JavaScript and Python libs is ... it would be really helpful if one of the colleagues provided sample code where JavaScript picks up the audio chunks and transfers them to a Python backend for faster_whisper transcription ... the fact that obviously no one (?!) can resolve this is blocking many use cases where more than one person is supposed to use such a solution.
:sob: Thanks for the answer.
@ewagner70 @anbzerc You can try using WebRTC with aiortc on the Python backend. aiortc handles the conversion of raw audio packets to av.AudioFrame (from PyAV). With WebSockets you need to handle this conversion yourself, which means knowing about codecs, bitrates and so on.
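To illustrate what that could look like, a rough sketch only: the WebRTC signaling is omitted, and helper names such as consume_audio are made up - see the aiortc server example for the complete setup.

import asyncio
import numpy as np
from aiortc import RTCPeerConnection
from av.audio.resampler import AudioResampler

pc = RTCPeerConnection()

@pc.on("track")
def on_track(track):
    if track.kind == "audio":
        asyncio.ensure_future(consume_audio(track))

async def consume_audio(track):
    # Resample whatever the browser sends (usually 48 kHz Opus) to 16 kHz mono int16
    resampler = AudioResampler(format="s16", layout="mono", rate=16000)
    while True:
        frame = await track.recv()           # av.AudioFrame decoded by aiortc
        for f in resampler.resample(frame):  # recent PyAV returns a list of frames
            pcm = f.to_ndarray().flatten().astype(np.float32) / 32768.0
            # pcm is now float32 at 16 kHz and can be buffered for faster_whisper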
@Spiritcow: I am not struggling with the data transfer (that part works). I am struggling with the conversion - that's exactly what I'm looking for, as the data format is described nowhere ... Do you have any pointers on that as well?
@anbzerc: Did you make any progress?
Not yet unfortunately :cry:
This example should be interesting: https://github.com/aiortc/aiortc/tree/main/examples/server - I'll try it as soon as possible.
@anbzerc: this example uses an ICE server ... if you solve it without ICE, directly via websocket or similar - let us all know!
@ewagner70 That's why I propose using WebRTC with aiortc: it does the conversion for you. If you want to use plain sockets, you will have to learn about codecs and audio formats.
@ewagner70 if you save the audio to a file on disk and pass it to the model, the transcription is OK:
import numpy as np
import soundfile as sf

# `frames` and `model` are assumed to exist from the surrounding code
fname = r"C:\test.wav"
sig = np.frombuffer(b''.join(frames), dtype=np.float32)
sf.write(fname, sig, 16000, format="wav")
segments, info = model.transcribe(fname, language="en")
text = " ".join([segment.text.strip() for segment in segments])
can you try:
import io
import soundfile as sf

f = io.BytesIO()
sf.write(f, sig, 16000, format="wav")  # format must be given explicitly for an in-memory file
f.seek(0)
segments, _ = model.transcribe(f, language="en")
text = " ".join([segment.text.strip() for segment in segments])
f.close()
Dear @ldolegowski92, I know that saving as a WAV file works (the 44-byte WAV header is the only difference between the raw audio chunks and a WAV file).
However, my intention was to use one of the two other transcribe input formats (BinaryIO or ndarray) to feed the received audio chunks directly to transcribe ...
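A sketch of that route, assuming the joined chunks really are mono float32 at 16 kHz by the time they reach the backend; transcribe also accepts a NumPy array, so no WAV header is required:

import numpy as np

# audio must be mono float32 at 16 kHz for the transcription to come out right
audio = np.frombuffer(b''.join(frames), dtype=np.float32)
segments, info = model.transcribe(audio, language="en")
text = " ".join(segment.text.strip() for segment in segments)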
Greetings,
My guess is that the data obtained from the microphone is 44100 or 48000 Hz, while the model expects 16000 Hz, so you got strange output. pyaudio presumably resamples the captured data to 16 kHz, which is why his program runs smoothly.
Thank you, @oyang886 - finally, after 6 months, someone who solved it. I merely had to adapt my HTML with the following downsampling code:
// Create an AudioContext object
var audioContext = new AudioContext();
// Browser capture rate (typically 44100 or 48000 Hz) and the target rate for the model
var fromSampleRate = audioContext.sampleRate;
var toSampleRate = 16000;
// Create a ScriptProcessorNode object with a buffer size of 4096 and one input and one output channel
var processor = audioContext.createScriptProcessor(4096, 1, 1);
// Define a function that will be called when the processor has audio data available
processor.onaudioprocess = function (event) {
    // Get the input audio data as a Float32Array
    var input = event.inputBuffer.getChannelData(0);
    // Downsample to 16 kHz and send the audio data as a binary message to the server
    socket.send(downsample(input, fromSampleRate, toSampleRate));
};
and then simply add another function:
function downsample(buffer, fromSampleRate, toSampleRate) {
    // buffer is a Float32Array; reduce the sample rate by averaging groups of samples
    var sampleRateRatio = Math.round(fromSampleRate / toSampleRate);
    var newLength = Math.round(buffer.length / sampleRateRatio);
    var result = new Float32Array(newLength);
    var offsetResult = 0;
    var offsetBuffer = 0;
    while (offsetResult < result.length) {
        var nextOffsetBuffer = Math.round((offsetResult + 1) * sampleRateRatio);
        // Average all source samples that fall into this output slot
        var accum = 0, count = 0;
        for (var i = offsetBuffer; i < nextOffsetBuffer && i < buffer.length; i++) {
            accum += buffer[i];
            count++;
        }
        result[offsetResult] = accum / count;
        offsetResult++;
        offsetBuffer = nextOffsetBuffer;
    }
    return result;
}
Can be closed now ... the documentation is really sub-par and nothing for the faint-hearted ...
I would like to use the microphone from the web browser and send the audio chunks in real time to a Python backend. The data (Float32Arrays) is sent, but seems to differ from what pyaudio produces.
here is sample code for the web frontend
here is the corresponding Python backend for faster_whisper
as a comparison, the following standalone Python code with pyaudio works as expected ... so I don't really know why the data streamed from the JavaScript frontend isn't properly recognized (no errors are generated, btw)