Hey guys. Really appreciate the project. I'm really new to Whisper and Python but have a fair amount of coding background in other languages. Wondering if you could provide any strategy ideas or an outline on the best way to approach the below.
I've got an existing websocket server implementation that accepts a websocket connection from Twilio.
Here is my existing websocket proof of concept: it accepts an incoming stream fine, and I can transcribe with whisper_cpp after the stream has completed. I'd like to get real-time transcription working, though, if possible.
import asyncio
import audioop  # stdlib through Python 3.12 (removed in 3.13)
import base64
import json
import logging

from fastapi import WebSocket

logger = logging.getLogger(__name__)

# 16 kHz * 2 bytes per sample = 32 bytes of 16-bit PCM per millisecond
BYTES_IN_1_MS = 32

# app, model, execute_transcription and execute_save_segment are defined
# elsewhere in the POC.

@app.websocket("/stream")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    audio_bytes_buffer = bytearray()
    ratecv_state = None  # keep the resampler state across chunks
    try:
        while True:
            message = await websocket.receive_text()
            packet = json.loads(message)
            if packet["event"] == "start":
                print("Streaming is starting")
            elif packet["event"] == "stop":
                print("\nStreaming has stopped")
                break
            elif packet["event"] == "media":
                # Per the Twilio docs the payload is base64-encoded
                # 8 kHz mu-law, so decode with b64decode, not fromhex
                audio = base64.b64decode(packet["media"]["payload"])
                audio = audioop.ulaw2lin(audio, 2)  # mu-law -> 16-bit PCM
                audio, ratecv_state = audioop.ratecv(
                    audio, 2, 1, 8000, 16000, ratecv_state)  # 8 kHz -> 16 kHz
                audio_bytes_buffer.extend(audio)
                # length of audio_bytes_buffer in seconds
                length_in_seconds = len(audio_bytes_buffer) / BYTES_IN_1_MS / 1000
                logger.info(f"audio_bytes_buffer seconds: {length_in_seconds}")
                # Schedule a background task for transcription; pass a copy so
                # the buffer can keep growing while the task runs
                asyncio.create_task(
                    execute_transcription(model, bytes(audio_bytes_buffer)))
        # SAVE COMPLETE AUDIO FILE
        filename = "99_complete_audio.wav"
        length_in_seconds = len(audio_bytes_buffer) / BYTES_IN_1_MS / 1000
        print(f"Saving complete audio: {length_in_seconds} seconds")
        asyncio.create_task(execute_save_segment(bytes(audio_bytes_buffer), filename))
    except Exception as e:
        print(f"WebSocket closed unexpectedly: {e}")
What I'm wondering is: what would be the best way to send the live streaming audio data to the server? Would it make sense to create a new websocket server to listen for incoming Twilio stream data and then send that to the TwilioClient somehow? I'm thinking of modifying the record method to handle incoming audio data instead of recording from the mic. Any feedback would be greatly appreciated.
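One idea I've been toying with, in case it helps frame the question: instead of waiting for the "stop" event, re-transcribe a rolling window of the most recent audio every few seconds of new input. This is only a sketch under assumptions — `transcribe` here is a placeholder for whatever blocking whisper_cpp binding call you use, and the window/step sizes are made up:

```python
import asyncio
from typing import Optional

SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2
WINDOW_SECONDS = 10  # transcribe at most the last 10 s of audio
STEP_SECONDS = 3     # fire a transcription every ~3 s of new audio


class RollingTranscriber:
    """Accumulate 16 kHz 16-bit mono PCM and periodically transcribe
    the tail of the buffer via a user-supplied blocking function."""

    def __init__(self, transcribe):
        self.transcribe = transcribe  # blocking fn: pcm bytes -> text
        self.buffer = bytearray()
        self.bytes_since_last = 0
        self.step_bytes = STEP_SECONDS * SAMPLE_RATE * BYTES_PER_SAMPLE
        self.window_bytes = WINDOW_SECONDS * SAMPLE_RATE * BYTES_PER_SAMPLE

    async def feed(self, pcm_chunk: bytes) -> Optional[str]:
        """Add a PCM chunk; return interim text when a step completes."""
        self.buffer.extend(pcm_chunk)
        self.bytes_since_last += len(pcm_chunk)
        if self.bytes_since_last < self.step_bytes:
            return None
        self.bytes_since_last = 0
        window = bytes(self.buffer[-self.window_bytes:])
        # run the blocking whisper call off the event loop
        return await asyncio.get_running_loop().run_in_executor(
            None, self.transcribe, window)
```

In the "media" branch of the handler above, this would be `text = await rolling.feed(audio)` after the resampling step, with the interim `text` replacing the previous interim result each time it fires. No idea if that's the intended pattern for this project, though.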
The websocket media messages look like this:
source: https://www.twilio.com/docs/voice/twiml/stream#websocket-messages-from-twilio
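For anyone following along, a media message has roughly the shape below per those docs (the streamSid and payload values here are made up for illustration); note the payload is base64-encoded mu-law, not hex:

```python
import base64
import json

# Illustrative media message, shaped per the Twilio Media Streams docs;
# the streamSid and payload are dummy values, not real stream data.
message = json.dumps({
    "event": "media",
    "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "media": {
        "track": "inbound",
        "chunk": "2",
        "timestamp": "5",
        # 160 bytes of mu-law = 20 ms of audio at 8 kHz
        "payload": base64.b64encode(b"\xff" * 160).decode("ascii"),
    },
})

packet = json.loads(message)
mulaw = base64.b64decode(packet["media"]["payload"])  # raw 8 kHz mu-law bytes
print(len(mulaw))  # 160
```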
cheers!