Twilio Voice to Text Implementation?

Hey guys, really appreciate the project. I'm new to Whisper and Python but have a fair amount of coding experience in other languages. I'm wondering if you could suggest a strategy or an outline for the best way to approach the problem below.

I've got an existing websocket server implementation that accepts a websocket connection from Twilio.

The websocket media messages look like this:

{ 
 "event": "media",
 "sequenceNumber": "4",
 "media": { 
   "track": "inbound", 
   "chunk": "2", 
   "timestamp": "5",
   "payload": "no+JhoaJjpzSHxAKBgYJDhtEopGKh4aIjZm7JhILBwYIDRg1qZSLh4aIjJevLBUMBwYHDBUsr5eMiIaHi5SpNRgNCAYHCxImu5mNiIaHipGiRBsOCQYGChAf0pyOiYaGiY+e/x4PCQYGCQ4cUp+QioaGiY6bxCIRCgcGCA0ZO6aSi4eGiI2YtSkUCwcGCAwXL6yVjIeGh4yVrC8XDAgGBwsUKbWYjYiGh4uSpjsZDQgGBwoRIsSbjomGhoqQn1IcDgkGBgkPHv+ej4mGhomOnNIfEAoGBgkOG0SikYqHhoiNmbsmEgsHBggNGDWplIuHhoiMl68sFQwHBgcMFSyvl4yIhoeLlKk1GA0IBgcLEia7mY2IhoeKkaJEGw4JBgYKEB/SnI6JhoaJj57/Hg8JBgYJDhxSn5CKhoaJjpvEIhEKBwYIDRk7ppKLh4aIjZi1KRQLBwYIDBcvrJWMh4aHjJWsLxcMCAYHCxQptZiNiIaHi5KmOxkNCAYHChEixJuOiYaGipCfUhwOCQYGCQ8e/56PiYaGiY6c0h8QCgYGCQ4bRKKRioeGiI2ZuyYSCwcGCA0YNamUi4eGiIyXrywVDAcGBwwVLK+XjIiGh4uUqTUYDQgGBwsSJruZjYiGh4qRokQbDgkGBgoQH9KcjomGhomPnv8eDwkGBgkOHFKfkIqGhomOm8QiEQoHBggNGTumkouHhoiNmLUpFAsHBggMFy+slYyHhoeMlawvFwwIBgcLFCm1mI2IhoeLkqY7GQ0IBgcKESLEm46JhoaKkJ9SHA4JBgYJDx7/no+JhoaJjpzSHxAKBgYJDhtEopGKh4aIjZm7JhILBwYIDRg1qZSLh4aIjJevLBUMBwYHDBUsr5eMiIaHi5SpNRgNCAYHCxImu5mNiIaHipGiRBsOCQYGChAf0pyOiYaGiY+e/x4PCQYGCQ4cUp+QioaGiY6bxCIRCgcGCA0ZO6aSi4eGiI2YtSkUCwcGCAwXL6yVjIeGh4yVrC8XDAgGBwsUKbWYjYiGh4uSpjsZDQgGBwoRIsSbjomGhoqQn1IcDgkGBgkPHv+ej4mGhomOnNIfEAoGBgkOG0SikYqHhoiNmbsmEgsHBggNGDWplIuHhoiMl68sFQwHBgcMFSyvl4yIhoeLlKk1GA0IBgcLEia7mY2IhoeKkaJEGw4JBgYKEB/SnI6JhoaJj57/Hg8JBgYJDhxSn5CKhoaJjpvEIhEKBwYIDRk7ppKLh4aIjZi1KRQLBwYIDBcvrJWMh4aHjJWsLxcMCAYHCxQptZiNiIaHi5KmOxkNCAYHChEixJuOiYaGipCfUhwOCQYGCQ8e/56PiYaGiY6c0h8QCgYGCQ4bRKKRioeGiA=="                        
 },
 "streamSid": "MZ18ad3ab5a668481ce02b83e7395059f0"
}

source: https://www.twilio.com/docs/voice/twiml/stream#websocket-messages-from-twilio
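
For reference, here is a minimal sketch of unpacking one such media message, assuming (as the Twilio docs linked above describe) that the payload is base64-encoded 8 kHz mu-law audio. The helper names `ulaw_byte_to_pcm16` and `decode_media_payload` are made up for illustration, and the mu-law decoder is written out by hand so the sketch doesn't depend on `audioop`.

```python
import base64
import json
import struct


def ulaw_byte_to_pcm16(u: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~u & 0xFF                      # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_media_payload(message: str) -> bytes:
    """Return little-endian 16-bit PCM (still at 8 kHz) for a "media" message.

    Any other event type yields empty bytes.
    """
    packet = json.loads(message)
    if packet.get("event") != "media":
        return b""
    ulaw = base64.b64decode(packet["media"]["payload"])
    samples = [ulaw_byte_to_pcm16(b) for b in ulaw]
    return struct.pack(f"<{len(samples)}h", *samples)


# Hypothetical 4-byte payload, just to exercise the decoder.
payload = base64.b64encode(bytes([0xFF, 0x7F, 0x00, 0x80])).decode("ascii")
example = json.dumps({"event": "media", "media": {"payload": payload}})
pcm = decode_media_payload(example)
print(struct.unpack("<4h", pcm))  # (0, 0, -32124, 32124)
```

The result is still 8 kHz audio, so it would need resampling to 16 kHz before going to Whisper.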

Here is my existing websocket proof of concept. It accepts an incoming stream fine, and I can transcribe using whisper_cpp after the stream has completed. I'd like to get real-time transcription working, though, if possible.


import asyncio
import audioop
import base64
import json

# app, WebSocket, logger, model, audio_buffer, BYTES_IN_1_MS,
# execute_transcription and execute_save_segment are defined elsewhere.


@app.websocket("/stream")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    audio_bytes_buffer = bytearray()
    ratecv_state = None  # resampler state, carried across chunks
    try:
        while True:
            message = await websocket.receive_text()
            packet = json.loads(message)
            if packet["event"] == "start":
                print("Streaming is starting")
            elif packet["event"] == "stop":
                print("\nStreaming has stopped")
                # global accumulated_audio, accumulated_frames
                # accumulated_audio = bytearray()  # Reset accumulated_audio
                # accumulated_frames = []  # Reset accumulated_frames
                break
            elif packet["event"] == "media":
                # The payload is base64-encoded 8 kHz mu-law audio (not hex)
                audio = base64.b64decode(packet["media"]["payload"])
                # mu-law -> 16-bit linear PCM
                audio = audioop.ulaw2lin(audio, 2)
                # Resample 8 kHz -> 16 kHz, reusing the state from the previous chunk
                audio, ratecv_state = audioop.ratecv(
                    audio, 2, 1, 8000, 16000, ratecv_state
                )
                audio_bytes_buffer.extend(audio)

                # Append the processed audio to the audio buffer for asynchronous processing
                audio_buffer.append(audio)

        # length of audio_bytes_buffer in seconds
        length_in_seconds = len(audio_bytes_buffer) / BYTES_IN_1_MS / 1000
        logger.info(f"audio_bytes_buffer seconds: {length_in_seconds}")

        # Schedule background task for transcription
        asyncio.create_task(execute_transcription(model, audio_bytes_buffer))

        # SAVE COMPLETE AUDIO FILE
        filename = "99_complete_audio.wav"
        print(f"Saving {filename} seconds: {length_in_seconds}")
        asyncio.create_task(execute_save_segment(audio_bytes_buffer, filename))

    except Exception as e:
        print(f"WebSocket closed unexpectedly: {e}")

What I'm wondering is: what would be the best way to send the live streaming audio data to the server? Would it make sense to create a new websocket server that listens for incoming Twilio stream data and then sends it to the TwilioClient somehow? I'm thinking of modifying the record method to handle incoming audio data instead of recording from the mic. Any feedback would be greatly appreciated.
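
To make the idea concrete, here is a rough sketch of how incoming audio could be cut into fixed-length chunks as it arrives, instead of waiting for the stream to finish. This is not WhisperLive's API: `PcmChunker`, the 1-second chunk length, and the 20 ms packet size are all assumptions made for illustration, and the place where a chunk would be handed to the transcription server is only marked with a comment.

```python
from typing import List

SAMPLE_RATE = 16000     # Hz, i.e. after resampling Twilio's 8 kHz audio
BYTES_PER_SAMPLE = 2    # 16-bit mono PCM


class PcmChunker:
    """Accumulates PCM bytes and returns fixed-length chunks as they fill up."""

    def __init__(self, chunk_seconds: float = 1.0) -> None:
        self.chunk_bytes = int(SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_seconds)
        self._buffer = bytearray()

    def feed(self, pcm: bytes) -> List[bytes]:
        """Add new audio and return every complete chunk now available."""
        self._buffer.extend(pcm)
        chunks = []
        while len(self._buffer) >= self.chunk_bytes:
            chunks.append(bytes(self._buffer[: self.chunk_bytes]))
            del self._buffer[: self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return whatever is left (e.g. on the "stop" event) and reset."""
        rest = bytes(self._buffer)
        self._buffer.clear()
        return rest


chunker = PcmChunker(chunk_seconds=1.0)
packet_pcm = bytes(640)  # 20 ms of silence at 16 kHz, 16-bit (assumed packet size)
ready = []
for _ in range(120):     # simulate 2.4 s of incoming packets
    for chunk in chunker.feed(packet_pcm):
        ready.append(chunk)  # a real handler would send this chunk on for transcription
leftover = chunker.flush()
print(len(ready), len(leftover))  # 2 12800
```

In the websocket handler, `feed` would be called in the "media" branch and `flush` in the "stop" branch, so transcription can start while the call is still in progress.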

cheers!

collabora / WhisperLive
