Rethink text-to-speech (voice memo) ordering

Right now, when text-to-speech voice memos are enabled, the user has to wait until the ChatGPT text response has been round-tripped through Azure's text-to-speech API before they can read any text at all.

This adds about 2-6 seconds per response. Sometimes the Azure API inexplicably hangs for 10+ seconds before finally responding (TODO: race it against an 8-second timeout or something).
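
A minimal sketch of that TODO, assuming the text-to-speech call is exposed as a promise. `withTimeout` is a hypothetical helper name, not something in the bot today. It only stops waiting; it does not cancel the underlying HTTP request.

```typescript
// Hypothetical helper: settles with `promise` if it finishes within `ms`,
// otherwise rejects with a timeout error. The underlying work is NOT
// cancelled; we simply stop waiting for it.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  try {
    // Whichever settles first wins the race.
    return await Promise.race([promise, timeout]);
  } finally {
    // Clear the timer so it does not keep the process alive.
    clearTimeout(timer);
  }
}
```

Usage would look something like `await withTimeout(synthesizeSpeech(text), 8000)`, where `synthesizeSpeech` stands in for whatever function wraps the Azure call. On timeout the bot could fall back to sending text only.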

While I like the idea of responding with a voice memo that chains into the text response, I don't want to hang the UX for 2-6 seconds.

Ideas:

- Send the audio last: This would let me send the text as soon as it's ready. I guess it could be jarring if the voice memo comes in a while later, after the user has sent additional messages.
- Update the first message to add audio: I don't think this is possible since a regular text message (sendMessage) and a voice message (sendVoice) are two different concepts in Telegram.
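
A rough sketch of the "send the audio last" idea. The `Deps` interface and `respond` function are hypothetical names invented for illustration; a real bot would back them with Telegram's sendMessage and sendVoice and the Azure call. Sending the voice memo as a reply to the text message is my own assumption about one way to soften the "jarring" problem, since the memo would then stay visually tied to its text even if other messages arrive in between.

```typescript
// Hypothetical transport functions, injected so the sketch stays self-contained.
interface Deps {
  // Resolves to the id of the text message that was sent.
  sendText(chatId: number, text: string): Promise<number>;
  // Resolves to the synthesized audio bytes.
  synthesize(text: string): Promise<Uint8Array>;
  // Sends the audio as a reply to an earlier message.
  sendVoice(chatId: number, audio: Uint8Array, replyToMessageId: number): Promise<void>;
}

async function respond(deps: Deps, chatId: number, text: string): Promise<void> {
  // Start synthesis right away so it overlaps with sending the text.
  // A synthesis failure becomes `null` instead of an unhandled rejection.
  const audioPromise = deps.synthesize(text).catch(() => null);

  // The user can read the text without waiting for the audio.
  const messageId = await deps.sendText(chatId, text);

  const audio = await audioPromise;
  if (audio === null) return; // No audio: the text alone is the response.

  // Attach the voice memo to the text it belongs to.
  await deps.sendVoice(chatId, audio, messageId);
}
```

With this ordering, a slow or failed text-to-speech call costs the user nothing: the text has already arrived, and the voice memo is a best-effort follow-up.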

danneu / telegram-chatgpt-bot, issue #1