Open Irvingao opened 1 month ago
For your first idea, I think the ASR example already covers it.
I mean speech inputs with LLM outputs.
Your "text" means response, right? Though not fully understand
Exactly.
Are you talking about ASR for the speech-to-text task? If so, you can try our ASR example.
We may support speech-to-speech in the future, but this task is much more difficult than ASR or TTS alone, since it amounts to combining the two seamlessly. Thank you for your advice; we will take it into consideration.
If you have any further questions or need additional assistance, feel free to ask!
🚀 The feature, motivation and pitch
As we all know, GPT-4o is an end-to-end multimodal model that supports speech-to-text and speech-to-speech. I have some ideas about it:
Alternatives
No response
Additional context
No response