Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Text to Speech Avatar WebRTC Api Need for More Capabilities with Underlying LLM #2452

Open xtianus79 opened 5 days ago

xtianus79 commented 5 days ago

Is your feature request related to a problem? Please describe.
Currently, the LLM is tightly controlled by a backend service that returns through the underlying WebRTC service. Because of this, there is no way to add advanced capabilities that would otherwise be possible. For example, the returned API payload is geared toward safety measures (a safe true/false flag) and then a delta > content stream of the returning bot text.

Describe the solution you'd like
If I had control of the underlying API definition, I could return other metrics or insights in the response, for two purposes:

  1. Additional metrics beyond the safety object.
  2. Additional output JSON for things such as topic categorization and placement order within the conversational flow, i.e., if I ask for 5 things and the LLM is on the 3rd, I would know as much and could make downstream decisions.

As another capability, I would be able to retrieve, and thus understand, information on top of the speaker's conversation that is not available from the speaker's own STT, running alongside the user conversation. The purpose would be to include additional context or to ensure the LLM's response holds to a specific standard for a specific use case, e.g., "in this request, if the person asks for ice cream, make sure to reply that we have currently run out of chocolate." Or you could run an NLP check to make sure the person is actually responding to the question the LLM put forth, so if they veer off you could make a deterministic decision before hitting the LLM. Today you could attempt this, but the control issues mentioned above prevent it from being practical.
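For concreteness, here is a sketch of the kind of enriched per-turn payload being asked for. The extra field names are illustrative only, not part of any existing API; today the service returns roughly the safety flag plus the delta/content of the bot text.

```python
# Hypothetical enriched response payload -- the keys below "requested additions"
# are illustrative, not part of the current avatar chat API.
enriched_response = {
    "safe": True,                                        # existing safety signal
    "delta": {"content": "Sure, the next step is..."},   # existing bot text delta
    # --- requested additions ---
    "topic": "order_status",            # topic categorization of this turn
    "step_index": 3,                    # e.g. "3rd of 5 requested tasks"
    "answered_question": True,          # did the user answer the bot's question?
    "custom": {"inventory_note": "chocolate ice cream out of stock"},
}
```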

Describe alternatives you've considered
I could run a sidecar LLM to achieve the same effect, but that would be quite costly, would increase latency, and seems unnecessary.

Additional context
Using model 3.5 achieves pretty good results. I would expect 4o to have similar results, but it is nowhere near as tuned or as low-latency as 3.5.

I would love to help with this more directly if possible. Do you have TAP webcasts for AI that we could attend? Those were pretty informative and helpful when working with the dev teams directly.

xtianus79 commented 23 hours ago

@HenryvanderVegte Hi are you able to help with this?

yinhew commented 2 hours ago

Hi, @xtianus79

Our TTS Avatar service and Azure OpenAI (LLM) service are different services. They are decoupled. Therefore, using TTS avatar won't limit the capability of Azure OpenAI.

The steps are:

  1. STT (speaker speech -> speaker text)
  2. LLM (speaker text -> response text)
  3. TTS Avatar (response text -> TTS avatar response video/speech)

You can insert any customized logic between steps 1 and 2, and between steps 2 and 3.
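A minimal sketch of running the three stages yourself with custom logic in between, assuming the Python Speech SDK (azure-cognitiveservices-speech) and the openai client; a plain SpeechSynthesizer stands in here for the WebRTC avatar synthesizer, which consumes text in the same way. Keys, region, and model deployment are placeholders.

```python
import azure.cognitiveservices.speech as speechsdk
from openai import OpenAI

speech_config = speechsdk.SpeechConfig(subscription="<speech-key>", region="<region>")
client = OpenAI(api_key="<openai-key>")  # or an Azure OpenAI client for your deployment

# 1. STT: speaker speech -> speaker text (default microphone)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
user_text = recognizer.recognize_once().text

# Custom logic between 1 and 2: inject context, check topic, enforce rules, etc.
system_prompt = "You are a store assistant. Chocolate ice cream is out of stock."

# 2. LLM: speaker text -> response text
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "system", "content": system_prompt},
              {"role": "user", "content": user_text}],
)
response_text = completion.choices[0].message.content

# Custom logic between 2 and 3: inspect/annotate the response before speaking it.

# 3. TTS: response text -> audio (swap in the avatar synthesizer for video)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
synthesizer.speak_text_async(response_text).get()
```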

Thanks, Yinhe

xtianus79 commented 5 minutes ago

@yinhew hi! Thanks for the reply. The issue is with step 2 and the ability to control or customize it. The reason is that step 2 is very "curated" in its responses and structure: you can't alter anything about what the LLM produces. For example, I've done quick experiments to try to alter the payload and it isn't possible. You get a generic response, but not a fully controllable analysis and response customization.

Would you be open to allowing some customization of that response? For example, I would like a mapping to a "type of response", "was the response an answer", "was the answer acceptable", etc. Some additional metadata that I can act on in my conversational flow would be very useful.
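When calling the chat model directly, this kind of per-turn metadata can be obtained today with JSON mode, as in the sketch below; the field names are illustrative. The request here is essentially for the avatar chat API to expose the same level of control instead of the fixed curated payload.

```python
# Sketch: ask the model for the spoken reply plus conversational-flow metadata.
# Field names (type_of_response, was_answer, answer_acceptable) are illustrative.
import json
from openai import OpenAI

client = OpenAI(api_key="<openai-key>")

completion = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": (
            "Reply with JSON containing: reply (text to speak), "
            "type_of_response, was_answer (bool), answer_acceptable (bool).")},
        {"role": "user", "content": "Do you have chocolate ice cream?"},
    ],
)

turn = json.loads(completion.choices[0].message.content)
speak_text = turn["reply"]                                   # goes to the avatar
metadata = {k: v for k, v in turn.items() if k != "reply"}   # drives downstream decisions
```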