Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

App chat response is very slow #1430

Open · anonymous-program opened this issue 3 months ago

anonymous-program commented 3 months ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Write a question in the web app chat

Any log messages given by the failure

None

Expected/desired behavior

We expect a seamless, fluid response with speed comparable to a typical LLM chat interface or GenAI application.

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

azd version?

run azd version and copy paste here.

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

anonymous-program commented 3 months ago

I would like to understand which service needs to improve to make the chat bot respond faster. Also, if there is a code file to edit or change, please mention it in a comment.

pamelafox commented 3 months ago

The speed of the response depends on the speed of the various services involved. For the chat tab, a response involves four steps (sketched in code after this list):

  1. Call to the ChatCompletion API to turn the user query into a keyword query
  2. Call to the embedding API to turn the keyword query into a vector
  3. Call to the search API to get results matching the query and vector
  4. Call to the ChatCompletion API to answer the question using the results

The Ask tab only performs the last three steps, so it may be faster.
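Here is a minimal standalone sketch of those four steps, with rough timing printed per step so you can see where the time goes. It assumes this sample's default index name and fields (`gptkbindex`, `content`, `embedding`), hypothetical deployment names (`chat`, `embedding`), and hypothetical environment variable names; it is not the app's actual code path, which lives under `app/backend/approaches/`.

```python
import os
import time

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

openai_client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # hypothetical env var names
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)
search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="gptkbindex",  # this sample's default index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

def timed(label, fn):
    """Run fn(), print how long it took, and return its result."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

question = "What is included in my health plan?"

# Step 1: rewrite the chat question into a keyword search query.
keyword_query = timed("1. query rewrite", lambda: openai_client.chat.completions.create(
    model="chat",  # hypothetical chat deployment name
    messages=[
        {"role": "system", "content": "Rewrite the user question as a short keyword search query."},
        {"role": "user", "content": question},
    ],
).choices[0].message.content)

# Step 2: embed the keyword query for vector search.
vector = timed("2. embedding", lambda: openai_client.embeddings.create(
    model="embedding",  # hypothetical embedding deployment name
    input=keyword_query,
).data[0].embedding)

# Step 3: hybrid search (keywords + vector) against the index.
docs = timed("3. search", lambda: list(search_client.search(
    search_text=keyword_query,
    vector_queries=[VectorizedQuery(vector=vector, k_nearest_neighbors=3, fields="embedding")],
    top=3,
)))
sources = "\n".join(doc["content"] for doc in docs)

# Step 4: answer the original question, grounded in the retrieved sources.
answer = timed("4. answer", lambda: openai_client.chat.completions.create(
    model="chat",
    messages=[
        {"role": "system", "content": "Answer the question using only these sources:\n" + sources},
        {"role": "user", "content": question},
    ],
).choices[0].message.content)
print(answer)
```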

You can use Azure Monitor to see the performance of those steps. See https://github.com/Azure-Samples/azure-search-openai-demo/?tab=readme-ov-file#monitoring-with-application-insights
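(If your deployment didn't include Application Insights, the linked README section describes enabling it with `azd env set AZURE_USE_APPLICATION_INSIGHTS true` followed by `azd up`.)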

You will likely find that the final step is the slowest, since LLMs require a lot of computing power. Azure does not give latency guarantees for "Pay-as-you-go" subscriptions, but it does for PTUs (Provisioned Throughput Units), so that is what many customers use. If PTUs are beyond your budget, you'll need to try other ways of reducing the time taken, like using the simpler "Ask" tab or streaming the response. If other LLM applications are faster, they may be using a dedicated GPU for reduced latency.
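Streaming is one way to cut perceived latency: total generation time is unchanged, but tokens render as they arrive instead of after the whole answer is done. A minimal sketch with the `openai` Python client, reusing the `openai_client` and hypothetical `chat` deployment from the sketch above:

```python
# Stream the answer step so tokens print as they arrive; this improves
# perceived latency even though total generation time is the same.
stream = openai_client.chat.completions.create(
    model="chat",  # hypothetical chat deployment name
    messages=[{"role": "user", "content": "What is included in my health plan?"}],
    stream=True,
)
for chunk in stream:
    # Some chunks (e.g., content-filter results on Azure) carry no content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```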

Also note that you'll see different latency for different models (GPT-4 is slower than GPT-3.5) and for different regions, so you can experiment with both.

anonymous-program commented 3 months ago

Thank you for the guidance.