Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License
5.81k stars 3.94k forks

Website dies randomly when asking questions #272

Closed kikaragyozov closed 5 months ago

kikaragyozov commented 1 year ago

I've deployed the project to Azure as instructed, using azd up. I did not reuse any existing Azure resources; everything was created by the scripts supplied in the repository.

What's going on? Inspecting the live log stream of the backend application, I don't see anything unusual. But the Azure website stops responding at certain points in time, specifically after asking a question in chat. I'm not sure what's happening behind the scenes.

tickx-cegeka commented 1 year ago

I have the same problem today! I did not see this behavior before.

I cannot find any regularity in this. It is not necessarily when spamming the chatbot (it can also happen on the 4th question of the day), and it's not tied to any particularly large or unusual requests. I can't reproduce it on purpose, but it happened several times today.

pamelafox commented 1 year ago

Hmm. Have you tried opening the Network console in the browser to see if an HTTP request is going through to the backend? A successful request looks like:

[Screenshot 2023-06-05 at 3.03.20 PM]
kikaragyozov commented 1 year ago

> Hmm. Have you tried opening the Network console in the browser to see if an HTTP request is going through to the backend? A successful request looks like: [screenshot]

After inspecting the HTTP server errors, I noticed a few 504 (Gateway Timeout) and 500 errors, but I can't find any information about why they were produced.

Yes, many of the requests to the /chat endpoint end with an HTTP status code of 200 (OK). But there comes a time where either:

Could anyone provide more insights as to what's happening? Or any possible workarounds to this?

mbrenigjones commented 1 year ago

Hi @kikaragyozov. After I deployed the template I saw somewhat similar issues: simultaneous requests took a long time to return, and then returned in sequence as if only one thread were serving them. I didn't see the 500 or 504 errors you mention.

I configured diagnostic settings on the App Service to send logs to a Log Analytics workspace [1]. In those logs I found that when the App Service starts up, gunicorn (the WSGI server App Service uses to serve Python apps) uses synchronous workers and starts only one worker:

[screenshot of the gunicorn startup log]

I'm still learning about App Service, but if you're seeing the same as me, it may be possible to change the gunicorn configuration [2], and there's also the option of scaling up the App Service plan.

Finally, you might want to add timeouts to API calls, e.g. by adding a request_timeout parameter to the openai.Completion.create() calls within chatreadretrieveread.py [3].
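The openai 0.x SDK accepts request_timeout directly on Completion.create(). Since we can't call the live API here, this stdlib sketch shows the same fail-fast pattern the suggestion relies on: bound a blocking call with a deadline so it cannot hang a gunicorn worker indefinitely (the function names are illustrative, not from the repo).

```python
# Generic timeout guard: run a blocking call in a worker thread and give up
# after a deadline instead of letting the request hang forever.
import concurrent.futures
import time

def call_with_timeout(fn, timeout_seconds, *args, **kwargs):
    """Run fn(*args, **kwargs); raise TimeoutError if it misses the deadline."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args, **kwargs)
        return future.result(timeout=timeout_seconds)

def slow_backend(delay_seconds):
    # Stand-in for a call out to the OpenAI API.
    time.sleep(delay_seconds)
    return "answer"

print(call_with_timeout(slow_backend, 1.0, 0.1))  # prints "answer"
try:
    call_with_timeout(slow_backend, 0.2, 1.0)     # misses the 0.2 s deadline
except concurrent.futures.TimeoutError:
    print("timed out")  # the caller can return an error instead of hanging
```

With request_timeout, the SDK does the equivalent internally and raises a timeout error you can catch in the /chat handler.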

tickx-cegeka commented 1 year ago

After quite some testing, I can't seem to reproduce the error again today. Edit: later that day I did encounter the issue again, unfortunately. Can this be investigated, please?

I did also have one very slow request of >2 minutes. In the log stream of the web app, I could see this: [screenshot of the log stream]

Along with some other requests to assets. I don't know if these are correlated, but maybe this leads to another discovery.

kikaragyozov commented 1 year ago

> Hi @kikaragyozov, After I deployed the template I saw somewhat similar issues: simultaneous requests took a long time to return, and then returned in sequence as if there was only one thread serving them. [...] Finally, you might want to add timeouts to API calls. E.g. adding a request_timeout parameter to the openai.Completion.create() calls within chatreadretrieveread.py [3]

You just described what I've been dealing with!

Is it possible to use asynchronous workers, i.e. have multiple threads serve requests without blocking on I/O?

mbrenigjones commented 1 year ago

Hi @kikaragyozov, glad that's useful.

As this repository is just a simple demo, I can understand why it's not async, but I've found you can adapt it very easily.

First, I think this article about Gunicorn is helpful for understanding the options: https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7

It was pretty easy to:

  1. Add gevent as a requirement in app/backend/requirements.txt and redeploy the app. I'm using gevent==22.10.2, and no other code changes were required.

  2. Provide a custom startup command for the App Service and then restart it. Here's what I'm using (still on the smallest, B1, instance):

gunicorn --bind=0.0.0.0 --timeout=600 --worker-class=gevent --worker-connections=1000 --workers=3 app:app

[screenshot of the startup command configuration]

Hope that helps you too.
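The same settings can also live in a gunicorn.conf.py, which gunicorn picks up automatically from the working directory. This is a sketch mirroring the command-line flags above, not a file from the repo:

```python
# gunicorn.conf.py equivalent of:
#   gunicorn --bind=0.0.0.0 --timeout=600 --worker-class=gevent \
#            --worker-connections=1000 --workers=3 app:app
bind = "0.0.0.0"
timeout = 600                # seconds before a hung worker is killed
worker_class = "gevent"      # cooperative workers that don't block on I/O
worker_connections = 1000    # max simultaneous clients per gevent worker
workers = 3
```

With this file alongside the app, the startup command shrinks to just `gunicorn app:app`.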

tickx-cegeka commented 1 year ago

@mbrenigjones Thank you for providing the steps to make it async. However, I would like to set that startup command in the Bicep configuration, so that we don't have to change those settings every time. I can't find where it should be added; can someone figure it out? Thanks!

pamelafox commented 1 year ago

You can override the startup command by specifying appCommandLine in the Bicep, either as a filename (pointing at a shell script) or as the actual command. For example, in another Flask app, I have this startup command:

https://github.com/pamelafox/flask-db-quiz-example/blob/main/infra/main.bicep#L65

That points at this startup.sh script: https://github.com/pamelafox/flask-db-quiz-example/blob/main/src/startup.sh

As a best practice, I have a gunicorn.conf.py file with the gunicorn configuration, which lets me vary the number of workers based on CPU count:

https://github.com/pamelafox/flask-db-quiz-example/blob/main/src/gunicorn.conf.py
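The CPU-scaling idea can be sketched like this, using the common (2 × cores) + 1 heuristic from the gunicorn documentation; the exact values in the linked file may differ:

```python
# Sketch of a gunicorn.conf.py that sizes the worker pool from the CPU
# count, so the same file works across differently sized App Service plans.
import multiprocessing

bind = "0.0.0.0"
timeout = 600
num_cpus = multiprocessing.cpu_count()
workers = (num_cpus * 2) + 1   # at least 3, even on a single-core plan
```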

kruselegal commented 1 year ago

thanks @kikaragyozov. I've also had the random hanging issue. That patch will help me improve reliability too.

Although this is "just a simple demo", it's more than enough to build a working tool and iterate on. It's certainly the most complete demo I've found that uses Azure. Thanks @pablocastro!

> Hi @kikaragyozov, glad that's useful. As this repository is just a simple demo I can understand why it's not async, but I've found you can adapt it very easily. ... Hope that helps you too.

pamelafox commented 1 year ago

I just merged a change to the Bicep that sets PYTHON_ENABLE_GUNICORN_MULTIWORKERS to 'true', so that should be a big help.

However, I am also going to send a PR for overriding appCommandLine, as I think it might be worth experimenting with other worker classes (like the gevent one mentioned here). Ideally we'd do some load testing to determine the optimal gunicorn configuration.

pamelafox commented 1 year ago

Here's another PR that adds a custom startup script:

https://github.com/Azure-Samples/azure-search-openai-demo/pull/464

You should be able to modify that to change the worker class. Relevant docs here: https://docs.gunicorn.org/en/latest/design.html#choosing-a-worker-type

pamelafox commented 1 year ago

My change is now merged, so you can easily customize the gunicorn configuration. Please do share if you find better settings than the current ones in gunicorn.conf.py.

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.