benoitc / gunicorn

gunicorn 'Green Unicorn' is a WSGI HTTP Server for UNIX, fast clients and sleepy applications.
http://www.gunicorn.org

Gunicorn worker hangs and closes connections #3314

Open dantebarba opened 3 days ago

dantebarba commented 3 days ago

Hi

I've been dealing with this issue since we moved our application from the Flask development server to a WSGI server (Gunicorn), and I'm unable to find a solution to it.

Runtime environment

python==3.7.16
gunicorn==23.0.0
Flask==1.1.2
Docker: yes
Docker image: python:3.7.16-slim
VM: GCP e2-small with ContainerOS

Dockerfile

FROM python:3.7.16-slim

RUN apt update && apt install -y build-essential libssl-dev libffi-dev

WORKDIR /app

COPY requirements.txt requirements.txt

RUN pip install pip==20.0.1 && pip install -r requirements.txt

ARG VERSION=""
ARG BUILD_TIMESTAMP=""
ARG BUILD_ENVIRONMENT="test"

ENV APP_SETTINGS="config.StagingConfig"
ENV FLASK_APP=create_app.py
ENV FLASK_ENV "production"
ENV VERSION $VERSION
ENV BUILD_TIMESTAMP $BUILD_TIMESTAMP
ENV BUILD_ENVIRONMENT $BUILD_ENVIRONMENT
ENV LOG_LEVEL "INFO"
ENV EMAIL_USE_SSL "True"
ENV REDIS_URL "redis-node"
# setting max worker timeout to match Cloudflare max timeout
ENV WORKER_TIMEOUT "100"
ENV PYTHONFAULTHANDLER "1"
ENV GRPC_POLL_STRATEGY "epoll1"
# default is 2048
ENV GUNICORN_BACKLOG="2048"

COPY . .

EXPOSE 80

RUN mkdir -p /app/log && touch /app/log/client_library.log

CMD gunicorn --worker-class=gthread --workers=3 --threads=4 wsgi:app --bind 0.0.0.0:80 --timeout ${WORKER_TIMEOUT} --access-logfile /dev/null --error-logfile - --log-level ${LOG_LEVEL} --limit-request-line 4094 --limit-request-fields 100 --limit-request-field_size 8190 --backlog ${GUNICORN_BACKLOG}
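
For reference, the same command line maps onto a gunicorn.conf.py like the sketch below (the file name and the hard-coded fallbacks are just for illustration; we currently pass everything as CLI flags):

# gunicorn.conf.py - sketch mirroring the CMD flags above
import os

worker_class = "gthread"          # threaded worker, as in --worker-class=gthread
workers = 3                       # --workers=3
threads = 4                       # --threads=4
bind = "0.0.0.0:80"

# worker timeout matched to the Cloudflare limit via WORKER_TIMEOUT
timeout = int(os.environ.get("WORKER_TIMEOUT", "100"))
backlog = int(os.environ.get("GUNICORN_BACKLOG", "2048"))

loglevel = os.environ.get("LOG_LEVEL", "info").lower()
accesslog = None                  # access log effectively disabled (/dev/null in the CMD)
errorlog = "-"                    # error log to stderr

limit_request_line = 4094
limit_request_fields = 100
limit_request_field_size = 8190

Started with gunicorn -c gunicorn.conf.py wsgi:app, this behaves like the CMD above and makes it easier to tweak worker/thread counts between tests.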

Description

We started experiencing random hangs in the application; we noticed because our uptime monitor would alert us. Downtime usually lasts about 3-5 minutes. Analyzing the logs, we found that these hanging events are usually preceded by a request spike.

Our first attempt was to change the worker and thread configuration. We tested various combinations, from 1 worker and 1 thread up to 8 workers and 2 threads, and all of them showed similar issues under stress tests. The one configured with 1 worker and 1 thread was the fastest to freeze, after only 10 requests.
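
For context, the stress tests boil down to firing many concurrent requests at a single endpoint, roughly the kind of load in this sketch (the URL, request count and concurrency here are placeholders, not the actual test parameters):

# stress_sketch.py - fire concurrent GET requests at the app (illustration only)
from concurrent.futures import ThreadPoolExecutor
import urllib.request

URL = "http://localhost:80/"   # placeholder target
TOTAL = 200                    # placeholder request count
CONCURRENCY = 20               # placeholder concurrency

def hit(_):
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            return resp.status
    except Exception as exc:
        return type(exc).__name__

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(hit, range(TOTAL)))

# count successes vs. timeouts/errors
print({outcome: results.count(outcome) for outcome in set(results)})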

One thing we noticed was that the application would come back to life after emitting a burst of [DEBUG] Closing connection. log entries.

(screenshot: burst of [DEBUG] Closing connection. log entries)

This issue only happens when deploying to a VM; in my local environment (MacBook Air M1) it does not happen: the application can serve multiple requests and all stress tests were successful.

Here is a stress test sample

(screenshot: stress test results)

Any thoughts?
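
One thing I can try on the next hang: registering a dump signal in the app so I can capture every thread's stack from a frozen worker. This is only a sketch building on the PYTHONFAULTHANDLER setting above; SIGUSR2 is an assumption, chosen because gunicorn workers already handle SIGUSR1 themselves:

# sketch: add near the top of create_app.py / wsgi.py to allow on-demand stack dumps
import faulthandler
import signal
import sys

# After this, `kill -USR2 <worker pid>` (from inside the container) prints the
# traceback of every thread to stderr, i.e. into the gunicorn error log,
# without terminating the worker.
faulthandler.register(signal.SIGUSR2, file=sys.stderr, all_threads=True)

That should at least show whether the worker threads are stuck inside a C extension call or waiting on something in the application.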

pajod commented 2 days ago

The configuration seems to omit a possibly relevant dependency (GRPC_POLL_STRATEGY, libffi-dev, and what's up with the EoL Python & pip versions?) - probably worth a shot at bisecting dependencies to rule out that a loaded C module is misbehaving.
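
For example, something along these lines (purely an illustration) run inside the app would show which compiled extension modules actually end up loaded in a worker:

# sketch: list compiled extension modules currently loaded in the process
import sys

for name, module in sorted(sys.modules.items()):
    path = getattr(module, "__file__", None) or ""
    if path.endswith((".so", ".pyd")):
        print(name, path)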

dantebarba commented 1 day ago

The GRPC_POLL_STRATEGY configuration is due to the following issue with grpcio: https://github.com/grpc/grpc/issues/29044. We use gRPC to connect to GCP services. libffi-dev was added to support the cffi and cryptography packages.
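
For reference, the same pinning can also be done from Python, as long as it happens before grpc is first imported (a sketch; in our case it is set via the Dockerfile ENV instead):

# sketch: pin the grpcio poll strategy before grpc is imported anywhere
import os

os.environ.setdefault("GRPC_POLL_STRATEGY", "epoll1")

import grpc  # must come after the environment variable is set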

The main issue is that the Docker image runs perfectly fine on all of my local environments. My first assumption was some kind of firewall issue with Cloudflare or our load balancer, but that was quickly ruled out: during a stress test, if I log into the VM and do a simple curl localhost, the application does not respond, so nothing is blocking the requests. We also have an external Redis instance running on GCP, but that shouldn't be an issue since the test call doesn't even interact with the cache.

This is a sample from my current local machine. The same results were achieved (with lower performance) on an M1 laptop. The hanging issue only occurs on the VM.

Local machine sample

VM memory when non-responsive (I can still log in via SSH without any issues, and even get into the container):

               total        used        free      shared  buff/cache   available
Mem:            1982        1165         317           2         499         673
Swap:              0           0           0