boostorg / website-v2

New Boost website
https://boost.io
Boost Software License 1.0

redis client connections #1144

Open sdarwin opened 1 month ago

sdarwin commented 1 month ago

If you search the website codebase, you'll find a third-party plugin, "health_check", in INSTALLED_APPS: https://pypi.org/project/django-health-check/
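
For reference, a typical django-health-check setup looks roughly like the sketch below. The exact set of check apps and the URL path used by this site are assumptions for illustration, not copied from the repo.

# settings.py (sketch) -- enable the plugin and the individual checks
INSTALLED_APPS = [
    # ... existing apps ...
    "health_check",                 # core plugin
    "health_check.db",              # database check
    "health_check.cache",           # cache backend check
    "health_check.contrib.celery",  # celery worker check
]

# urls.py (sketch) -- the endpoint the external monitor polls
from django.urls import include, path

urlpatterns = [
    path("health/", include("health_check.urls")),
]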

Last week, I set up external monitoring so that the health check is contacted every 5 minutes. It returns HTTP 200 if all checks pass and HTTP 500 if any of them fail. Perfect. The checks were all passing.
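
The monitor is effectively doing the equivalent of the following (a minimal sketch; the endpoint path and timeout are assumptions):

import requests

# Probe the health endpoint the way the external monitor does every 5 minutes.
resp = requests.get("https://boost.io/health/", timeout=10)
if resp.status_code == 200:
    print("all checks passed")
else:
    # django-health-check responds with HTTP 500 when any individual check fails
    print(f"health check failed: HTTP {resp.status_code}")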

After 5 days, there was an outage. It appears that every health check opens a connection to the Redis Memorystore and does not close it. There were 44,000 open connections to Redis, which can crash the app.
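
One way to confirm the count on the Redis side (a sketch using the standard redis-py client, assuming network access to the Memorystore host from a pod; REDIS_HOST mirrors the env var used in the settings):

import os
import redis

r = redis.Redis(host=os.environ["REDIS_HOST"], port=6379)
# INFO clients reports how many client sockets the server currently holds;
# this is the number that had grown to ~44,000 before the outage.
print(r.info("clients")["connected_clients"])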

What is unknown is whether this bug is specific to django-health-check and doesn't affect the rest of the website, or whether it has uncovered a problem in the website itself that could show up later with many visitors.

Databases usually have connection pooling that limits the number of connections. What about Celery?
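
For comparison, this is roughly what a bounded pool looks like with redis-py. Whether Celery and django-health-check actually go through such a pool here is exactly the open question; the values below are illustrative:

import redis

# A shared pool caps the number of sockets to the server; a caller that needs
# a 31st simultaneous connection gets an error instead of silently opening
# another socket.
pool = redis.ConnectionPool(host="redis", port=6379, max_connections=30)
client = redis.Redis(connection_pool=pool)
client.ping()  # borrows a connection from the pool and returns it afterwards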

I will open an Issue on django-health-check also.

kennethreitz commented 1 month ago

Nice catch! I'll look to see if there's anything we can do on the code side to fix this.

sdarwin commented 1 month ago

OK, a correction: the Prometheus default check interval is 10 seconds, so that accounts for the large number after 5 days.

kennethreitz commented 1 month ago

Celery broker settings can be configured like so:

BROKER_TRANSPORT_OPTIONS = {
   'max_connections': 30
}

Did we determine whether the connections are coming from Celery itself, or from your Prometheus health check (which sounds likely)?

sdarwin commented 1 month ago

'max_connections': 30

Yes, let's try that.

The connections are not coming directly from Prometheus itself.

The monitoring system contacts the healthcheck webpage.

In turn, Django opens a Redis connection. On the web pods, the connections to Redis were observed to be originating from a Python process (that is, Celery or Django).
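
To make the suspicion concrete, this is not the actual django-health-check code, just a sketch of the kind of per-request pattern that would produce this symptom (names and keys are hypothetical):

import redis

def leaky_health_check(redis_host):
    # A brand-new client, with its own connection pool and socket, is created
    # on every request to the health endpoint ...
    client = redis.Redis(host=redis_host, port=6379)
    client.set("health-check-probe", "ok")
    value = client.get("health-check-probe")
    # ... and client.close() is never called, so whether the socket ever goes
    # away is left to garbage collection -- which matches connections piling up.
    return value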

kennethreitz commented 1 month ago

https://github.com/boostorg/website-v2/pull/1164

kennethreitz commented 1 month ago

@sdarwin now that #1164 is merged (currently deploying), let's test the connections and see if this resolves the problem.

sdarwin commented 1 month ago
root@boost-6549cf5b7c-r2pxt:/code/config# netstat -anptu | grep 6379 | tail -n 10
tcp        0      0 10.38.128.156:49152     10.67.224.3:6379        ESTABLISHED 7/python3           
tcp       86      0 10.38.128.156:48164     10.67.224.3:6379        ESTABLISHED 7/python3           
tcp       86      0 10.38.128.156:57870     10.67.224.3:6379        ESTABLISHED 7/python3           
tcp        0      0 10.38.128.156:43756     10.67.224.3:6379        ESTABLISHED 7/python3           
tcp        0      0 10.38.128.156:44022     10.67.224.3:6379        ESTABLISHED 7/python3           
tcp        0      0 10.38.128.156:42024     10.67.224.3:6379        ESTABLISHED 7/python3           
tcp        0      0 10.38.128.156:52698     10.67.224.3:6379        ESTABLISHED 7/python3           
tcp       86      0 10.38.128.156:53522     10.67.224.3:6379        ESTABLISHED 7/python3           
tcp      258      0 10.38.128.156:36664     10.67.224.3:6379        ESTABLISHED 7/python3           
tcp        0      0 10.38.128.156:51390     10.67.224.3:6379        ESTABLISHED 7/python3           
root@boost-6549cf5b7c-r2pxt:/code/config# 
root@boost-6549cf5b7c-r2pxt:/code# netstat -anptu | grep 6379 | wc
   1107    7749  111807

1107 and climbing steadily every 10-15 seconds.

Not necessarily the same issue: "Celery does not honor max_connections_redis or max_connections" https://github.com/celery/celery/issues/4994

sdarwin commented 1 month ago

Which process is it? PID 7

root@boost-6549cf5b7c-r2pxt:/code/config# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 18:55 ?        00:00:01 /venv/bin/python3 /venv/bin/gunicorn -c /code/gunicorn.conf.py config.wsgi
root           7       1  2 18:55 ?        00:03:47 /venv/bin/python3 /venv/bin/gunicorn -c /code/gunicorn.conf.py config.wsgi
root         712       0  0 21:43 pts/0    00:00:00 bash
root        1084     712  0 21:53 pts/0    00:00:00 ps -ef
root@boost-6549cf5b7c-r2pxt:/code/config# 

For reference, here is the Celery section of the Django settings:

# Configure Celery
CELERY_BROKER_URL = f"redis://{REDIS_HOST}:6379"
CELERY_RESULT_BACKEND = f"redis://{REDIS_HOST}:6379"
CELERY_ACCEPT_CONTENT = ["application/json"]
CELERY_TASK_SERIALIZER = "json"
CELERY_RESULT_SERIALIZER = "json"
CELERY_TIMEZONE = "UTC"
CELERY_BROKER_TRANSPORT_OPTIONS = {
    "max_connections": env.int("MAX_CELERY_CONNECTIONS", default=60)
}
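
For context on the CELERY_ prefix above: those names are picked up by the standard Django/Celery integration, roughly as below. The exact contents of the project's celery.py are an assumption here, not copied from the repo.

# celery.py (sketch)
from celery import Celery

app = Celery("config")
# With namespace="CELERY", CELERY_BROKER_TRANSPORT_OPTIONS in Django settings
# is read as broker_transport_options by the Celery app.
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()
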
sdarwin commented 3 weeks ago

an idea to debug:

frankwiles commented 1 week ago

We've not run into this situation with these packages/versions, and we have a lot of long-running Celery out in the wild, FYI. Maybe try setting this to recycle workers with a low value for a while, to see whether it is Celery or django-health-check?

https://docs.celeryq.dev/en/stable/userguide/configuration.html#worker-max-tasks-per-child
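
In the Django-namespaced settings that experiment would look roughly like this (the value is arbitrary, just low enough to recycle workers frequently):

# settings.py (sketch): recycle each Celery worker child after N tasks, which
# also drops whatever connections that child process was holding open.
CELERY_WORKER_MAX_TASKS_PER_CHILD = 50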

sdarwin commented 1 week ago

Hi Frank, the issue doesn't appear to come from those packages/versions by themselves, since that setup has been in place on boost.io without problems. The issue first appeared when I pointed Prometheus monitoring at the healthcheck endpoint, polling every 15 seconds. That resulted in thousands of connections over the course of time. Your comment about worker-max-tasks-per-child adds insight. Consider what a Celery worker looks like:

root           1       0  0 Aug08 ?        00:07:50 /venv/bin/python3 /venv/bin/celery -A config worker --concurrency=5 --loglevel=debug

and compare that to a Django server process:

root           1       0  0 Aug08 ?        00:00:47 /venv/bin/python3 /venv/bin/gunicorn -c /code/gunicorn.conf.py config.wsgi

The long-running connections belong to the latter, the gunicorn processes, so adjusting a Celery config might not affect them.

Here is an idea. Could it be that the health check has two steps:

  1. Write a value into Redis (using Celery?)
  2. Read the value from Redis (directly, by Django or django-health-check?)

The open connections would correspond to the second step. Is Django or django-health-check reading directly from Redis, opening a Redis connection but then never closing it?
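
If that turns out to be the case, the usual shape of the fix is to reuse one client or pool per process instead of creating one per request. A minimal sketch (names are illustrative, not a patch against django-health-check):

import os
import redis

# One pool per gunicorn worker process, created once at import time.
_pool = redis.ConnectionPool(
    host=os.environ.get("REDIS_HOST", "localhost"),
    port=6379,
    max_connections=20,
)

def read_health_value(key="health-check-probe"):
    # Each request borrows a connection from the shared pool and returns it,
    # so the number of sockets to :6379 stays bounded no matter how often
    # the monitoring polls the endpoint.
    client = redis.Redis(connection_pool=_pool)
    return client.get(key)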