ansible-community / ara

ARA Records Ansible and makes it easier to understand and troubleshoot.
https://ara.recordsansible.org
GNU General Public License v3.0

Improvement in gunicorn container settings #322

Open · hille721 opened 3 years ago

hille721 commented 3 years ago

What is the idea?

I'm not sure the current gunicorn settings in the official ara images are really optimized for container usage:

--workers=4

Starting 4 workers means 4 processes inside the container, which is vertical scaling inside the container. But isn't using containers about horizontal scaling? Instead of spawning more processes in one container, we would just use more containers.

I found this nice guide: https://pythonspeed.com/articles/gunicorn-in-docker/ and also tried its recommended settings. With them I am able to spawn more containers, each with fewer resources, which works much better on my container platform (OpenShift).
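Roughly, the settings I tried based on the guide look like this (a sketch, not the exact command from my deployment):

    # fewer worker processes, threads for concurrency, heartbeat file on tmpfs,
    # logs to stdout so the container runtime can collect them
    gunicorn ara.server.wsgi \
        --workers=2 --threads=4 --worker-class=gthread \
        --worker-tmp-dir /dev/shm \
        --log-file=- --access-logfile=- \
        --bind 0.0.0.0:8000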

The guide is from 2019 and I'm not an expert on the topic, but maybe there are some here who can jump into the discussion :)

dmsimard commented 3 years ago

Hi and thanks for the issue!

To be fair, I must point out that the docs make no claim that the container images published by the project are intended or optimized for production use at a large scale:

The scripts are designed to yield images that are opinionated and “batteries-included” for the sake of simplicity. They install the necessary packages for connecting to MySQL and PostgreSQL databases and set up gunicorn as the application server.

You are encouraged to use these scripts as a base example that you can build, tweak and improve the container image according to your specific needs and preferences.

For example, precious megabytes can be saved by installing only the things you need, and you can change the application server as well as its configuration.

That is not to say that we cannot improve the base image we publish but the objective is more about getting people started quickly and then allowing users to tweak on their own by showing them how the sausage is made.

That said, it wouldn't be a bad idea to benchmark different approaches and settings to find out what works best and what doesn't so we can make an informed decision. I personally like gunicorn but there's also uwsgi and other ways to run the application if people really want to.

Edit: links to existing benchmarks:

VannTen commented 1 month ago

It looks like gunicorn and containers don't go very well together

We're currently doing a POC with ara on Kubernetes to record our playbook runs, using the images provided, and consistently getting WORKER TIMEOUT errors (doing simple curl calls with not much data, using sqlite for now since we're just trying ara out):


127.0.0.1 - - [10/Oct/2024:13:14:00 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:14:03 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:14:10 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:14:11 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:14:13 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:14:14 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:14:23 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
[2024-10-10 13:24:16 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:65)
[2024-10-10 13:24:17 +0000] [1] [ERROR] Worker (pid:65) was sent SIGKILL! Perhaps out of memory?
[2024-10-10 13:24:17 +0000] [103] [INFO] Booting worker with pid: 103
[2024-10-10 13:24:52 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:67)
[2024-10-10 13:24:53 +0000] [1] [ERROR] Worker (pid:67) was sent SIGKILL! Perhaps out of memory?
[2024-10-10 13:24:53 +0000] [104] [INFO] Booting worker with pid: 104
[2024-10-10 13:25:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:101)
[2024-10-10 13:25:28 +0000] [1] [ERROR] Worker (pid:101) was sent SIGKILL! Perhaps out of memory?
[2024-10-10 13:25:28 +0000] [105] [INFO] Booting worker with pid: 105
127.0.0.1 - - [10/Oct/2024:13:25:57 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
127.0.0.1 - - [10/Oct/2024:13:29:19 +0000] "GET / HTTP/1.1" 200 231491 "-" "curl/8.10.1"
hille721 commented 1 month ago

What do you have for gunicorn settings? I have the following:

gunicorn --workers=2 --threads=4 --worker-class=gthread --worker-tmp-dir /dev/shm --log-file=- --access-logfile=- --bind 0.0.0.0:8000 ara.server.wsgi

With that, ara has been running on Kubernetes for years.

dmsimard commented 1 month ago

Hi @VannTen and thanks for the feedback (also merci for working on kubespray :heart:).

I stand by my previous comment that we aren't specifically tuning the container images for scale or performance, but they should work, and if there is anything we can do to make them run better we should consider it.

The recent blog post you shared is interesting and didn't exist when we last looked at this. The takeaway:

tl;dr The conventional wisdom to use multiple workers in a containerized instance of Flask/Django/anything that is served with gunicorn is incorrect - you should only use one worker per container, otherwise you’re not properly using the resources allocated to your application. Using multiple workers per container also runs the risk of OOM SIGKILLs without logging, making diagnosis of issues much more difficult than it would be otherwise.

I don't personally have ara deployed in k8s right now, but I am willing to work with you to find out if this is true in the context of ara, while putting the odds in our favour by doing two more things (that are part of general performance troubleshooting tips):

The container images currently ship with this command: https://github.com/ansible-community/ara/blob/5fe28b7eeaa080664cb1bcbc1f1db4bd6d38ea11/contrib/container-images/fedora-pypi.sh#L24

For the sake of simplicity I have gone ahead and rebuilt the latest image, changing only the number of workers from 4 to 1. (@hille721 if you have any information or data regarding your additional settings, maybe we can test that too.)

You can try this image here: docker.io/dmsimard/ara-dont-use-this-for-prod:one-worker (via https://hub.docker.com/repository/docker/dmsimard/ara-dont-use-this-for-prod/general)
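Running it locally to poke around would be something like this (a sketch; sqlite by default, same port 8000 as the stock image):

    podman run --rm --name ara-api -p 8000:8000 \
        docker.io/dmsimard/ara-dont-use-this-for-prod:one-worker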

If you want to use MySQL, the ara server container should run with environment variables that look like this:

ARA_DATABASE_CONN_MAX_AGE: 60
ARA_DATABASE_ENGINE: django.db.backends.mysql
ARA_DATABASE_HOST: mysql.host.name
ARA_DATABASE_NAME: ara
ARA_DATABASE_PASSWORD: password
ARA_DATABASE_PORT: 3306
ARA_DATABASE_USER: ara

For PostgreSQL:

ARA_DATABASE_CONN_MAX_AGE: 60
ARA_DATABASE_ENGINE: django.db.backends.postgresql
ARA_DATABASE_HOST: postgresql.host.name
ARA_DATABASE_NAME: ara
ARA_DATABASE_PASSWORD: password
ARA_DATABASE_PORT: 5432
ARA_DATABASE_USER: ara
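
If you end up testing outside of k8s with podman or docker, those settings translate to -e flags, roughly like this (hypothetical values, mirroring the PostgreSQL example above):

    podman run --rm --name ara-api -p 8000:8000 \
        -e ARA_DATABASE_ENGINE=django.db.backends.postgresql \
        -e ARA_DATABASE_HOST=postgresql.host.name \
        -e ARA_DATABASE_PORT=5432 \
        -e ARA_DATABASE_NAME=ara \
        -e ARA_DATABASE_USER=ara \
        -e ARA_DATABASE_PASSWORD=password \
        -e ARA_DATABASE_CONN_MAX_AGE=60 \
        docker.io/dmsimard/ara-dont-use-this-for-prod:one-worker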

Please let me know how that works out and if you have any interesting findings we can work with.

Thanks !

VannTen commented 1 month ago

No problem, scale testing is painful.

However, this was just a POC to try out the UI and get a feel for how we would use ARA (which is why we haven't put more than sqlite behind it).

We hit the worker timeouts pretty much immediately, even with little to no data recorded.

SQLite lock contention seems a pretty unlikely culprit to me, given the volume (no more than 1 query and something like 25 playbooks with 1-3 tasks each).

Nevertheless, if/when we put an actual DB behind it, we'll see if this changes anything and report back. (We don't put databases in Kubernetes yet, unfortunately, because we don't have performant storage directly available in the clusters.)

https://pythonspeed.com/articles/gunicorn-in-docker/ seems pretty interesting and the reasoning behind the options seems pretty solid to me; it's probably what we're going to try next (--workers=1 ended up not making much of a difference, unfortunately).

dmsimard commented 1 month ago

For what it's worth, the ara server doesn't /need/ to run with gunicorn. Any WSGI server that can serve Django will work as well (uwsgi, Apache mod_wsgi, etc.).
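For example, an untested sketch of running it under uwsgi (same ara.server.wsgi module; the port and worker counts are arbitrary):

    pip install uwsgi
    # serve the Django WSGI application directly over HTTP
    uwsgi --http :8000 --master --module ara.server.wsgi --processes 2 --threads 4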

We feel the same about databases in k8s, the database server can run outside on bare metal or on a VM, etc., just have to be mindful of the network latency between the ara server and the database server.

That said, it feels like we might be missing something because the performance shouldn't be /that/ bad and errors shouldn't come up so easily, especially if you aren't running concurrent playbooks which could run into the sqlite lock issues.

Are you able to reproduce the kind of issues you are seeing if you run the container outside of k8s? I mean locally with podman or docker.
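Something along these lines, as a hypothetical repro (assuming the image we publish and its default port):

    # run the stock image locally...
    podman run --rm -d --name ara-api -p 8000:8000 quay.io/recordsansible/ara-api:latest
    # ...poke it repeatedly and watch for slow responses
    for i in $(seq 1 100); do
        curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" http://127.0.0.1:8000/
    done
    # then check whether any workers got killed
    podman logs ara-api | grep -i "worker timeout"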

dmsimard commented 1 month ago

https://pythonspeed.com/articles/gunicorn-in-docker/ seems pretty interesting and the reasoning behind the options seems pretty solid to me; it's probably what we're going to try next (--workers=1 ended up not making much of a difference, unfortunately).

It suggests using --workers=2 --threads=4 --worker-class=gthread --worker-tmp-dir /dev/shm, which was part of the command @hille721 provided. I can build an image with that so we can compare, but it will be later -- I'm about to board a flight back home :p

dmsimard commented 1 month ago

I put up an image with those settings: docker.io/dmsimard/ara-dont-use-this-for-prod:w2-t4-gthread-shm

I will also do some testing on my end out of curiosity.

dmsimard commented 1 month ago

I have used an approach similar to the benchmarking blog posts (database backends, ansible versions & ara) to test whether there is a significant difference between the current image and the "tweaked" settings.

This is running locally on the same machine (16 cores, 32GB of RAM, modest SSDs) on Fedora 40.

The results:

[Screenshot: benchmark results, stock (current image)]

[Screenshot: benchmark results, 2 workers, 4 threads, gthread, /dev/shm]

So, yes, while the numbers are slightly better using the 2 workers / 4 threads / gthread and /dev/shm options, the difference is almost negligible in practice: the benchmarking test playbook does nothing 10,000 times really fast.

In any case, I am unable to reproduce the extreme sluggishness you are seeing.

I will leave it at that for now but I am interested to learn if you find out anything.

Edit: tangentially related, these numbers are better than the ones last benchmarked by a significant margin:

[Screenshot: comparison with the numbers from the previous benchmark]

Maybe we are due for a new blog post :)

VannTen commented 1 month ago

I think the most likely culprit is /tmp. (We'll test using a memory-backed emptyDir next week.)
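Concretely, something like this is what we have in mind (a sketch, assuming gunicorn keeps its heartbeat file under the default --worker-tmp-dir, which usually resolves to /tmp):

    # mount a memory-backed emptyDir over /tmp so the heartbeat file never blocks on disk I/O
    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: ara-api-test
    spec:
      containers:
        - name: ara-api
          image: quay.io/recordsansible/ara-api:latest
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir:
            medium: Memory
    EOF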

I need to confirm this, but I think docker/podman run mounts a tmpfs on /tmp, which is not the case in a K8s pod.

And even if they don't, it's very possible that the difference between SSDs on bare metal and virtualized storage (I need to recheck exactly what we have for the containers' writable layer) explains why you can't reproduce it.

We'll continue to investigate and will report back :)

Thanks !

dmsimard commented 2 weeks ago

Hi @VannTen, I'm reaching out to see if you ended up finding anything interesting.

Thanks,

VannTen commented 2 weeks ago

Hi, thanks for the ping ^

We ended up with roughly this:

    python3 -m gunicorn ara.server.wsgi \
        --workers=2 \
        --threads=4 \
        --worker-class=gthread \
        --worker-tmp-dir /dev/shm

This stopped the worst offenders but we still had some timeouts; switching to postgres made everything much smoother.

I'm not sure what made sqlite so bad. Maybe it's the interaction with overlayfs :thinking:.

We're now testing running kubespray to upgrade our clusters with ara enabled, to see (roughly) what the overhead is. (I've been looking with interest at the discussion in #459 meanwhile.)

dmsimard commented 2 weeks ago

Thanks for reporting back :D

I have not revisited the topic of making the callback less blocking in a while and it could be worth looking into again.

With some time to think about it, the approach used in https://gist.github.com/phemmer/8ee4ea0ebf1b389050ce4a2bd78c66d6 could be shipped as an additional callback that people can use if need be. I need some time to test it out.

I will also add it to my to-do list for benchmarking I will be doing in the not-too-distant future.