Open WesleyAC opened 1 year ago
Some discussion in https://github.com/benoitc/gunicorn/issues/1056 suggests that --preload may be part of the problem here, which would make sense. I was using preload to cut down on memory, since we're generally flying pretty close to the sun on memory usage, but potentially I could reduce the number of workers and still get better performance with gevent than with sync workers.
Another relevant gunicorn issue: https://github.com/benoitc/gunicorn/issues/1566. It seems highly likely that this is triggered by using --preload, and that gevent without preload would work fine.
In case it's useful, here's the current (gthread) gunicorn config that I'm using on bookwyrm.social:

gunicorn bookwyrm.wsgi:application --workers 6 --threads 5 --preload --max-requests 100 --max-requests-jitter 100 --bind 0.0.0.0:8000

It's tuned to the particular load, memory, and CPU constraints that we have, but it seems to be fairly stable.
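For reference, the same flags can be expressed as a gunicorn configuration file, which makes the deployment easier to version-control. This is a sketch using gunicorn's documented config-file settings (gunicorn looks for `gunicorn.conf.py` by default, or any path passed with `-c`; the `wsgi_app` setting requires gunicorn 20.1+):

```python
# gunicorn.conf.py -- equivalent of the command-line flags above,
# using gunicorn's configuration-file settings.
wsgi_app = "bookwyrm.wsgi:application"
workers = 6
threads = 5               # threads > 1 selects the gthread worker class
preload_app = True        # same as the --preload flag
max_requests = 100        # recycle workers to bound memory growth
max_requests_jitter = 100 # stagger restarts so workers don't recycle at once
bind = "0.0.0.0:8000"
```

With this file in place, the server starts with just `gunicorn` and no flags.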
By default, we use synchronous gunicorn workers, which, given #2717, causes performance problems under load. Using gevent workers seemed to fix this (from testing on bookwyrm.social with

gunicorn bookwyrm.wsgi:application --worker-class gevent --worker-connections 2048 --workers 18 --preload --max-requests 300 --max-requests-jitter 300 --bind 0.0.0.0:8000

), but leads to the following problem:

/usr/local/lib/python3.9/site-packages/gunicorn/workers/ggevent.py:53: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. It may also silently lead to incorrect behaviour on Python 3.7. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016. Modules that had direct imports (NOT patched): ['aiohttp.web (/usr/local/lib/python3.9/site-packages/aiohttp/web.py)', 'urllib3.util.ssl_ (/usr/local/lib/python3.9/site-packages/urllib3/util/ssl_.py)', 'aiohttp.client_reqrep (/usr/local/lib/python3.9/site-packages/aiohttp/client_reqrep.py)', 'botocore.httpsession (/usr/local/lib/python3.9/site-packages/botocore/httpsession.py)', 'aiohttp.connector (/usr/local/lib/python3.9/site-packages/aiohttp/connector.py)', 'aiohttp.worker (/usr/local/lib/python3.9/site-packages/aiohttp/worker.py)', 'aiohttp.web_runner (/usr/local/lib/python3.9/site-packages/aiohttp/web_runner.py)', 'aiohttp.client_exceptions (/usr/local/lib/python3.9/site-packages/aiohttp/client_exceptions.py)', 'aiohttp.client (/usr/local/lib/python3.9/site-packages/aiohttp/client.py)', 'urllib3.util (/usr/local/lib/python3.9/site-packages/urllib3/util/__init__.py)'].
This indeed causes many RecursionErrors in practice. It seems that any request making an outgoing HTTPS connection (most of which are described in #2717) triggered this issue, resulting in a 500 error. I think #2730 and #2729 may have been instances of this problem (although I'm less sure about #2729, since it's unclear to me whether uploading an image would hit the same code path).
We should figure out how to get the import/monkey-patching order correct to avoid this issue, so that we can use gevent workers, which are more efficient for our use case.
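One common workaround for this ordering problem (a sketch, not tested against this codebase) is to apply gevent's monkey-patching at the very top of the WSGI entry module, before anything else imports ssl or socket. That way the patch runs first even under --preload, where the app is imported in the master process before gunicorn's gevent worker gets a chance to patch:

```python
# bookwyrm/wsgi.py -- sketch: monkey-patch before any other imports, so
# ssl, urllib3, aiohttp, botocore, etc. all see the patched stdlib even
# when the module is loaded early via --preload.
from gevent import monkey
monkey.patch_all()  # must run before ssl/socket are first imported

import os
from django.core.wsgi import get_wsgi_application

# Settings-module path is an assumption here; use whatever bookwyrm's
# wsgi.py already sets.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "bookwyrm.settings")
application = get_wsgi_application()
```

The alternative, as discussed above, is simply dropping --preload when using gevent workers: each worker then patches before importing the app, at the cost of losing preload's memory sharing.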
Apologies for the disruption caused by this. I've been pretty aggressive in making changes in prod in order to try to get bookwyrm.social running more stably, but I should've tested this change a little more carefully.