Open christianbundy opened 1 year ago
Maybe useful, here's a hacky fix I put together locally: edit `workers/gthread.py` so that `accept()` doesn't accept new requests from the socket after `force_close()` is called on the worker:
(Apologies for a crappy vim screenshot, I wanted to test this out before sitting down for dinner.)
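In code terms, the patch amounts to something like the following guard (shown here as a worker subclass rather than a direct edit to `gthread.py`, and only as a sketch; `ThreadWorker` and `self.alive` are real gunicorn names, the rest is illustrative):

```python
# Sketch of the same idea, restated as a custom worker class instead of a local edit.
from gunicorn.workers.gthread import ThreadWorker


class DrainingThreadWorker(ThreadWorker):
    def accept(self, *args):
        # Once the worker has been told to exit (e.g. after hitting
        # --max-requests), stop pulling new connections off the listening
        # socket so they stay in the kernel backlog for the replacement
        # worker instead of being accept()ed and then never read.
        if not self.alive:
            return
        super().accept(*args)
```

Run with something like `gunicorn --worker-class mymodule.DrainingThreadWorker myapp:app` (module and app names are placeholders).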
New output:
I've found a related issue, ~which is not resolved by #3039, although it'd be unlikely to be deployed in the wild~: when `max_requests >= threads`, we see the same connection reset error.
For example:
gunicorn --worker-class gthread --max-requests 4 --threads 4 myapp:app
With this config, we can reproduce a consistent connection reset with only five HTTP requests.
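One way to drive that (a sketch, not necessarily the exact client used; it assumes the default bind of 127.0.0.1:8000) is to fire the five requests concurrently:

```sh
# Send 5 concurrent requests; with --max-requests 4 the worker restarts while
# at least one connection has been accepted but not yet handled, and that
# client sees a reset (curl reports HTTP code 000).
for i in $(seq 1 5); do
  curl -s -o /dev/null -w "request $i -> HTTP %{http_code}\n" "http://127.0.0.1:8000/$i" &
done
wait
```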
After a `git bisect`, it looks like the culprit is 0ebb73aa240f0ecffe3e0922d54cfece19f5bfed (https://github.com/benoitc/gunicorn/pull/2918).
We are seeing this when making API calls to NetBox as well. The gunicorn version is 21.2.0. NGINX logs randomly show an upstream error "104: Connection reset by peer", which we correlate with "Autorestarting worker after current request." in the gunicorn logs.
@MANT5149, I'm in the same boat. After upgrading NetBox to 3.5.7 (which includes gunicorn 21.2.0), we're seeing the same issue whenever an autorestart happens:
Aug 10 11:40:43 hostname gunicorn[595183]: [2023-08-10 11:40:43 +0000] [595183] [INFO] Autorestarting worker after current request.
Aug 10 11:40:43 hostname gunicorn[595183]: [2023-08-10 11:40:43 +0000] [595183] [INFO] Worker exiting (pid: 595183)
Aug 10 11:40:44 hostname gunicorn[613565]: [2023-08-10 11:40:44 +0000] [613565] [INFO] Booting worker with pid: 613565
Aug 10 12:33:36 hostname gunicorn[622501]: [2023-08-10 12:33:36 +0000] [622501] [INFO] Autorestarting worker after current request.
Aug 10 12:33:36 hostname gunicorn[622501]: [2023-08-10 12:33:36 +0000] [622501] [INFO] Worker exiting (pid: 622501)
Aug 10 12:33:36 hostname gunicorn[639160]: [2023-08-10 12:33:36 +0000] [639160] [INFO] Booting worker with pid: 639160
Aug 10 13:00:06 hostname gunicorn[579373]: [2023-08-10 13:00:06 +0000] [579373] [INFO] Autorestarting worker after current request.
Aug 10 13:00:06 hostname gunicorn[579373]: [2023-08-10 13:00:06 +0000] [579373] [INFO] Worker exiting (pid: 579373)
Aug 10 13:00:07 hostname gunicorn[648814]: [2023-08-10 13:00:07 +0000] [648814] [INFO] Booting worker with pid: 648814
Nginx error log:
2023/08/10 11:40:43 [error] 1092#0: *744453 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.xxx.xxx, server: netbox.example.com, request: "GET /api/tenancy/tenants/19/ HTTP/1.1", upstream: "http://127.0.0.1:8001/api/tenancy/tenants/19/", host: "netbox-api.example.com"
2023/08/10 12:33:36 [error] 1092#0: *776456 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 192.168.xxx.xxx, server: netbox.example.com, request: "GET /api/virtualization/clusters/46/ HTTP/1.1", upstream: "http://127.0.0.1:8001/api/virtualization/clusters/46/", host: "netbox-api.example.com"
2023/08/10 13:00:06 [error] 1092#0: *787694 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.xxx.xxx, server: netbox.example.com, request: "GET /api/tenancy/tenants/19/ HTTP/1.1", upstream: "http://127.0.0.1:8001/api/tenancy/tenants/19/", host: "netbox-api.example.com"
Downgrading Gunicorn to 20.1.0 fixes this issue.
Have y'all tried the patch in #3039?
Thanks for the report. It's expected to have the worker restart there. If a request has already landed in the worker when it's closing, that may happen. Maybe the latest change, which accepts requests faster, triggers it. Can you give an idea of the number of concurrent requests gunicorn is receiving, also in the usual case?
Besides that, why do you set max requests so low? This feature should only be used as a workaround for temporary memory issues in the application.
The worker restart is expected, but after 0ebb73a (#2918) every request runs through the `while self.alive` loop twice:

1. `accept()` on the socket to create the connection
2. `recv()` on the connection to handle the request and provide a response

This means that during a max-requests restart, we call `accept()` on requests but never `recv()`, which means that they aren't added to `self.futures` to be awaited during the graceful timeout.
Before the change, if you sent two requests around the same time you'd see:

- `self.alive` is True, so we:
  - `accept()` request A
  - `recv()` request A, which sets `self.alive = False` in `handle_request`
- `self.alive` is False, so we exit the loop and restart the worker
- `self.alive` is True, so we:
  - `accept()` request B
  - `recv()` request B

After the change, this becomes:
- `self.alive` is True, so we `accept()` request A
- `self.alive` is True, so we:
  - `recv()` request A, which sets `self.alive = False` in `handle_request`
  - `accept()` request B
- `self.alive` is False, so we exit the loop and restart the worker (‼️ without handling request B ‼️)
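To see what that last step means at the socket level, here's a small self-contained toy (explicitly not gunicorn code) that accepts two connections but only reads and answers one before exiting; the second client gets exactly the kind of reset described above:

```python
# Toy illustration (NOT gunicorn code): a "worker" that accept()s two
# connections but only recv()s and answers one before exiting. The other
# client sees a reset or an empty reply, just like request B above.
import socket
import threading

def worker(listener):
    conns = [listener.accept()[0] for _ in range(2)]  # accept() both requests
    conns[0].recv(1024)                               # recv() only the first
    conns[0].sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
    conns[0].close()
    conns[1].close()  # never read: pending data makes the kernel send RST

def client(name, port, results):
    s = socket.create_connection(("127.0.0.1", port))
    s.sendall(b"GET / HTTP/1.1\r\nHost: test\r\n\r\n")
    try:
        data = s.recv(1024)
        results[name] = data.split(b"\r\n")[0].decode() if data else "empty reply"
    except ConnectionResetError:
        results[name] = "connection reset by peer"

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(8)
port = listener.getsockname()[1]
threading.Thread(target=worker, args=(listener,), daemon=True).start()

results = {}
clients = [threading.Thread(target=client, args=(n, port, results)) for n in "AB"]
for t in clients:
    t.start()
for t in clients:
    t.join()
print(results)  # e.g. {'A': 'HTTP/1.1 200 OK', 'B': 'connection reset by peer'}
```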
> Can you give an idea of the number of concurrent requests gunicorn is receiving, also in the usual case?

This bug only requires two concurrent requests to a worker, but I'll often have ~4 concurrent requests per worker, and this bug means that 1 will be completed and the rest have their connections reset. ~To highlight the impact, this means that setting max-requests at 10k and averaging 4 concurrent requests will cause 0.03% of HTTP requests to fail.~ EDIT: I've included a benchmark in #3039 instead of speculating.
> Besides that, why do you set max requests so low? This feature should only be used as a workaround for temporary memory issues in the application.
That's a minimal reproducible example, not the real application + configuration I'm using in production. I'm using a range from 1,000 to 1,000,000 depending on the exact application / deployment, and completely agree that `--max-requests 10` is inappropriately low for most scenarios.
@benoitc I just added a benchmark to #3039 showing an error rate of 0.09% when we restart every 1024 requests, in case that's helpful.
> Have y'all tried the patch in https://github.com/benoitc/gunicorn/pull/3039?
@christianbundy, I've tried it and it appears to fix this issue for me
> 0.x to 5

Are these CPU utilization percentages? i.e. CPU load increases from <1% to 5%? This makes sense because previously we were blocking and waiting on IO for two seconds while waiting for new connections, but now we're looping to check for either 'new connections' or 'new data on a connection'.
No, 5 means 5 CPU cores are 100% utilized (500%); I have 5 workers configured. The arrows point to bumps where I ran a script that fires many requests against a gunicorn server.
Thanks for the context @r-lindner -- admittedly I don't think I understand the difference between 'fix from #3038' and 'fix from #3039', but I can see that the CPU utilization is significantly higher.
I've just pushed a commit to my PR branch that resolves the issue on my machine; can you confirm whether it works for you? It's a significantly different approach, which I haven't tested as thoroughly, but it seems to resolve the connection resets and also keeps the CPU utilization low.
Before: idles at ~10% CPU
After: idles at ~0.01% CPU
> Maybe useful, here's a hacky fix I put together locally: edit `workers/gthread.py` so that `accept()` doesn't accept new requests from the socket after `force_close()` is called on the worker: (Apologies for a crappy vim screenshot, I wanted to test this out before sitting down for dinner.)
> New output:
This is the 'fix from #3038', and the 'fix from #3039' was the pull request without the changes from Aug 26. I am now using the updated #3039 without CPU issues. Due to changes I made a week ago I cannot test whether the original bug is fixed, but I guess you already tested this :-) So this looks good.
When fronting gunicorn 20.1.0 with nginx 1.23.3, we observe "connection reset by peer" errors in Nginx that correlate with gthread worker auto restarts.
https://github.com/benoitc/gunicorn/issues/1236 seems related; it describes an issue specifically with keepalive connections. That issue is older and I am unsure of its current state, but this comment implies an ongoing issue. Note that the original reproduction steps in this issue, #3038, have keepalive enabled by default.
When we disable keepalives in gunicorn, we observe a latency regression but it does stop the connection reset errors.
Should there be documented guidance, for now, not to use `--max-requests` with keepalive + gthread workers?
As far as I can see, options for consumers are:
- pin gunicorn to 20.1.0,
- disable `--max-requests`,
- disable keepalive (at a latency cost), or
- run with the unreleased patch from #3039.
We face this issue exactly as described. Thanks for the report and the ongoing work on this. Is there an ETA for the publication of a fix?
Same issue here. I swapped to gthread workers from sync, and randomly, my server just stopped taking requests.
Reverted back to sync for now.
We are also running into this issue after a NetBox upgrade. Downgrading gunicorn to 20.1.0 fixes it for the moment but a proper fix would be appreciated.
We are also running into this problem after upgrading Netbox from 3.4.8 to 3.6.9, which makes gunicorn go from 20.1.0 to 21.2.0.
One of the heavier scripts works flawlessly on Netbox 3.4.8 (gunicorn 20.1.0), but on 3.6.9 (gunicorn 21.2.0) it fails with the below message and it has not failed at the exact same place twice:
Traceback (most recent call last):
  File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 394, in <module>
    main()
  File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 309, in main
    service_entry = nb_update_vs(nb_vs,sat_gw_ip,sat_gw_name,community_name,community_type,interop_hack_ip)
  File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 124, in nb_update_vs
    nb_service_entry_ip = str(nb_service_entry.ipaddresses[0]).split('/')[0]
  File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 327, in __str__
    getattr(self, "name", None)
  File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 303, in __getattr__
    if self.full_details():
  File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 459, in full_details
    self._parse_values(next(req.get()))
  File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/query.py", line 291, in get
    req = self._make_call(add_params=add_params)
  File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/query.py", line 258, in _make_call
    raise RequestError(req)
pynetbox.core.query.RequestError: The request failed with code 502 Bad Gateway but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.
/var/log/nginx/error.log:
2024/01/08 14:27:35 [error] 919#919: *110908 recv() failed (104: Unknown error) while reading response header from upstream, client: 10.10.10.74, server: netbox-test.domain.test, request: "GET /api/ipam/ip-addresses/24202/ HTTP/1.1", upstream: "http://10.10.10.5:8001/api/ipam/ip-addresses/24202/", host: "netbox-test.domain.test"
gunicorn log:
Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Autorestarting worker after current request.
Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Worker exiting (pid: 1129991)
Jan 08 14:39:26 30001vmnb02-prod gunicorn[1139845]: [2024-01-08 14:39:26 +0100] [1139845] [INFO] Booting worker with pid: 1139845
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Autorestarting worker after current request.
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Worker exiting (pid: 1129962)
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1139926]: [2024-01-08 14:44:11 +0100] [1139926] [INFO] Booting worker with pid: 1139926
Versions:
- Netbox: 3.6.9
- Python: 3.10.12
- Redis: 6.0.16
- nginx: 1.18.0 (Ubuntu)
- PostgreSQL: 14.10 (Ubuntu 14.10-0ubuntu0.22.04.1)
- gunicorn: 21.2.0
- pynetbox: 7.3.3
Linux vmnb02-test 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Is there a release underway to fix this, or should we still refrain from upgrading? In its current state, gunicorn is not production-worthy for us. :(
I have been testing a few things and this is my finding. It doesn't help if I decrease the max_requests setting; it just fails sooner. It doesn't help if I increase the max_requests setting either; it still fails at some point. When it fails, a log entry about restarting shows up simultaneously:
Example:
Jan 09 13:20:07 30001vmnb02-test gunicorn[106082]: [2024-01-09 13:20:07 +0100] [106082] [INFO] Autorestarting worker after current request.
Jan 09 13:20:07 30001vmnb02-test gunicorn[106082]: [2024-01-09 13:20:07 +0100] [106082] [INFO] Worker exiting (pid: 106082)
Jan 09 13:20:08 30001vmnb02-test gunicorn[106256]: [2024-01-09 13:20:08 +0100] [106256] [INFO] Booting worker with pid: 106256
If I set max_requests to 0, disabling it, my scripts work without error. But is this preferable to having the gunicorn processes restart regularly? I suppose it would start consuming memory, if the application has memory-leak issues, that is.
Perhaps a scheduled restart of the Netbox and netbox-rq services (thereby restarting gunicorn worker processes) once a day would do the trick?
I have come to the conclusion that rather than downgrade gunicorn and maybe lose some necessary features, I will go ahead with max_requests set to 0, and if memory usage becomes an issue on the server I will set up a scheduled job that restarts the worker processes with this command:
ps -aux | grep venv/bin/gunicorn | grep Ss | awk '{ system("kill -HUP " $2 )}'
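For example, a crontab entry along these lines (purely illustrative; the schedule is arbitrary) would recycle the workers nightly by HUPing the master:

```
# Gracefully reload gunicorn (and thereby its workers) every night at 03:00.
0 3 * * * ps -aux | grep venv/bin/gunicorn | grep Ss | awk '{ system("kill -HUP " $2) }'
```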
Just don't pass the max_requests option? I never use it myself; it's only for when I have a temporary memory leak, and never in production.
For the record, there exists a variety of situations where memory leaks are difficult to address:
We stayed on a lower release version to avoid this issue. However, we have to upgrade due to the HTTP Request Smuggling (CVE-2024-1135) vulnerability. Has anyone been able to successfully work around this issue (short of turning off max-requests)?
@rajivramanathan don't use max-requests? Max requests is there for the worst case, when your application leaks.
Benoît, I think we understand your advice, but many apps may find themselves in the "application leaks and we can't do much about it" place, hence the usefulness of max-requests.
> We stayed on a lower release version to avoid this issue. However, we have to upgrade due to the HTTP Request Smuggling (CVE-2024-1135) vulnerability. Has anyone been able to successfully work around this issue (short of turning off max-requests)?
We have NGINX in front of Gunicorn, so we addressed it by running multiple Gunicorn instances upstream, listening on different ports, and using the proxy_next_upstream configuration (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream) to try the next upstream if we encounter a 502 error.
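A minimal sketch of that kind of setup (the ports, names, and retry policy here are assumptions, not the actual config):

```nginx
upstream gunicorn_pool {
    # Two independent gunicorn instances on hypothetical ports.
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
}

server {
    listen 80;
    server_name netbox.example.com;

    location / {
        proxy_pass http://gunicorn_pool;
        # If one instance resets the connection or answers 502 during a
        # worker restart, retry the request on the other instance.
        proxy_next_upstream error http_502;
        proxy_next_upstream_tries 2;
    }
}
```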
Hello,
In our setting `max-requests` is very helpful too, and we ran into the same issue when upgrading to gunicorn>=21.
I wanted to mention that #3157 fixes it!
Just for the record, there exists a situation where the Python default memory allocator produces (under specific circumstances) very fragmented arenas which leads to the interpreter not giving back unused memory - and this might be our case. Using jemalloc (see https://zapier.com/engineering/celery-python-jemalloc/) may alleviate the issue. We are considering this, too. However if https://github.com/benoitc/gunicorn/pull/3157 is green, we will be happy to keep using max-requests.
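For anyone trying that route, the usual approach is to preload jemalloc when starting gunicorn; a sketch (the library path is the Debian/Ubuntu location for the libjemalloc2 package, and the gunicorn arguments are placeholders):

```sh
# Route CPython's allocations through jemalloc instead of the default allocator.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
    gunicorn --worker-class gthread --workers 4 myapp:app
```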
I've run into this in production, so I would love to see the patch merged. I've pulled it in and have seen a reduction in 503s from the connection resets.
We're observing intermittent HTTP 502s in production, which seem to be correlated with the "autorestarting worker after current request" log line, and are less frequent as we increase `max_requests`. I've reproduced this on 21.2.0 and 20.1.0, but it doesn't seem to happen in 20.0.4.

I've produced a minimal reproduction case following the gunicorn.org example as closely as possible, but please let me know if there are other changes you'd recommend:
Application
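A stand-in for the missing code block, assuming the app matches the gunicorn.org front-page example it is described as following:

```python
# myapp.py: assumed to be the stock gunicorn.org example WSGI app.
def app(environ, start_response):
    data = b"Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(data))),
    ])
    return iter([data])
```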
Configuration
Reproduction
Quickstart
For convenience, I've packaged this into a single command that consistently reproduces the problem on my machine. If you have Docker installed, this should Just Work™️.
Example
Logs
Expected
I'd expect to receive an HTTP 200 for each request, regardless of the max-requests restart. We should see `[DEBUG] GET /11` when the worker handles the 11th request.

Actual
The reproduction script sends `GET /11`, but the worker never sees it, and we see a connection reset instead. The repro script reports a status code of `000`, but that's just a quirk of libcurl. I've used tcpdump and can confirm the `RST`.

In case it's useful, I've also seen `curl: (52) Empty reply from server`, but it happens less frequently and I'm not 100% sure that it's the same problem.

Workaround
Increasing max-requests makes this happen less frequently, but the only way to resolve it is to disable max-requests (or maybe switch to a different worker type?). Increasing the number of workers or threads doesn't seem to resolve the problem either, from what I've seen.