christianbundy commented 1 year ago

We're observing intermittent HTTP 502s in production, which seem to be correlated with the "autorestarting worker after current request" log line, and are less frequent as we increase max_requests. I've reproduced this on 21.2.0 and 20.1.0, but it doesn't seem to happen in 20.0.4.

I've produced a minimal reproduction case following the gunicorn.org example as closely as possible, but please let me know if there are other changes you'd recommend:

Application

def app(environ, start_response):
    data = b"Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(data)))
    ])
    return iter([data])

Configuration

gunicorn --max-requests 10 --worker-class gthread myapp:app

Reproduction

#!/usr/bin/env bash

set -euo pipefail

# start gunicorn
gunicorn --max-requests 10 --log-level debug --worker-class gthread myapp:app &

# wait for gunicorn to be ready
sleep 5

# send 20 sequential HTTP requests with minimal output
for i in {1..20}; do
    curl -S -s -o /dev/null -w "[REPRO]: Received HTTP %{http_code} from GET /${i}\n" "127.0.0.1:8000/${i}" || true
done

Quickstart

For convenience, I've packaged this into a single command that consistently reproduces the problem on my machine. If you have Docker installed, this should Just Work™️.

cat <<DOCKER_EOF | docker build -t gunicorn-repro-max-requests-rst - && docker run --rm gunicorn-repro-max-requests-rst
FROM python:3.11

RUN pip install --disable-pip-version-check --no-cache gunicorn

RUN cat <<EOF > myapp.py
def app(environ, start_response):
    data = b"Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(data)))
    ])
    return iter([data])
EOF

RUN cat <<EOF > repro.sh
#!/usr/bin/env bash

set -euo pipefail

# start gunicorn
gunicorn --max-requests 10 --log-level debug --worker-class gthread myapp:app &

# wait for gunicorn to be ready
sleep 5

# send 20 sequential HTTP requests with minimal output
for i in {1..20}; do
    curl -S -s -o /dev/null -w "[REPRO]: Received HTTP %{http_code} from GET /\\\${i}\n" "127.0.0.1:8000/\\\${i}" || true
done
EOF

RUN chmod +x repro.sh

CMD ["./repro.sh"]

DOCKER_EOF

Example

Logs

``` [2023-07-21 00:20:10 +0000] [6] [DEBUG] Current configuration: config: ./gunicorn.conf.py wsgi_app: None bind: ['127.0.0.1:8000'] backlog: 2048 workers: 1 worker_class: gthread threads: 1 worker_connections: 1000 max_requests: 10 max_requests_jitter: 0 timeout: 30 graceful_timeout: 30 keepalive: 2 limit_request_line: 4094 limit_request_fields: 100 limit_request_field_size: 8190 reload: False reload_engine: auto reload_extra_files: [] spew: False check_config: False print_config: False preload_app: False sendfile: None reuse_port: False chdir: / daemon: False raw_env: [] pidfile: None worker_tmp_dir: None user: 0 group: 0 umask: 0 initgroups: False tmp_upload_dir: None secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'} forwarded_allow_ips: ['127.0.0.1'] accesslog: None disable_redirect_access_to_syslog: False access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s" errorlog: - loglevel: debug capture_output: False logger_class: gunicorn.glogging.Logger logconfig: None logconfig_dict: {} logconfig_json: None syslog_addr: udp://localhost:514 syslog: False syslog_prefix: None syslog_facility: user enable_stdio_inheritance: False statsd_host: None dogstatsd_tags: statsd_prefix: proc_name: None default_proc_name: myapp:app pythonpath: None paste: None on_starting: on_reload: when_ready: pre_fork: post_fork: post_worker_init: worker_int: worker_abort: pre_exec: pre_request: post_request: child_exit: worker_exit: nworkers_changed: on_exit: ssl_context: proxy_protocol: False proxy_allow_ips: ['127.0.0.1'] keyfile: None certfile: None ssl_version: 2 cert_reqs: 0 ca_certs: None suppress_ragged_eofs: True do_handshake_on_connect: False ciphers: None raw_paste_global_conf: [] strip_header_spaces: False [2023-07-21 00:20:10 +0000] [6] [INFO] Starting gunicorn 21.2.0 [2023-07-21 00:20:10 +0000] [6] [DEBUG] Arbiter booted [2023-07-21 00:20:10 +0000] [6] [INFO] Listening at: http://127.0.0.1:8000 (6) [2023-07-21 00:20:10 +0000] [6] [INFO] Using worker: gthread [2023-07-21 00:20:10 +0000] [8] [INFO] Booting worker with pid: 8 [2023-07-21 00:20:10 +0000] [6] [DEBUG] 1 workers [2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /1 [REPRO]: Received HTTP 200 from GET /1 [2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. [2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /2 [2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /2 [2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /3 [REPRO]: Received HTTP 200 from GET /3 [2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. [2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /4 [2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /4 [2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /5 [2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /5 [2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /6 [2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /6 [2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /7 [REPRO]: Received HTTP 200 from GET /7 [2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. [2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /8 [2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /8 [2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /9 [2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /9 [2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /10 [2023-07-21 00:20:15 +0000] [8] [INFO] Autorestarting worker after current request. [2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /10 curl: (56) Recv failure: Connection reset by peer [2023-07-21 00:20:15 +0000] [8] [INFO] Worker exiting (pid: 8) [REPRO]: Received HTTP 000 from GET /11 [2023-07-21 00:20:15 +0000] [22] [INFO] Booting worker with pid: 22 [2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /12 [2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /12 [2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /13 [REPRO]: Received HTTP 200 from GET /13 [2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. [2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /14 [2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /14 [2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /15 [2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /15 [2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /16 [REPRO]: Received HTTP 200 from GET /16 [2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. [2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /17 [2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /17 [2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /18 [2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /18 [2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /19 [REPRO]: Received HTTP 200 from GET /19 [2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. [2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /20 [2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /20 ```

Expected

I'd expect to receive an HTTP 200 for each request, regardless of the max-requests restart. We should see [DEBUG] GET /11 when the worker handles the 11th request.

Actual

The reproduction script sends GET /11, but the worker never sees it, and we see a connection reset instead. The repro script reports a status code of 000, but that's just a quirk of libcurl. I've used tcpdump and can confirm the RST.

In case it's useful, I've also seen curl: (52) Empty reply from server, but it happens less frequently and I'm not 100% sure that it's the same problem.

Workaround

Increasing max-requests makes this happen less frequently, but the only way to resolve it is to disable max-requests (or maybe switch to a different worker type?). Increasing the number of workers or threads doesn't seem to resolve the problem either, from what I've seen.

christianbundy commented 1 year ago

Maybe useful, here's a hacky fix I put together locally: edit workers/gthread.py so that accept() doesn't process accept new requests from the socket after force_close() is called on the worker:

(Apologies for a crappy vim screenshot, I wanted to test this out before sitting down for dinner.)

New output:

``` [2023-07-21 02:51:31 +0000] [641] [INFO] Starting gunicorn 21.2.0 [2023-07-21 02:51:31 +0000] [641] [DEBUG] Arbiter booted [2023-07-21 02:51:31 +0000] [641] [INFO] Listening at: http://127.0.0.1:8000 (641) [2023-07-21 02:51:31 +0000] [641] [INFO] Using worker: gthread [2023-07-21 02:51:31 +0000] [643] [INFO] Booting worker with pid: 643 [2023-07-21 02:51:31 +0000] [641] [DEBUG] 1 workers [2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /1 [REPRO]: Received HTTP 200 from GET /1 [2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /2 [REPRO]: Received HTTP 200 from GET /2 [2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /3 [REPRO]: Received HTTP 200 from GET /3 [2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /4 [REPRO]: Received HTTP 200 from GET /4 [2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /5 [REPRO]: Received HTTP 200 from GET /5 [2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /6 [REPRO]: Received HTTP 200 from GET /6 [2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /7 [REPRO]: Received HTTP 200 from GET /7 [2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /8 [REPRO]: Received HTTP 200 from GET /8 [2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /9 [REPRO]: Received HTTP 200 from GET /9 [2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /10 [2023-07-21 02:51:36 +0000] [643] [INFO] Autorestarting worker after current request. [2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /10 [2023-07-21 02:51:36 +0000] [643] [DEBUG] accept() called when alive=False [2023-07-21 02:51:36 +0000] [643] [INFO] Worker exiting (pid: 643) [2023-07-21 02:51:36 +0000] [656] [INFO] Booting worker with pid: 656 [2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /11 [REPRO]: Received HTTP 200 from GET /11 [2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /12 [REPRO]: Received HTTP 200 from GET /12 [2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /13 [REPRO]: Received HTTP 200 from GET /13 [2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /14 [REPRO]: Received HTTP 200 from GET /14 [2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /15 [REPRO]: Received HTTP 200 from GET /15 [2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connectio n. [2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /16 [REPRO]: Received HTTP 200 from GET /16 [2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /17 [REPRO]: Received HTTP 200 from GET /17 [2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /18 [REPRO]: Received HTTP 200 from GET /18 [2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /19 [REPRO]: Received HTTP 200 from GET /19 [2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. [2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /20 [2023-07-21 02:51:36 +0000] [656] [INFO] Autorestarting worker after current request. [2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /20 ```

christianbundy commented 1 year ago

I've found a related issue, ~which is not resolved by #3039, although it'd be unlikely to be deployed in the wild~: when max_requests >= threads we see the same connection reset error.

For example:

gunicorn --worker-class gthread --max-requests 4 --threads 4 myapp:app

With this config, we can reproduce a consistent connection reset with only five HTTP requests.

``` [2023-07-21 15:59:17 +0000] [7] [DEBUG] Current configuration: config: ./gunicorn.conf.py wsgi_app: None bind: ['127.0.0.1:8000'] backlog: 2048 workers: 1 worker_class: gthread threads: 4 worker_connections: 1000 max_requests: 4 max_requests_jitter: 0 timeout: 30 graceful_timeout: 30 keepalive: 2 limit_request_line: 4094 limit_request_fields: 100 limit_request_field_size: 8190 reload: False reload_engine: auto reload_extra_files: [] spew: False check_config: False print_config: False preload_app: False sendfile: None reuse_port: False chdir: / daemon: False raw_env: [] pidfile: None worker_tmp_dir: None user: 0 group: 0 umask: 0 initgroups: False tmp_upload_dir: None secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'} forwarded_allow_ips: ['127.0.0.1'] accesslog: None disable_redirect_access_to_syslog: False access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s" errorlog: - loglevel: debug capture_output: False logger_class: gunicorn.glogging.Logger logconfig: None logconfig_dict: {} logconfig_json: None syslog_addr: udp://localhost:514 syslog: False syslog_prefix: None syslog_facility: user enable_stdio_inheritance: False statsd_host: None dogstatsd_tags: statsd_prefix: proc_name: None default_proc_name: myapp:app pythonpath: None paste: None on_starting: on_reload: when_ready: pre_fork: post_fork: post_worker_init: worker_int: worker_abort: pre_exec: pre_request: post_request: child_exit: worker_exit: nworkers_changed: on_exit: ssl_context: proxy_protocol: False proxy_allow_ips: ['127.0.0.1'] keyfile: None certfile: None ssl_version: 2 cert_reqs: 0 ca_certs: None suppress_ragged_eofs: True do_handshake_on_connect: False ciphers: None raw_paste_global_conf: [] strip_header_spaces: False [2023-07-21 15:59:17 +0000] [7] [INFO] Starting gunicorn 21.2.0 [2023-07-21 15:59:17 +0000] [7] [DEBUG] Arbiter booted [2023-07-21 15:59:17 +0000] [7] [INFO] Listening at: http://127.0.0.1:8000 (7) [2023-07-21 15:59:17 +0000] [7] [INFO] Using worker: gthread [2023-07-21 15:59:17 +0000] [9] [INFO] Booting worker with pid: 9 [2023-07-21 15:59:17 +0000] [7] [DEBUG] 1 workers [2023-07-21 15:59:22 +0000] [9] [DEBUG] GET /1 [2023-07-21 15:59:22 +0000] [9] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /1 [2023-07-21 15:59:22 +0000] [9] [DEBUG] GET /2 [2023-07-21 15:59:22 +0000] [9] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /2 [2023-07-21 15:59:22 +0000] [9] [DEBUG] GET /3 [2023-07-21 15:59:22 +0000] [9] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /3 [2023-07-21 15:59:22 +0000] [9] [DEBUG] GET /4 [2023-07-21 15:59:22 +0000] [9] [INFO] Autorestarting worker after current request. [2023-07-21 15:59:22 +0000] [9] [DEBUG] Closing connection. [REPRO]: Received HTTP 200 from GET /4 curl: (56) Recv failure: Connection reset by peer [2023-07-21 15:59:22 +0000] [9] [INFO] Worker exiting (pid: 9) [REPRO]: Received HTTP 000 from GET /5 ```

christianbundy commented 1 year ago

After a git bisect, it looks like the culprit is 0ebb73aa240f0ecffe3e0922d54cfece19f5bfed (https://github.com/benoitc/gunicorn/pull/2918).

MANT5149 commented 1 year ago

We are seeing this when making API calls to NetBox as well. gunicorn version is 21.2.0. NGINX logs randomly show an upstream error "104: Connection reset by peer" which we correlate with "Autorestarting worker after current request." in gunicorn logs.

karlism commented 1 year ago

@MANT5149, I'm in the same boat as well, after NetBox upgrade to 3.5.7 (which includes gunicorn 21.2.0) we're seeing same issue when autorestart is happening:

Aug 10 11:40:43 hostname gunicorn[595183]: [2023-08-10 11:40:43 +0000] [595183] [INFO] Autorestarting worker after current request.
Aug 10 11:40:43 hostname gunicorn[595183]: [2023-08-10 11:40:43 +0000] [595183] [INFO] Worker exiting (pid: 595183)
Aug 10 11:40:44 hostname gunicorn[613565]: [2023-08-10 11:40:44 +0000] [613565] [INFO] Booting worker with pid: 613565
Aug 10 12:33:36 hostname gunicorn[622501]: [2023-08-10 12:33:36 +0000] [622501] [INFO] Autorestarting worker after current request.
Aug 10 12:33:36 hostname gunicorn[622501]: [2023-08-10 12:33:36 +0000] [622501] [INFO] Worker exiting (pid: 622501)
Aug 10 12:33:36 hostname gunicorn[639160]: [2023-08-10 12:33:36 +0000] [639160] [INFO] Booting worker with pid: 639160
Aug 10 13:00:06 hostname gunicorn[579373]: [2023-08-10 13:00:06 +0000] [579373] [INFO] Autorestarting worker after current request.
Aug 10 13:00:06 hostname gunicorn[579373]: [2023-08-10 13:00:06 +0000] [579373] [INFO] Worker exiting (pid: 579373)
Aug 10 13:00:07 hostname gunicorn[648814]: [2023-08-10 13:00:07 +0000] [648814] [INFO] Booting worker with pid: 648814

Nginx error log:

2023/08/10 11:40:43 [error] 1092#0: *744453 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.xxx.xxx, server: netbox.example.com, request: "GET /api/tenancy/tenants/19/ HTTP/1.1", upstream: "http://127.0.0.1:8001/api/tenancy/tenants/19/", host: "netbox-api.example.com"
2023/08/10 12:33:36 [error] 1092#0: *776456 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 192.168.xxx.xxx, server: netbox.example.com, request: "GET /api/virtualization/clusters/46/ HTTP/1.1", upstream: "http://127.0.0.1:8001/api/virtualization/clusters/46/", host: "netbox-api.example.com"
2023/08/10 13:00:06 [error] 1092#0: *787694 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.xxx.xxx, server: netbox.example.com, request: "GET /api/tenancy/tenants/19/ HTTP/1.1", upstream: "http://127.0.0.1:8001/api/tenancy/tenants/19/", host: "netbox-api.example.com"

Downgrading Gunicorn to 20.1.0 fixes this issue.

christianbundy commented 1 year ago

Have y'all tried the patch in #3039?

benoitc commented 1 year ago

thanks for the report. I'ts expected to have the worker restart there. If a request already land in the worker when its closing that may happen. Maybe latest change accepting the requests faster trigger it. Can you give an idee of the number of concurrent request gunicorn is receiving also in usial case?

Beside that why do you put a max request so low? This function should only be used as a work around some temporarymemory issues in the application.

christianbundy commented 1 year ago

The worker restart is expected, but after 0ebb73a (#2918) every request runs through the while self.alive loop twice:

the first time through the loop it calls accept() on the socket to create the connection
the second time through the loop it calls recv() on the connection to handle the request and provide a response

This means that during a max-requests restart, we call accept() on requests but never recv(), which means that they aren't added to self.futures to be awaited during the graceful timeout.

Before the change, if you sent two requests around the same time you'd see:

self.alive is True so we:
1. accept() request A
2. recv() request A, which sets self.alive = False in handle_request
3. worker restart
self.alive is False, so we exit the loop and restart the worker
self.alive is True, so we:
1. accept() request B
2. recv() request B

After the change, this becomes:

self.alive is True so we accept() request A
self.alive is True so we:
1. recv() request A, which sets self.alive = False in handle_request
2. accept() request B
self.alive is False, so we exit the loop and restart the worker (‼️ without handling request B ‼️)

Can you give an idee of the number of concurrent request gunicorn is receiving also in usial case?

This bug only requires two concurrent requests to a worker, but often I'll often have ~4 concurrent requests per worker and this bug means that 1 will be completed and the rest have their connections reset. ~To highlight the impact, this means that setting max-requests at 10k and averaging 4 concurrent requests will cause 0.03% of HTTP requests to fail.~ EDIT: I've included a benchmark in #3039 instead of speculating.

Beside that why do you put a max request so low? This function should only be used as a work around some temporarymemory issues in the application.

That's a minimal reproducible example, not the real application + configuration I'm using in production. I'm using a range from 1,000 to 1,000,000 depending on the exact application / deployment, and completely agree that --max-requests 10 is inappropriately low for most scenarios.

christianbundy commented 1 year ago

@benoitc I just added a benchmark to #3039 showing an error rate of 0.09% when we restart every 1024 requests, in case that's helpful.

karlism commented 1 year ago

Have y'all tried the patch in https://github.com/benoitc/gunicorn/pull/3039?

@christianbundy, I've tried it and it appears to fix this issue for me

r-lindner commented 1 year ago

3039 fixes this issue BUT for me the idle load goes up from 0.x to 5!

christianbundy commented 1 year ago

0.x to 5

Are these CPU utilization percentages? i.e. CPU load goes increases from <1% to 5%? This makes sense because previously we were blocking and waiting on IO for two seconds while waiting for new connections, but now we're looping to check for either 'new connections' or 'new data on a connection'.

r-lindner commented 1 year ago

no, 5 means 5 cpu cores are 100% utilized (500%), I have 5 workers configured. The arrows point to bumps where I ran a script that fires many requests against a gunicorn server.

netbox_load

christianbundy commented 1 year ago

Thanks for the context @r-lindner -- admittedly I don't think I understand the difference between 'fix from #3038' and 'fix from #3039', but I can see that the CPU utilization is significantly higher.

I've just pushed a commit to my PR branch that resolves the issue on my machine, can you confirm whether it works for you? It's a significantly different approach, which I haven't tested as thoroughly, but it seems to resolve the connection resets and also keeps the CPU utilization low.

Before: idles at ~10% CPU

Screenshot 2023-08-26 at 1 28 38 PM

After: idles at ~0.01% CPU

Screenshot 2023-08-26 at 1 28 54 PM

r-lindner commented 1 year ago

Maybe useful, here's a hacky fix I put together locally: edit workers/gthread.py so that accept() doesn't process accept new requests from the socket after force_close() is called on the worker:

(Apologies for a crappy vim screenshot, I wanted to test this out before sitting down for dinner.)

New output:

this is the 'fix from #3038' and the 'fix from #3039' was the pull request with out the changes from Aug 26. I am now using the updated #3039 without CPU issues. Due to changes I made a week ago I cannot test if the original bug is fixed but I guess you already tested this :-) So this looks good.

zdicesare commented 1 year ago

When fronting gunicorn 20.1.0 with nginx 1.23.3, we observe "connection reset by peer" errors in Nginx that correlate with gthread worker auto restarts.

https://github.com/benoitc/gunicorn/issues/1236 seems related, which describes an issue specifically with keepalive connections. That issue is older and I am unsure of the current state, but this comment implies an ongoing issue. Note the original reproduction steps in this issue, 3038, have keepalive enabled by default.

When we disable keepalives in gunicorn, we observe a latency regression but it does stop the connection reset errors.

Should there be documented guidance, for now, not to use --max-requests with keepalive + gthread workers?

As far as I can see, options for consumers are:

Disable keepalive if using gthread workers + with max requests
Possibly use gevent workers instead

zogzog commented 1 year ago

We face this issue exactly as described. Thanks for the reporting and ongoing work on this. Is there an ETA for the publication of a fix ?

jessielw commented 11 months ago

Same issue here. I swapped to gthread workers from sync, and randomly, my server just stopped taking requests.

Reverted back to sync for now.

v0tti commented 11 months ago

We are also running into this issue after a NetBox upgrade. Downgrading gunicorn to 20.1.0 fixes it for the moment but a proper fix would be appreciated.

nomad-cyanide commented 10 months ago

We are also running into this problem after upgrading Netbox from 3.4.8 to 3.6.9, which makes gunicorn go from 20.1.0 to 21.2.0.

One of the heavier scripts works flawlessly on Netbox 3.4.8 (gunicorn 20.1.0), but on 3.6.9 (gunicorn 21.2.0) it fails with the below message and it has not failed at the exact same place twice:

Traceback (most recent call last): File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 394, in main() File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 309, in main service_entry = nb_update_vs(nb_vs,sat_gw_ip,sat_gw_name,community_name,community_type,interop_hack_ip) File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 124, in nb_update_vs nb_service_entry_ip = str(nb_service_entry.ipaddresses[0]).split('/')[0] File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 327, in str getattr(self, "name", None) File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 303, in getattr if self.full_details(): File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 459, in full_details self._parse_values(next(req.get())) File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/query.py", line 291, in get req = self._make_call(add_params=add_params) File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/query.py", line 258, in _make_call raise RequestError(req) pynetbox.core.query.RequestError: The request failed with code 502 Bad Gateway but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.

/var/log/nginx/error.log:

2024/01/08 14:27:35 [error] 919#919: *110908 recv() failed (104: Unknown error) while reading response header from upstream, client: 10.10.10.74, server: netbox-test.domain.test, request: "GET /api/ipam/ip-addresses/24202/ HTTP/1.1", upstream: "http://10.10.10.5:8001/api/ipam/ip-addresses/24202/", host: "netbox-test.domain.test"

gunicorn log:

Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Autorestarting worker after current request. Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Worker exiting (pid: 1129991) Jan 08 14:39:26 30001vmnb02-prod gunicorn[1139845]: [2024-01-08 14:39:26 +0100] [1139845] [INFO] Booting worker with pid: 1139845 Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Autorestarting worker after current request. Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Worker exiting (pid: 1129962) Jan 08 14:44:11 30001vmnb02-prod gunicorn[1139926]: [2024-01-08 14:44:11 +0100] [1139926] [INFO] Booting worker with pid: 1139926

Versions: Netbox: 3.6.9 Python: 3.10.12 Redis version=6.0.16 nginx version: nginx/1.18.0 (Ubuntu) psql (PostgreSQL) 14.10 (Ubuntu 14.10-0ubuntu0.22.04.1) gunicorn (version 21.2.0) pynetbox: 7.3.3

Linux vmnb02-test 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Is there a release underway to fix this or should we still refrain from upgrading? Current state of gunicorn dictates that it is not production-worthy. :(

nomad-cyanide commented 10 months ago

We are also running into this problem after upgrading Netbox from 3.4.8 to 3.6.9, which makes gunicorn go from 20.1.0 to 21.2.0.

One of the heavier scripts works flawlessly on Netbox 3.4.8 (gunicorn 20.1.0), but on 3.6.9 (gunicorn 21.2.0) it fails with the below message and it has not failed at the exact same place twice:

Traceback (most recent call last): File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 394, in main() File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 309, in main service_entry = nb_update_vs(nb_vs,sat_gw_ip,sat_gw_name,community_name,community_type,interop_hack_ip) File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 124, in nb_update_vs nb_service_entry_ip = str(nb_service_entry.ipaddresses[0]).split('/')[0] File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 327, in str getattr(self, "name", None) File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 303, in getattr if self.full_details(): File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 459, in full_details self._parse_values(next(req.get())) File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/query.py", line 291, in get req = self._make_call(add_params=add_params) File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/query.py", line 258, in _make_call raise RequestError(req) pynetbox.core.query.RequestError: The request failed with code 502 Bad Gateway but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.

/var/log/nginx/error.log:

2024/01/08 14:27:35 [error] 919#919: *110908 recv() failed (104: Unknown error) while reading response header from upstream, client: 10.10.10.74, server: netbox-test.domain.test, request: "GET /api/ipam/ip-addresses/24202/ HTTP/1.1", upstream: "http://10.10.10.5:8001/api/ipam/ip-addresses/24202/", host: "netbox-test.domain.test"

gunicorn log:

Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Autorestarting worker after current request. Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Worker exiting (pid: 1129991) Jan 08 14:39:26 30001vmnb02-prod gunicorn[1139845]: [2024-01-08 14:39:26 +0100] [1139845] [INFO] Booting worker with pid: 1139845 Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Autorestarting worker after current request. Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Worker exiting (pid: 1129962) Jan 08 14:44:11 30001vmnb02-prod gunicorn[1139926]: [2024-01-08 14:44:11 +0100] [1139926] [INFO] Booting worker with pid: 1139926

Versions: Netbox: 3.6.9 Python: 3.10.12 Redis version=6.0.16 nginx version: nginx/1.18.0 (Ubuntu) psql (PostgreSQL) 14.10 (Ubuntu 14.10-0ubuntu0.22.04.1) gunicorn (version 21.2.0) pynetbox: 7.3.3

Linux vmnb02-test 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Is there a release underway to fix this or should we still refrain from upgrading? Current state of gunicorn dictates that it is not production-worthy. :(

I have been testing a few things and this is my finding. It doesn't help if I decrease the max_requests setting. It just fails sooner. it doesn't help if I increase the max_requests setting either. It still fails at some point. When it fails a log-entry about restarting shows up simultaneously:

Example:

an 09 13:20:07 30001vmnb02-test gunicorn[106082]: [2024-01-09 13:20:07 +0100] [106082] [INFO] Autorestarting worker after current request. Jan 09 13:20:07 30001vmnb02-test gunicorn[106082]: [2024-01-09 13:20:07 +0100] [106082] [INFO] Worker exiting (pid: 106082) Jan 09 13:20:08 30001vmnb02-test gunicorn[106256]: [2024-01-09 13:20:08 +0100] [106256] [INFO] Booting worker with pid: 106256

If I set the max_requests to 0, disabling it, my scripts work. without error. But is this preferable to having gunicorn processes restart regularly. I suppose it would start consuming memory, if it has memory-leak errors that is.

Perhaps a scheduled restart of the Netbox and netbox-rq services (thereby restarting gunicorn worker processes) once a day would do the trick?

nomad-cyanide commented 10 months ago

If I set the max_requests to 0, disabling it, my scripts work. without error. But is this preferable to having gunicorn processes restart regularly. I suppose it would start consuming memory, if it has memory-leak errors that is.

Perhaps a scheduled restart of the Netbox and netbox-rq services (thereby restarting gunicorn worker processes) once a day would do the trick?

I have come to the conclusion, that rather than downgrade gunicorn of maybe loose some necessary features, I will go ahead with max_requests set i 0 and if memory usage becomes an issue on the server I will set up a scheduled job that restarts the worker processes with this command:

ps -aux | grep venv/bin/gunicorn | grep Ss | awk '{ system("kill -HUP " $2 )}'

benoitc commented 10 months ago

Just don't pass the max_request option? i never use it myself. It's only when i have a temporary memory leak and never in production.

Le jeu. 11 janv. 2024 à 13:13, Ian S. @.***> a écrit :

If I set the max_requests to 0, disabling it, my scripts work. without error. But is this preferable to having gunicorn processes restart regularly. I suppose it would start consuming memory, if it has memory-leak errors that is.

Perhaps a scheduled restart of the Netbox and netbox-rq services (thereby restarting gunicorn worker processes) once a day would do the trick?

I have come to the conclusion, that rather than downgrade gunicorn of maybe loose some necessary features, I will go ahead with max_requests set i 0 and if memory usage becomes an issue on the server I will set up a scheduled job that restarts the worker processes with this command:

ps -aux | grep venv/bin/gunicorn | grep Ss | awk '{ system("kill -HUP " $2 )}'

— Reply to this email directly, view it on GitHub https://github.com/benoitc/gunicorn/issues/3038#issuecomment-1887034359, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAADRIW3Y346USVNBLVBBXLYN7JP7AVCNFSM6AAAAAA2SEKTSKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBXGAZTIMZVHE . You are receiving this because you were mentioned.Message ID: @.***>

zogzog commented 9 months ago

For the record, there exists a variety of situation where the memory leaks are difficult to address:

they're costly to track and fix (in our code bases, provided it comes from there)
they may be in third-party libs

rajivramanathan commented 7 months ago

We stayed on a lower release version to avoid this issue. However, we have to upgrade due to HTTP Request Smuggling (CVE-2024-1135) vulnerability. Is there anyone able to successfully workaround this issue (short of turning off max-requests)?

benoitc commented 6 months ago

@rajivramanathan don't use max-requests? Max requests is there for the worst casse when your application leaks.

zogzog commented 6 months ago

Benoît, I think we understand your advice, but many apps may find themselves in the "application leaks and we can't do much about it" place, hence the usefulness of max-requests.

hansukyang commented 6 months ago

We stayed on a lower release version to avoid this issue. However, we have to upgrade due to HTTP Request Smuggling (CVE-2024-1135) vulnerability. Is there anyone able to successfully workaround this issue (short of turning off max-requests)?

We have NGINX in front of Gunicorn so we addressed it by setting up multiple instances of Gunicorn running upstream listening to different ports and using http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream configuration to try next upstream if we encounter 502 error.

martinkirch-regmind commented 2 months ago

Hello, In our setting max-request is very helpful too, and we ran into the same issue when upgrading to gunicorn>=21. I wanted to mention that #3157 fixes it !

zogzog commented 2 months ago

Just for the record, there exists a situation where the Python default memory allocator produces (under specific circumstances) very fragmented arenas which leads to the interpreter not giving back unused memory - and this might be our case. Using jemalloc (see https://zapier.com/engineering/celery-python-jemalloc/) may alleviate the issue. We are considering this, too. However if https://github.com/benoitc/gunicorn/pull/3157 is green, we will be happy to keep using max-requests.

charlesoconor commented 2 weeks ago

Have run into this in my production so would love to see the patch merged. Have pulled it in and have seen a reduction in 503s from the connection reset.

benoitc / gunicorn

Connection reset during max-requests auto-restart with gthread #3038

Application

Configuration

Reproduction

Quickstart

Example

Logs

Expected

Actual

Workaround

3039 fixes this issue BUT for me the idle load goes up from 0.x to 5!