Truncated or oversized response headers received from daemon process (Apache, Python 3.11, hgweb)

myieye commented 7 months ago

Hi there,

I'd be very grateful for any help someone may have to offer us on this!

We've built our own mercurial server around hgweb and are using wsgi. Usually everything works fine, but a couple times a week we run into this error:

Truncated or oversized response headers received from daemon process 'hgweb'

As a result the hgweb process dies and we end up with a corrupted mercurial repository that requires some work.

Here's the gist of what we're using: Apache/2.4.58 (Unix) mod_wsgi/4.9.4 Python/3.11

And our Apache config:

LoadModule wsgi_module /usr/lib/apache2/modules/mod_wsgi.so
WSGIPythonOptimize 0
WSGILazyInitialization On
WSGISocketPrefix logs/wsgi
LogLevel debug
ServerName localhost
<VirtualHost *>
    LogLevel info
# defaults from docs
    WSGIDaemonProcess hgweb processes=1 threads=15 display-name=%{GROUP} \
    queue-timeout=45 socket-timeout=60 connect-timeout=15 request-timeout=600 \
    inactivity-timeout=0 deadlock-timeout=15 graceful-timeout=15 \
    eviction-timeout=0 restart-interval=0 shutdown-timeout=5 maximum-requests=0

    WSGIProcessGroup hgweb

    WSGIScriptAlias /hg /usr/local/www/wsgi-scripts/hgweb.wsgi
    <Directory /usr/local/www/wsgi-scripts>
        <IfVersion < 2.4>
            Order allow,deny
            Allow from all
        </IfVersion>
        <IfVersion >= 2.4>
            Require all granted
        </IfVersion>
    </Directory>

    ScriptAliasMatch "^/command/(.*)" "/usr/local/www/commands/command-runner.sh/$1"
    <Directory /usr/local/www/commands/>
        Options +ExecCGI
        AddHandler cgi-script .sh
        Require all granted
    </Directory>
</VirtualHost>

My impression is that most users that encounter this error have it consistently/can't use wsgi at all, but for us it's very unpredictable and so far we haven't found a way to reproduce it yet. So, I don't expect it has anything to do with the python packages we're using and unfortunately we have no idea how to debug it.

We've now instrumented wsgi as well in the hope that that would help us uncover the root cause, but that hasn't proved to be very meaningful.

Here's a larger log snippet. As you can see we have Apache instrumented with Open Telemetry:

[Wed Feb 21 14:47:14.000296 2024] [wsgi:error] [pid 137:tid 140153817757376] [client 10.42.6.14:33376] Truncated or oversized response headers received from daemon process 'hgweb': /usr/local/www/wsgi-scripts/hgweb.wsgi
[Wed Feb 21 14:47:14.000291 2024] [wsgi:error] [pid 12:tid 140153784186560] (104)Connection reset by peer: [client 10.42.6.14:33382] mod_wsgi (pid=12): Failed to proxy response from daemon.
10.42.6.14 - - [21/Feb/2024:14:47:08 +0000] "POST /hg/repo-x?cmd=unbundle HTTP/1.1" 500 528
10.42.6.14 - - [21/Feb/2024:14:47:13 +0000] "GET /hg/repo-y?cmd=getbundle HTTP/1.1" 200 131929
[Wed Feb 21 14:47:14.025915 2024] [otel_apache:error] [pid 12:tid 140153784186560] mod_opentelemetry_webserver_sdk: otel_hook_log_transaction_end: request end successful (HTTP status=200)
[Wed Feb 21 14:47:14.025926 2024] [otel_apache:error] [pid 137:tid 140153817757376] mod_opentelemetry_webserver_sdk: otel_hook_log_transaction_end: request end successful (HTTP status=500)
[Wed Feb 21 14:47:14.026643 2024] [otel_apache:error] [pid 12:tid 140153356015296] mod_opentelemetry_webserver_sdk: initialize_opentelemetry: config retrieved from process memory pool
[Wed Feb 21 14:47:14.026667 2024] [otel_apache:error] [pid 12:tid 140153356015296] mod_opentelemetry_webserver_sdk: otel_hook_header_parser_begin: lexbox:hgweb:v2024-02-14-0ac6cbc6
[Wed Feb 21 14:47:14.028309 2024] [otel_apache:error] [pid 12:tid 140153356015296] mod_opentelemetry_webserver_sdk: otel_hook_header_parser_begin: request begin successful
[Wed Feb 21 14:47:14.036986 2024] [otel_apache:error] [pid 12:tid 140153356015296] mod_opentelemetry_webserver_sdk: otel_stopInteraction: result code: 25
[Wed Feb 21 14:47:14.037111 2024] [otel_apache:error] [pid 12:tid 140153356015296] mod_opentelemetry_webserver_sdk: otel_payload_decorator: correlation information : "tracestate" : ""
[Wed Feb 21 14:47:14.037127 2024] [otel_apache:error] [pid 12:tid 140153356015296] mod_opentelemetry_webserver_sdk: otel_payload_decorator: correlation information : "traceparent" : "00-46d991803316d918f9b8088c91ac08b8-e11df2bd1b0e55f8-01"
[Wed Feb 21 14:47:14.037131 2024] [otel_apache:error] [pid 12:tid 140153356015296] mod_opentelemetry_webserver_sdk: mod_cgid.c: interaction begin successful
[Wed Feb 21 14:47:14.038449 2024] [otel_apache:error] [pid 12:tid 140153356015296] mod_opentelemetry_webserver_sdk: otel_payload_decorator: correlation information : "tracestate" : ""
[Wed Feb 21 14:47:14.038461 2024] [otel_apache:error] [pid 12:tid 140153356015296] mod_opentelemetry_webserver_sdk: otel_payload_decorator: correlation information : "traceparent" : "00-46d991803316d918f9b8088c91ac08b8-d565d666afee5b9e-01"
[Wed Feb 21 14:47:14.038464 2024] [otel_apache:error] [pid 12:tid 140153356015296] mod_opentelemetry_webserver_sdk: mod_wsgi.c: interaction begin successful
[Wed Feb 21 14:47:14.766353 2024] [wsgi:info] [pid 1:tid 140153902012288] mod_wsgi (pid=1588): Process 'hgweb' has died, deregister and restart it.
[Wed Feb 21 14:47:14.766399 2024] [wsgi:info] [pid 1:tid 140153902012288] mod_wsgi (pid=1588): Process 'hgweb' terminated by signal 9
[Wed Feb 21 14:47:14.766401 2024] [wsgi:info] [pid 1:tid 140153902012288] mod_wsgi (pid=1588): Process 'hgweb' has been deregistered and will no longer be monitored.

Lines 3 and 4 are requests to two different repos. And that seems to be a bit of a pattern. Is it at all possible that the header-buffer can be affected my simultaneous requests?

Here are some more occurences:

[Sat Feb 24 02:03:02.259434 2024] [wsgi:error] [pid 137:tid 140153012078272] [client 10.42.6.14:41134] Truncated or oversized response headers received from daemon process 'hgweb': /usr/local/www/wsgi-scripts/hgweb.wsgi
[Sat Feb 24 02:03:02.258689 2024] [wsgi:error] [pid 12:tid 140152408098496] (104)Connection reset by peer: [client 10.42.6.14:33748] mod_wsgi (pid=12): Failed to proxy response from daemon.
10.42.6.14 - - [24/Feb/2024:01:57:27 +0000] "GET /hg/repo-a?cmd=getbundle HTTP/1.1" 200 31828672
10.42.6.14 - - [24/Feb/2024:02:02:58 +0000] "POST /hg/repo-b?cmd=unbundle HTTP/1.1" 500 528

[Mon Feb 26 12:41:57.757908 2024] [wsgi:error] [pid 12:tid 140153322444480] (104)Connection reset by peer: [client 10.42.6.14:46248] mod_wsgi (pid=12): Failed to proxy response from daemon.
[Mon Feb 26 12:41:57.757909 2024] [wsgi:error] [pid 137:tid 140152290666176] [client 10.42.6.14:35910] Truncated or oversized response headers received from daemon process 'hgweb': /usr/local/www/wsgi-scripts/hgweb.wsgi
10.42.6.14 - - [26/Feb/2024:12:40:41 +0000] "GET /hg/repo-c?cmd=getbundle HTTP/1.1" 200 15873095
10.42.6.14 - - [26/Feb/2024:12:41:57 +0000] "POST /hg/repo-d?cmd=unbundle HTTP/1.1" 500 528

But this occurence doesn't follow that pattern. At least not as strictly:

10.42.6.14 - - [27/Feb/2024:09:39:12 +0000] "GET /hg/repo-g?cmd=listkeys HTTP/1.1" 200 -
[Tue Feb 27 09:39:12.529821 2024] [otel_apache:error] [pid 137:tid 140153012078272] mod_opentelemetry_webserver_sdk: otel_hook_log_transaction_end: request end successful (HTTP status=200)
[Tue Feb 27 09:39:14.577683 2024] [otel_apache:error] [pid 137:tid 140153037256384] mod_opentelemetry_webserver_sdk: initialize_opentelemetry: config retrieved from process memory pool
[Tue Feb 27 09:39:14.577728 2024] [otel_apache:error] [pid 137:tid 140153037256384] mod_opentelemetry_webserver_sdk: otel_hook_header_parser_begin: lexbox:hgweb:v2024-02-14-0ac6cbc6
[Tue Feb 27 09:39:14.578087 2024] [otel_apache:error] [pid 137:tid 140153037256384] mod_opentelemetry_webserver_sdk: otel_hook_header_parser_begin: request begin successful
[Tue Feb 27 09:39:14.578138 2024] [otel_apache:error] [pid 137:tid 140153037256384] mod_opentelemetry_webserver_sdk: otel_stopInteraction: result code: 25
[Tue Feb 27 09:39:14.578217 2024] [otel_apache:error] [pid 137:tid 140153037256384] mod_opentelemetry_webserver_sdk: otel_payload_decorator: correlation information : "tracestate" : ""
[Tue Feb 27 09:39:14.578225 2024] [otel_apache:error] [pid 137:tid 140153037256384] mod_opentelemetry_webserver_sdk: otel_payload_decorator: correlation information : "traceparent" : "00-50511174d079136b6936b5cfe7589764-aa5c57bdec0609a7-01"
[Tue Feb 27 09:39:14.578227 2024] [otel_apache:error] [pid 137:tid 140153037256384] mod_opentelemetry_webserver_sdk: mod_cgid.c: interaction begin successful
[Tue Feb 27 09:39:14.578288 2024] [otel_apache:error] [pid 137:tid 140153037256384] mod_opentelemetry_webserver_sdk: otel_payload_decorator: correlation information : "tracestate" : ""
[Tue Feb 27 09:39:14.578291 2024] [otel_apache:error] [pid 137:tid 140153037256384] mod_opentelemetry_webserver_sdk: otel_payload_decorator: correlation information : "traceparent" : "00-50511174d079136b6936b5cfe7589764-2fa7a4b9a4c053e1-01"
[Tue Feb 27 09:39:14.578293 2024] [otel_apache:error] [pid 137:tid 140153037256384] mod_opentelemetry_webserver_sdk: mod_wsgi.c: interaction begin successful
[Tue Feb 27 09:39:16.315745 2024] [wsgi:error] [pid 137:tid 140153037256384] [client 10.42.6.14:47428] Truncated or oversized response headers received from daemon process 'hgweb': /usr/local/www/wsgi-scripts/hgweb.wsgi
[Tue Feb 27 09:39:16.336861 2024] [otel_apache:error] [pid 137:tid 140153037256384] mod_opentelemetry_webserver_sdk: otel_hook_log_transaction_end: request end successful (HTTP status=500)
[Tue Feb 27 09:39:16.338527 2024] [otel_apache:error] [pid 137:tid 140152995292864] mod_opentelemetry_webserver_sdk: initialize_opentelemetry: config retrieved from process memory pool
[Tue Feb 27 09:39:16.338549 2024] [otel_apache:error] [pid 137:tid 140152995292864] mod_opentelemetry_webserver_sdk: otel_hook_header_parser_begin: lexbox:hgweb:v2024-02-14-0ac6cbc6
[Tue Feb 27 09:39:16.339374 2024] [otel_apache:error] [pid 137:tid 140152995292864] mod_opentelemetry_webserver_sdk: otel_hook_header_parser_begin: request begin successful
10.42.6.14 - - [27/Feb/2024:09:39:14 +0000] "POST /hg/repo-h?cmd=unbundle HTTP/1.1" 500 528
[Tue Feb 27 09:39:16.346131 2024] [otel_apache:error] [pid 137:tid 140152995292864] mod_opentelemetry_webserver_sdk: otel_stopInteraction: result code: 25
[Tue Feb 27 09:39:16.346242 2024] [otel_apache:error] [pid 137:tid 140152995292864] mod_opentelemetry_webserver_sdk: otel_payload_decorator: correlation information : "tracestate" : ""
[Tue Feb 27 09:39:16.346251 2024] [otel_apache:error] [pid 137:tid 140152995292864] mod_opentelemetry_webserver_sdk: otel_payload_decorator: correlation information : "traceparent" : "00-50511174d079136b6936b5cfe7589764-e5f066c5679610a2-01"
[Tue Feb 27 09:39:16.346254 2024] [otel_apache:error] [pid 137:tid 140152995292864] mod_opentelemetry_webserver_sdk: mod_cgid.c: interaction begin successful
[Tue Feb 27 09:39:16.347597 2024] [otel_apache:error] [pid 137:tid 140152995292864] mod_opentelemetry_webserver_sdk: otel_payload_decorator: correlation information : "tracestate" : ""
[Tue Feb 27 09:39:16.347608 2024] [otel_apache:error] [pid 137:tid 140152995292864] mod_opentelemetry_webserver_sdk: otel_payload_decorator: correlation information : "traceparent" : "00-50511174d079136b6936b5cfe7589764-8c9449f83e75d298-01"
[Tue Feb 27 09:39:16.347611 2024] [otel_apache:error] [pid 137:tid 140152995292864] mod_opentelemetry_webserver_sdk: mod_wsgi.c: interaction begin successful
[Tue Feb 27 09:39:17.018221 2024] [wsgi:info] [pid 1:tid 140153902012288] mod_wsgi (pid=2416): Process 'hgweb' has died, deregister and restart it.
[Tue Feb 27 09:39:17.018257 2024] [wsgi:info] [pid 1:tid 140153902012288] mod_wsgi (pid=2416): Process 'hgweb' terminated by signal 9
[Tue Feb 27 09:39:17.018261 2024] [wsgi:info] [pid 1:tid 140153902012288] mod_wsgi (pid=2416): Process 'hgweb' has been deregistered and will no longer be monitored.
[Tue Feb 27 09:39:17.023672 2024] [wsgi:info] [pid 2445:tid 140153902012288] mod_wsgi (pid=2445): Starting process 'hgweb' with uid=33, gid=33 and threads=15.
[Tue Feb 27 09:39:17.028843 2024] [wsgi:info] [pid 2445:tid 140153902012288] mod_wsgi (pid=2445): Initializing Python.
[Tue Feb 27 09:39:17.101503 2024] [wsgi:info] [pid 2445:tid 140153902012288] mod_wsgi (pid=2445): Attach interpreter ''.
[Tue Feb 27 09:39:17.131731 2024] [wsgi:info] [pid 2445:tid 140153826150080] mod_wsgi (pid=2445): Create interpreter 'localhost:8088|/hg'.
[Tue Feb 27 09:39:17.136913 2024] [wsgi:info] [pid 2445:tid 140153826150080] [remote 10.42.6.14:43624] mod_wsgi (pid=2445, process='hgweb', application='localhost:8088|/hg'): Loading Python script file '/usr/local/www/wsgi-scripts/hgweb.wsgi'.
10.42.6.14 - - [27/Feb/2024:09:39:16 +0000] "GET /hg/repo-j/log?style=json-lex&rev=tip HTTP/1.1" 200 429

GrahamDumpleton commented 7 months ago

For a start, you should force use of the main Python interpreter context as various Python packages which use C extensions do not work properly in Python sub interpreters, which can result in deadlocks or crashes. So set:

WSGIApplicationGroup %{GLOBAL}

See:

https://modwsgi.readthedocs.io/en/master/user-guides/application-issues.html#python-simplified-gil-state-api

Two other reasons why you may see daemon processes crash are:

System OOM (out of memory) process killer on Linux systems is killing the daemon processes.
System logrotate service forcibly restarting Apache periodically rather than allowing Apache to rely on its own log file rotation mechanism which doesn't cause process restarts.

So check how much memory the daemon processes are using and whether system is getting low on memory.

Check the main Apache error log file (not virtual host specific logs), for any messages which suggest daemon processes crashed due to core dump/segmentation fault.

hahn-kev commented 7 months ago

Thanks Graham, I'm on the same team as myieye. Thanks for your help.

I'll go ahead and set the application group. But if that was the case wouldn't the crash be consistent?

I don't think it's an OOM issue as we're running this in k8s and we have a memory limit set to 400mb, if it goes over that limit then the pod would be restarted and we would see a warning about that, I don't think we've ever had the pod restarted so I don't think it's out of memory.

Since this is a docker container based on httpd:2.4-bookworm I'm fairly certain logrotate isn't running, it's not even installed.

GrahamDumpleton commented 7 months ago

Crashes when using C extension based Python packages in a sub interpreter, when they are not designed to be, need not be consistent and reproducible. All depends on what the C extension does. So try setting WSGIApplicationGroup as explained as first step and then can see how things go.

hahn-kev commented 7 months ago

It looks like this was an out of memory issue, we increased the memory available on that k8s pod and it seems to be working better.

GrahamDumpleton commented 3 weeks ago

Closing this issue at this point.

Note that another reason one may see the original error messages is if request-timeout feature had been triggered. In this case you should also see messages in the logs indicating a request timeout occurred.

GrahamDumpleton / mod_wsgi

Truncated or oversized response headers received from daemon process (Apache, Python 3.11, hgweb) #876