kflynn closed this issue 6 years ago
@kflynn Did you try to reproduce this issue with an older SHA? I was wondering if this is related https://github.com/envoyproxy/envoy/commit/1db0c7ace77263c554d18dfffa21ea2b5ea6f3b9
Sorry, should've specified that I built from a branch merged only up to 0ab49ee302b933e7da875405269d33b16e4e0bd4 on 15 February, so not a factor, @gsagula.
Also, we've reproduced it entirely within Docker at this point...
To replicate this entirely inside a Docker container:
Terminal 1
$ docker run --rm -it --entrypoint sh --name conndrop dwflynn/ambassador:pidlog
/application # wget -q https://github.com/envoyproxy/envoy/files/1803492/envoy-report.zip
/application # unzip -q envoy-report.zip
/application # cd envoy-report/
/application/envoy-report # # Point cluster_qotm to the admin service
/application/envoy-report # sed -i.orig -e s,tcp://qotm:80,tcp://127.0.0.1:8001, envoy-1.json
/application/envoy-report # cp envoy-1.json /etc/
/application/envoy-report # python3 hot-restarter.py start-envoy.sh
starting hot-restarter with target: start-envoy.sh
forking and execing new child process at epoch 0
forked new child process with PID=16
Terminal 2
$ docker exec -it conndrop sh
/application # curl -s http://localhost:80/qotm/ | head -3
@kflynn @ark3 what is the behavior of Envoy here? Can you look at Envoy logs? Is it crashing? Is it just not flushing a response? Can you try to gather some more info?
@mattklein123 Here's the tail end of my Kubernetes run, included in the attached envoy-report.zip:
[2018-03-12 15:28:47.056][188][188][info][main] source/server/hot_restart_impl.cc:442] shutting down due to child request
[2018-03-12 15:28:47.056][188][188][warning][main] source/server/server.cc:312] caught SIGTERM
[2018-03-12 15:28:47.056][188][188][info][main] source/server/server.cc:357] main dispatch loop exited
[2018-03-12 15:28:47.056][188][198][trace][connection] source/common/network/connection_impl.cc:386] [C1] socket event: 3
[2018-03-12 15:28:47.056][188][197][debug][main] source/server/worker_impl.cc:95] worker exited dispatch loop
[2018-03-12 15:28:47.056][188][198][trace][connection] source/common/network/connection_impl.cc:454] [C1] write ready
[2018-03-12 15:28:47.057][188][198][trace][connection] source/common/network/connection_impl.cc:424] [C1] read ready
[2018-03-12 15:28:47.057][188][198][trace][connection] source/common/network/raw_buffer_socket.cc:25] [C1] read returns: 154
[2018-03-12 15:28:47.057][188][197][debug][grpc] source/common/grpc/google_async_client_impl.cc:24] Joining completionThread
[2018-03-12 15:28:47.057][188][198][trace][connection] source/common/network/raw_buffer_socket.cc:25] [C1] read returns: -1
[2018-03-12 15:28:47.057][188][198][trace][connection] source/common/network/raw_buffer_socket.cc:33] [C1] read error: 11
[2018-03-12 15:28:47.057][188][200][debug][grpc] source/common/grpc/google_async_client_impl.cc:56] completionThread exiting
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:298] [C1] parsing 154 bytes
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:186] [C1] new stream
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:285] [C1] completed header: key=Host value=54.90.170.39:31521
[2018-03-12 15:28:47.057][188][197][debug][grpc] source/common/grpc/google_async_client_impl.cc:26] Joined completionThread
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:285] [C1] completed header: key=User-Agent value=python-requests/2.18.4
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:285] [C1] completed header: key=Accept-Encoding value=gzip, deflate
[2018-03-12 15:28:47.057][188][197][debug][upstream] source/common/upstream/cluster_manager_impl.cc:587] shutting down thread local cluster manager
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:285] [C1] completed header: key=Accept value=*/*
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:354] [C1] headers complete
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:285] [C1] completed header: key=Connection value=keep-alive
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:536] [C1] message complete
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:447] [C1][S4054012562118838635] request headers complete (end_stream=true):
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] ':authority':'54.90.170.39:31521'
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] 'user-agent':'python-requests/2.18.4'
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] 'accept-encoding':'gzip, deflate'
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] 'accept':'*/*'
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] 'connection':'keep-alive'
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] ':path':'/qotm/'
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] ':method':'GET'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:249] [C1][S4054012562118838635] cluster 'cluster_qotm' match for URL '/qotm/'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] ':authority':'54.90.170.39:31521'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'user-agent':'python-requests/2.18.4'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'accept-encoding':'gzip, deflate'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'accept':'*/*'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] ':path':'/'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] ':method':'GET'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'x-forwarded-proto':'http'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'x-request-id':'8214e79a-057b-4826-8632-67f3f3e9afef'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'x-envoy-expected-rq-timeout-ms':'3000'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'x-envoy-original-path':'/qotm/'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] ':scheme':'http'
[2018-03-12 15:28:47.057][188][198][debug][pool] source/common/http/http1/conn_pool.cc:73] creating a new connection
[2018-03-12 15:28:47.057][188][198][debug][client] source/common/http/codec_client.cc:23] [C210] connecting
[2018-03-12 15:28:47.057][188][198][debug][connection] source/common/network/connection_impl.cc:568] [C210] connecting to 10.97.38.12:80
[2018-03-12 15:28:47.057][188][198][debug][connection] source/common/network/connection_impl.cc:577] [C210] connection in progress
[2018-03-12 15:28:47.057][188][198][debug][pool] source/common/http/http1/conn_pool.cc:99] queueing request due to no available connections
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:644] [C1][S4054012562118838635] decode headers called: filter=0x2544c80 status=1
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H':authority':'54.90.170.39:31521'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'user-agent':'python-requests/2.18.4'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'accept-encoding':'gzip, deflate'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'accept':'*/*'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H':path':'/'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H':method':'GET'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'x-forwarded-proto':'http'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'x-request-id':'8214e79a-057b-4826-8632-67f3f3e9afef'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'x-envoy-expected-rq-timeout-ms':'3000'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'x-envoy-original-path':'/qotm/'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H':scheme':'http'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:315] [C1] parsed 154 bytes
[2018-03-12 15:28:47.057][188][198][trace][connection] source/common/network/connection_impl.cc:229] [C1] readDisable: enabled=true disable=true
[2018-03-12 15:28:47.057][188][198][debug][main] source/server/worker_impl.cc:95] worker exited dispatch loop
[2018-03-12 15:28:47.057][188][198][debug][connection] source/common/network/connection_impl.cc:99] [C1] closing data_to_write=0 type=1
[2018-03-12 15:28:47.057][188][198][debug][connection] source/common/network/connection_impl.cc:134] [C1] closing socket: 1
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:895] [C1][S4054012562118838635] cancelling pool request
[2018-03-12 15:28:47.057][188][198][debug][pool] source/common/http/http1/conn_pool.cc:193] cancelling pending request
[2018-03-12 15:28:47.057][188][198][trace][main] source/common/event/dispatcher_impl.cc:126] item added to deferred deletion list (size=1)
[2018-03-12 15:28:47.057][188][198][debug][main] source/server/connection_handler_impl.cc:54] [C1] adding to cleanup list
[2018-03-12 15:28:47.057][188][198][trace][main] source/common/event/dispatcher_impl.cc:126] item added to deferred deletion list (size=2)
[2018-03-12 15:28:47.057][188][198][trace][main] source/common/event/dispatcher_impl.cc:52] clearing deferred deletion list (size=2)
[2018-03-12 15:28:47.057][188][198][debug][grpc] source/common/grpc/google_async_client_impl.cc:24] Joining completionThread
[2018-03-12 15:28:47.057][188][199][debug][grpc] source/common/grpc/google_async_client_impl.cc:56] completionThread exiting
[2018-03-12 15:28:47.057][188][198][debug][grpc] source/common/grpc/google_async_client_impl.cc:26] Joined completionThread
[2018-03-12 15:28:47.057][188][198][debug][upstream] source/common/upstream/cluster_manager_impl.cc:587] shutting down thread local cluster manager
[2018-03-12 15:28:47.057][188][198][debug][connection] source/common/network/connection_impl.cc:99] [C210] closing data_to_write=0 type=1
[2018-03-12 15:28:47.058][188][198][debug][connection] source/common/network/connection_impl.cc:134] [C210] closing socket: 1
[2018-03-12 15:28:47.058][188][198][debug][client] source/common/http/codec_client.cc:70] [C210] disconnect. resetting 0 pending requests
[2018-03-12 15:28:47.058][188][198][debug][pool] source/common/http/http1/conn_pool.cc:115] [C210] client disconnected
[2018-03-12 15:28:47.058][188][198][trace][main] source/common/event/dispatcher_impl.cc:126] item added to deferred deletion list (size=1)
[2018-03-12 15:28:47.058][188][198][trace][main] source/common/event/dispatcher_impl.cc:52] clearing deferred deletion list (size=1)
[2018-03-12 15:28:47.058][188][188][debug][grpc] source/common/grpc/google_async_client_impl.cc:24] Joining completionThread
[2018-03-12 15:28:47.058][188][195][debug][grpc] source/common/grpc/google_async_client_impl.cc:56] completionThread exiting
[2018-03-12 15:28:47.058][188][188][debug][grpc] source/common/grpc/google_async_client_impl.cc:26] Joined completionThread
[2018-03-12 15:28:47.058][188][188][debug][upstream] source/common/upstream/cluster_manager_impl.cc:587] shutting down thread local cluster manager
[2018-03-12 15:28:47.058][188][188][info][main] source/server/server.cc:392] exiting
This looks to me like Envoy is, for whatever reason, dropping a request on C1 and closing its socket before even forwarding the request upstream, much less getting a response.
Note that this is not always what I've seen in the Envoy logs. In other cases, the logs seem to imply that Envoy is in fact trying to flush all the data before closing the socket, but here... that's not the case. I can rerun it a few more times, of course.
@kflynn you are using really short drain/parent shutdown times. Perhaps the requests are just not completing in time?
Individual requests usually complete in under 10ms... and by design, my connection will never complete at all until Envoy shuts it down. So while the drain time is deliberately short, it seems that it can't be the real culprit.
After digging into this a bit further, it seems that the root of the problem is that Envoy should be sending `Connection: close` when it's about to shut the connection down, per RFC 7230 §6.6 (which goes on to mention that it can be risky for the server to simply summarily close the connection). Right now, Envoy simply summarily closes the connection.
Unraveling a few layers of Python (when one has to `git clone cpython`, one should really question one's life choices :) ): when the server summarily closes the connection like Envoy does, it's basically pure luck whether Python `requests` will notice that the connection has dropped before trying to send the next request. If it does, it opens a new connection and all is well. If not, it crashes, because it can't distinguish between the request getting no response and the request never being sent at all because the connection was already closed.
Sadly, that's really a POSIX thing rather than `requests` getting it wrong, so I think that many clients will have the same problem. We've already seen it in both Go and Python, using popular client libraries.
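To make that POSIX-level ambiguity concrete, here's a minimal stdlib-only sketch (not one of the attached scripts): a toy server answers one request and then summarily closes, the way Envoy does here. The client's second `send()` still "succeeds" because the bytes land in the kernel buffer; the dead peer only shows up on the next read, as an empty read or a reset.

```python
import socket
import threading

def one_shot_server(listener):
    # Answer exactly one request, then close without Connection: close --
    # the "summary close" behavior described above.
    conn, _ = listener.accept()
    conn.recv(1024)
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
t = threading.Thread(target=one_shot_server, args=(listener,))
t.start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"GET / HTTP/1.1\r\nHost: x\r\n\r\n")
first = client.recv(1024)   # a normal response for request 1
t.join()                    # server has now closed its end

# The client has no way to know that yet: send() happily queues the second
# request in the kernel buffer, and only the read reveals the dead peer.
second = b""
try:
    client.sendall(b"GET / HTTP/1.1\r\nHost: x\r\n\r\n")
    second = client.recv(1024)  # empty read (FIN) -- or an OSError on reset
except OSError:
    pass
print(first.split(b"\r\n")[0], second)
```

At the point the failure surfaces, the client cannot tell whether the second request was processed without a response or never delivered at all, which is exactly the ambiguity that trips up `requests` and the Go client.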
When the server does send `Connection: close`, though, things seem much happier. I ran a test with a custom HTTP server running on localhost, so Envoy and the entire external network aren't in the picture at all. This `test-close.py` server is attached.
If you run `python test-close.py no` in one window and `python loop.py http://localhost:3000/` in another, then `loop.py` will eventually crash with that `Remote end closed connection without response` error.
If you run `python test-close.py yes`, though, `loop.py` just keeps on going.
(Amusingly, the crash happens much more quickly with my laptop on battery power -- usually in fewer than 200 requests. Go figure. On the other hand, my `yes` run, also on battery power, is up to nearly 140K requests so far and still going strong...)
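The attached `test-close.py` isn't reproduced in the thread. As a rough stand-in -- stdlib `http.server`, with a boolean parameter instead of the yes/no argv switch -- the `Connection: close` half of it might look like the sketch below. (The real script presumably also drops connections after some number of requests to trigger the race; this sketch only shows the header-advertising side.)

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def make_handler(send_close):
    class Handler(BaseHTTPRequestHandler):
        protocol_version = "HTTP/1.1"  # keep-alive unless we say otherwise

        def do_GET(self):
            body = b'{"ok": true}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            if send_close:
                # Advertise the close per RFC 7230 sec. 6.6; http.server
                # also flips self.close_connection when it sees this header.
                self.send_header("Connection", "close")
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the demo quiet

    return Handler

def start_server(send_close, port=0):
    # port=0 picks a free port; the chosen one is in srv.server_address
    srv = HTTPServer(("127.0.0.1", port), make_handler(send_close))
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

With `send_close=True`, a keep-alive client sees the header, finishes reading the response, and opens a fresh connection for the next request -- which is why the `yes` run never crashes.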
Envoy will send `connection: close` when it is doing a full drain close of a connection. It can only do this when it is sending a response.
I'm not sure what sequence of events is causing this, but the root cause is that your drain time is too short. Please make it longer and see if the issue still happens.
With all respect, @mattklein123, too short for what exactly? :smile:
Remember that we have only one connection active and only one request in flight -- so there's only one stream active in Envoy, and in my test it has a typical lifespan of less than 20ms. And, in fact, I see Envoy logging `drain closing connection`, but not doing anything about that.
I just reran the test to be able to build a timeline. We start with Envoy PID 263 active, handling requests on `[C1]`. I HUP the `hot-restarter` at about 15:11:34.986. Right after that a new request comes in, which makes sense, and it takes 15 ms from the point we identify it as a new stream to the point that the response is written downstream:
new stream on [C1]
upstream headers complete
write(2) on [C1]
write(2) returns OK
Right after that, our new PID 276 gets started up and tells PID 263 to stop handling admin requests:
initializing epoch 3 (I had to restart a couple of times; Python got lucky)
shutting down admin due to child startup
So PID 263 knows already that there's a restart coming, but it handles another request anyway, which makes sense because we haven't seen PID 276 really get started yet.
Then we have a couple of things overlapping: PID 276 gets its dispatcher going while PID 263 starts handling a new stream:
loading 1 listener(s)
[C1] new stream
starting main dispatch loop
closing and draining listeners
drain tick #0
PID 263 is already 9ms into handling a stream, and in fact has already made a connection to the upstream, so I wouldn't have expected to see any special handling for that request already in flight. But nope, I'm wrong about that:
15:11:35.107 PID 263 [C1][S15918730273159564041] drain closing connection
However, there is still no `Connection: close`. Here's the full log sequence for the downstream headers going out:
[2018-03-19 15:11:35.107][263][271][trace][http] source/common/http/conn_manager_impl.cc:862] [C1][S15918730273159564041] encode headers called: filter=0x3352d80 status=0
[2018-03-19 15:11:35.107][263][271][debug][http] source/common/http/conn_manager_impl.cc:890] [C1][S15918730273159564041] drain closing connection
[2018-03-19 15:11:35.107][263][271][debug][http] source/common/http/conn_manager_impl.cc:942] [C1][S15918730273159564041] encoding headers via codec (end_stream=false):
[2018-03-19 15:11:35.108][263][271][debug][http] source/common/http/conn_manager_impl.cc:947] [C1][S15918730273159564041] 'content-type':'application/json'
[2018-03-19 15:11:35.108][263][271][debug][http] source/common/http/conn_manager_impl.cc:947] [C1][S15918730273159564041] 'content-length':'172'
[2018-03-19 15:11:35.108][263][271][debug][http] source/common/http/conn_manager_impl.cc:947] [C1][S15918730273159564041] 'server':'envoy'
[2018-03-19 15:11:35.108][263][271][debug][http] source/common/http/conn_manager_impl.cc:947] [C1][S15918730273159564041] 'date':'Mon, 19 Mar 2018 15:11:34 GMT'
[2018-03-19 15:11:35.108][263][271][debug][http] source/common/http/conn_manager_impl.cc:947] [C1][S15918730273159564041] ':status':'200'
[2018-03-19 15:11:35.108][263][271][debug][http] source/common/http/conn_manager_impl.cc:947] [C1][S15918730273159564041] 'x-envoy-upstream-service-time':'9'
[2018-03-19 15:11:35.108][263][271][trace][connection] source/common/network/connection_impl.cc:323] [C1] writing 158 bytes, end_stream false
I verified that the client doesn't see any headers that aren't mentioned above, so it's not a logging issue. It's another millisecond before the downstream writes complete, so the whole request lasts 18ms.
[C1] write returns: 330
...and that's it. Envoy seems to think it's already done all it needs to, because it merrily handles 98 more requests on PID 263 `[C1]`, none longer than 23 ms, before finally deciding that it's been long enough, shutting the connection down, and exiting.
So I can certainly increase the drain time, but to what end? Envoy figured out that it needed to drain the connection - and claimed that it would do exactly that, even - within 150ms from when I sent the HUP. What would extra drain time accomplish?
@kflynn I'm sorry but I don't have time to go through this in detail. Envoy uses an incremental drain sequence to slowly drain connections. It's quite likely that your connection is not getting selected for drain, and then the parent shutdown time is too short to allow for proper drain before the entire process shuts down.
As I've already asked, please try longer drain times and see if that helps the issue. Like potentially 5s drain, 10s parent shutdown.
@kflynn I'm going to go ahead and close this out for now. Please reopen if the longer drain times don't solve the problem or if after debugging further it looks like some type of Envoy bug.
A client using HTTP/1.1 to send multiple successive requests to an Envoy can see its connection drop in the middle of a request when the Envoy is hot-restarted.
My reproduction case is on Kubernetes, running with a custom Ambassador Envoy image that includes PIDs in the logs:
That image includes a stripped debug-built Envoy running on Alpine, with `logger.cc` changed to include `logger_->set_pattern("[%Y-%m-%d %T.%e][%P][%t][%l][%n] %v");`
Ambassador's `extauth` filter is not in use. In fact Ambassador isn't running; I use `kubectl exec` to get a shell on my pod and then run `python3 hot-restart.py start-envoy.sh`, where `hot-restart.py` and `start-envoy.sh` are both included in the attached ZIP archive.

Envoy ends up running as:
and I then run - on my laptop - a single instance of `loop.py`, also included in the ZIP file, but reproduced here too:

With the single instance of `loop.py` running, send a SIGHUP to `hot-restarter.py`. It won't take many restarts before `loop.py` crashes, complaining that `Remote end closed connection without response`.
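The text of `loop.py` wasn't preserved in this extraction. Assuming it issues repeated GETs over one persistent connection (consistent with the `python-requests/2.18.4` User-Agent in the logs above), a stdlib `http.client` stand-in might look like:

```python
import http.client

def loop(host, port, path="/qotm/", limit=None):
    """Hammer one persistent HTTP/1.1 connection with GETs.

    If the server summarily closes the connection mid-stream, the next
    getresponse() raises http.client.RemoteDisconnected, whose message is
    exactly "Remote end closed connection without response".
    """
    conn = http.client.HTTPConnection(host, port)
    done = 0
    while limit is None or done < limit:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()  # drain the body so the socket can be reused
        assert resp.status == 200, resp.status
        done += 1
    return done
```

Running `loop("localhost", 3000, path="/")` against the `test-close.py` server reproduces the two outcomes described above: with `Connection: close` it runs indefinitely; without it, it eventually dies in `getresponse()`.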
Note that digging into Envoy's trace logs, nothing looks obviously wrong: I see the new Envoy process picking up connections (and the old process not picking them up) after the restart, and I see `write(2)` returning values that look correct and don't look like things are getting dropped. But this doesn't seem like expected behavior, either.

envoy-report.zip