kflynn closed this issue 6 years ago
@kflynn Did you try to reproduce this issue with an older SHA? I was wondering if this is related https://github.com/envoyproxy/envoy/commit/1db0c7ace77263c554d18dfffa21ea2b5ea6f3b9
Sorry, should've specified that I built from a branch merged only up to 0ab49ee302b933e7da875405269d33b16e4e0bd4 on 15 February, so not a factor, @gsagula.
Also, we've reproduced it entirely within Docker at this point...
To replicate this entirely inside a Docker container:
Terminal 1
$ docker run --rm -it --entrypoint sh --name conndrop dwflynn/ambassador:pidlog
/application # wget -q https://github.com/envoyproxy/envoy/files/1803492/envoy-report.zip
/application # unzip -q envoy-report.zip
/application # cd envoy-report/
/application/envoy-report # # Point cluster_qotm to the admin service
/application/envoy-report # sed -i.orig -e s,tcp://qotm:80,tcp://127.0.0.1:8001, envoy-1.json
/application/envoy-report # cp envoy-1.json /etc/
/application/envoy-report # python3 hot-restarter.py start-envoy.sh
starting hot-restarter with target: start-envoy.sh
forking and execing new child process at epoch 0
forked new child process with PID=16
Terminal 2
$ docker exec -it conndrop sh
/application # curl -s http://localhost:80/qotm/ | head -3
@kflynn @ark3 what is the behavior of Envoy here? Can you look at Envoy logs? Is it crashing? Is it just not flushing a response? Can you try to gather some more info?
@mattklein123 Here's the tail end of my Kubernetes run, included in the attached envoy-report.zip:
[2018-03-12 15:28:47.056][188][188][info][main] source/server/hot_restart_impl.cc:442] shutting down due to child request
[2018-03-12 15:28:47.056][188][188][warning][main] source/server/server.cc:312] caught SIGTERM
[2018-03-12 15:28:47.056][188][188][info][main] source/server/server.cc:357] main dispatch loop exited
[2018-03-12 15:28:47.056][188][198][trace][connection] source/common/network/connection_impl.cc:386] [C1] socket event: 3
[2018-03-12 15:28:47.056][188][197][debug][main] source/server/worker_impl.cc:95] worker exited dispatch loop
[2018-03-12 15:28:47.056][188][198][trace][connection] source/common/network/connection_impl.cc:454] [C1] write ready
[2018-03-12 15:28:47.057][188][198][trace][connection] source/common/network/connection_impl.cc:424] [C1] read ready
[2018-03-12 15:28:47.057][188][198][trace][connection] source/common/network/raw_buffer_socket.cc:25] [C1] read returns: 154
[2018-03-12 15:28:47.057][188][197][debug][grpc] source/common/grpc/google_async_client_impl.cc:24] Joining completionThread
[2018-03-12 15:28:47.057][188][198][trace][connection] source/common/network/raw_buffer_socket.cc:25] [C1] read returns: -1
[2018-03-12 15:28:47.057][188][198][trace][connection] source/common/network/raw_buffer_socket.cc:33] [C1] read error: 11
[2018-03-12 15:28:47.057][188][200][debug][grpc] source/common/grpc/google_async_client_impl.cc:56] completionThread exiting
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:298] [C1] parsing 154 bytes
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:186] [C1] new stream
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:285] [C1] completed header: key=Host value=54.90.170.39:31521
[2018-03-12 15:28:47.057][188][197][debug][grpc] source/common/grpc/google_async_client_impl.cc:26] Joined completionThread
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:285] [C1] completed header: key=User-Agent value=python-requests/2.18.4
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:285] [C1] completed header: key=Accept-Encoding value=gzip, deflate
[2018-03-12 15:28:47.057][188][197][debug][upstream] source/common/upstream/cluster_manager_impl.cc:587] shutting down thread local cluster manager
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:285] [C1] completed header: key=Accept value=*/*
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:354] [C1] headers complete
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:285] [C1] completed header: key=Connection value=keep-alive
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:536] [C1] message complete
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:447] [C1][S4054012562118838635] request headers complete (end_stream=true):
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] ':authority':'54.90.170.39:31521'
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] 'user-agent':'python-requests/2.18.4'
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] 'accept-encoding':'gzip, deflate'
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] 'accept':'*/*'
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] 'connection':'keep-alive'
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] ':path':'/qotm/'
[2018-03-12 15:28:47.057][188][198][debug][http] source/common/http/conn_manager_impl.cc:452] [C1][S4054012562118838635] ':method':'GET'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:249] [C1][S4054012562118838635] cluster 'cluster_qotm' match for URL '/qotm/'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] ':authority':'54.90.170.39:31521'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'user-agent':'python-requests/2.18.4'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'accept-encoding':'gzip, deflate'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'accept':'*/*'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] ':path':'/'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] ':method':'GET'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'x-forwarded-proto':'http'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'x-request-id':'8214e79a-057b-4826-8632-67f3f3e9afef'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'x-envoy-expected-rq-timeout-ms':'3000'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] 'x-envoy-original-path':'/qotm/'
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:297] [C1][S4054012562118838635] ':scheme':'http'
[2018-03-12 15:28:47.057][188][198][debug][pool] source/common/http/http1/conn_pool.cc:73] creating a new connection
[2018-03-12 15:28:47.057][188][198][debug][client] source/common/http/codec_client.cc:23] [C210] connecting
[2018-03-12 15:28:47.057][188][198][debug][connection] source/common/network/connection_impl.cc:568] [C210] connecting to 10.97.38.12:80
[2018-03-12 15:28:47.057][188][198][debug][connection] source/common/network/connection_impl.cc:577] [C210] connection in progress
[2018-03-12 15:28:47.057][188][198][debug][pool] source/common/http/http1/conn_pool.cc:99] queueing request due to no available connections
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:644] [C1][S4054012562118838635] decode headers called: filter=0x2544c80 status=1
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H':authority':'54.90.170.39:31521'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'user-agent':'python-requests/2.18.4'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'accept-encoding':'gzip, deflate'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'accept':'*/*'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H':path':'/'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H':method':'GET'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'x-forwarded-proto':'http'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'x-request-id':'8214e79a-057b-4826-8632-67f3f3e9afef'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'x-envoy-expected-rq-timeout-ms':'3000'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H'x-envoy-original-path':'/qotm/'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/conn_manager_impl.cc:649] [C1][S4054012562118838635] H':scheme':'http'
[2018-03-12 15:28:47.057][188][198][trace][http] source/common/http/http1/codec_impl.cc:315] [C1] parsed 154 bytes
[2018-03-12 15:28:47.057][188][198][trace][connection] source/common/network/connection_impl.cc:229] [C1] readDisable: enabled=true disable=true
[2018-03-12 15:28:47.057][188][198][debug][main] source/server/worker_impl.cc:95] worker exited dispatch loop
[2018-03-12 15:28:47.057][188][198][debug][connection] source/common/network/connection_impl.cc:99] [C1] closing data_to_write=0 type=1
[2018-03-12 15:28:47.057][188][198][debug][connection] source/common/network/connection_impl.cc:134] [C1] closing socket: 1
[2018-03-12 15:28:47.057][188][198][debug][router] source/common/router/router.cc:895] [C1][S4054012562118838635] cancelling pool request
[2018-03-12 15:28:47.057][188][198][debug][pool] source/common/http/http1/conn_pool.cc:193] cancelling pending request
[2018-03-12 15:28:47.057][188][198][trace][main] source/common/event/dispatcher_impl.cc:126] item added to deferred deletion list (size=1)
[2018-03-12 15:28:47.057][188][198][debug][main] source/server/connection_handler_impl.cc:54] [C1] adding to cleanup list
[2018-03-12 15:28:47.057][188][198][trace][main] source/common/event/dispatcher_impl.cc:126] item added to deferred deletion list (size=2)
[2018-03-12 15:28:47.057][188][198][trace][main] source/common/event/dispatcher_impl.cc:52] clearing deferred deletion list (size=2)
[2018-03-12 15:28:47.057][188][198][debug][grpc] source/common/grpc/google_async_client_impl.cc:24] Joining completionThread
[2018-03-12 15:28:47.057][188][199][debug][grpc] source/common/grpc/google_async_client_impl.cc:56] completionThread exiting
[2018-03-12 15:28:47.057][188][198][debug][grpc] source/common/grpc/google_async_client_impl.cc:26] Joined completionThread
[2018-03-12 15:28:47.057][188][198][debug][upstream] source/common/upstream/cluster_manager_impl.cc:587] shutting down thread local cluster manager
[2018-03-12 15:28:47.057][188][198][debug][connection] source/common/network/connection_impl.cc:99] [C210] closing data_to_write=0 type=1
[2018-03-12 15:28:47.058][188][198][debug][connection] source/common/network/connection_impl.cc:134] [C210] closing socket: 1
[2018-03-12 15:28:47.058][188][198][debug][client] source/common/http/codec_client.cc:70] [C210] disconnect. resetting 0 pending requests
[2018-03-12 15:28:47.058][188][198][debug][pool] source/common/http/http1/conn_pool.cc:115] [C210] client disconnected
[2018-03-12 15:28:47.058][188][198][trace][main] source/common/event/dispatcher_impl.cc:126] item added to deferred deletion list (size=1)
[2018-03-12 15:28:47.058][188][198][trace][main] source/common/event/dispatcher_impl.cc:52] clearing deferred deletion list (size=1)
[2018-03-12 15:28:47.058][188][188][debug][grpc] source/common/grpc/google_async_client_impl.cc:24] Joining completionThread
[2018-03-12 15:28:47.058][188][195][debug][grpc] source/common/grpc/google_async_client_impl.cc:56] completionThread exiting
[2018-03-12 15:28:47.058][188][188][debug][grpc] source/common/grpc/google_async_client_impl.cc:26] Joined completionThread
[2018-03-12 15:28:47.058][188][188][debug][upstream] source/common/upstream/cluster_manager_impl.cc:587] shutting down thread local cluster manager
[2018-03-12 15:28:47.058][188][188][info][main] source/server/server.cc:392] exiting
This looks to me like Envoy is, for whatever reason, dropping a request on C1 and closing its socket before even forwarding the request upstream, much less getting a response.
Note that this is not always what I've seen in the Envoy logs. In other cases, the logs seem to imply that Envoy is in fact trying to flush all the data before closing the socket, but here... that's not the case. I can rerun it a few more times, of course.
@kflynn you are using really short drain/parent shutdown times. Perhaps the requests are just not completing in time?
Individual requests usually complete in under 10ms... and by design, my connection will never complete at all until Envoy shuts it down. So while the drain time is deliberately short, it seems that it can't be the real culprit.
After digging into this a bit further, it seems that the root of the problem is that Envoy should be sending `Connection: close` when it's about to shut the connection down, per RFC 7230 §6.6 (which goes on to mention that it can be risky for the server to simply summarily close the connection). Right now, Envoy simply summarily closes the connection.
Unraveling a few layers of Python (when one has to `git clone cpython`, one should really question one's life choices :) ): when the server summarily closes the connection like Envoy does, it's basically pure luck whether Python `requests` will notice that the connection has dropped before trying to send the next request. If it does, it opens a new connection and all is well. If not, it crashes, because it can't distinguish between the request getting no response and the request never being sent at all because the connection was already closed.
Sadly, that's really a POSIX thing rather than `requests` getting it wrong, so I think that many clients will have the same problem. We've already seen it in both Go and Python, using popular client libraries.
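To make that POSIX-level ambiguity concrete, here's a minimal stdlib-only sketch (not one of the attached scripts): a toy server answers one request and then summarily closes, the way Envoy does here. The client's second `send()` still "succeeds" because the bytes land in the kernel buffer; the dead peer only shows up on the next read, as an empty read or a reset.

```python
import socket
import threading

def one_shot_server(listener):
    # Answer exactly one request, then close without Connection: close --
    # the "summary close" behavior described above.
    conn, _ = listener.accept()
    conn.recv(1024)
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
t = threading.Thread(target=one_shot_server, args=(listener,))
t.start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"GET / HTTP/1.1\r\nHost: x\r\n\r\n")
first = client.recv(1024)   # a normal response for request 1
t.join()                    # server has now closed its end

# The client has no way to know that yet: send() happily queues the second
# request in the kernel buffer, and only the read reveals the dead peer.
second = b""
try:
    client.sendall(b"GET / HTTP/1.1\r\nHost: x\r\n\r\n")
    second = client.recv(1024)  # empty read (FIN) -- or an OSError on reset
except OSError:
    pass
print(first.split(b"\r\n")[0], second)
```

At the point the failure surfaces, the client cannot tell whether the second request was processed without a response or never delivered at all, which is exactly the ambiguity that trips up `requests` and the Go client.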
When the server does send `Connection: close`, though, things seem much happier. I ran a test with a custom HTTP server running on localhost, so Envoy and the entire external network aren't in the picture at all. This `test-close.py` server is attached.
If you run `python test-close.py no` in one window and `python loop.py http://localhost:3000/` in another, then `loop.py` will eventually crash with that `Remote end closed connection without response` error.
If you run `python test-close.py yes`, though, `loop.py` just keeps on going.
(Amusingly, the crash happens much more quickly with my laptop on battery power -- usually in fewer than 200 requests. Go figure. On the other hand, my `yes` run, also on battery power, is up to nearly 140K requests so far and still going strong...)
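The attached `test-close.py` isn't reproduced in the thread. As a rough stand-in -- stdlib `http.server`, with a boolean parameter instead of the yes/no argv switch -- the `Connection: close` half of it might look like the sketch below. (The real script presumably also drops connections after some number of requests to trigger the race; this sketch only shows the header-advertising side.)

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def make_handler(send_close):
    class Handler(BaseHTTPRequestHandler):
        protocol_version = "HTTP/1.1"  # keep-alive unless we say otherwise

        def do_GET(self):
            body = b'{"ok": true}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            if send_close:
                # Advertise the close per RFC 7230 sec. 6.6; http.server
                # also flips self.close_connection when it sees this header.
                self.send_header("Connection", "close")
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the demo quiet

    return Handler

def start_server(send_close, port=0):
    # port=0 picks a free port; the chosen one is in srv.server_address
    srv = HTTPServer(("127.0.0.1", port), make_handler(send_close))
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

With `send_close=True`, a keep-alive client sees the header, finishes reading the response, and opens a fresh connection for the next request -- which is why the `yes` run never crashes.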
Envoy will send `connection: close` when it is doing a full drain close of a connection. It can only do this when it is sending a response.
I'm not sure what sequence of events is causing this, but the root cause is that your drain time is too short. Please make it longer and see if the issue still happens.
With all respect, @mattklein123, too short for what exactly? :smile:
Remember that we have only one connection active and only one request in flight -- so there's only one stream active in Envoy, and in my test it has a typical lifespan of less than 20ms. And, in fact, I see Envoy logging `drain closing connection`, but not doing anything about that.
I just reran the test to be able to build a timeline. We start with Envoy PID 263 active, handling requests on `[C1]`. I HUP the `hot-restarter` at about 15:11:34.986. Right after that a new request comes in, which makes sense, and it takes 15 ms from the point we identify it as a new stream to the point that the response is written downstream:
new stream on [C1]
upstream headers complete
write(2) on [C1]
write(2) returns OK
Right after that, our new PID 276 gets started up and tells PID 263 to stop handling admin requests:
initializing epoch 3 (I had to restart a couple of times; Python got lucky)
shutting down admin due to child startup
So PID 263 knows already that there's a restart coming, but it handles another request anyway, which makes sense because we haven't seen PID 276 really get started yet.
Then we have a couple of things overlapping: PID 276 gets its dispatcher going while PID 263 starts handling a new stream:
loading 1 listener(s)
[C1] new stream
starting main dispatch loop
closing and draining listeners
drain tick #0
PID 263 is already 9ms into handling a stream, and in fact has already made a connection to the upstream, so I wouldn't have expected to see any special handling for that request already in flight. But nope, I'm wrong about that:
15:11:35.107 PID 263 [C1][S15918730273159564041] drain closing connection
However, there is still no `Connection: close`. Here's the full log sequence for the downstream headers going out:
[2018-03-19 15:11:35.107][263][271][trace][http] source/common/http/conn_manager_impl.cc:862] [C1][S15918730273159564041] encode headers called: filter=0x3352d80 status=0
[2018-03-19 15:11:35.107][263][271][debug][http] source/common/http/conn_manager_impl.cc:890] [C1][S15918730273159564041] drain closing connection
[2018-03-19 15:11:35.107][263][271][debug][http] source/common/http/conn_manager_impl.cc:942] [C1][S15918730273159564041] encoding headers via codec (end_stream=false):
[2018-03-19 15:11:35.108][263][271][debug][http] source/common/http/conn_manager_impl.cc:947] [C1][S15918730273159564041] 'content-type':'application/json'
[2018-03-19 15:11:35.108][263][271][debug][http] source/common/http/conn_manager_impl.cc:947] [C1][S15918730273159564041] 'content-length':'172'
[2018-03-19 15:11:35.108][263][271][debug][http] source/common/http/conn_manager_impl.cc:947] [C1][S15918730273159564041] 'server':'envoy'
[2018-03-19 15:11:35.108][263][271][debug][http] source/common/http/conn_manager_impl.cc:947] [C1][S15918730273159564041] 'date':'Mon, 19 Mar 2018 15:11:34 GMT'
[2018-03-19 15:11:35.108][263][271][debug][http] source/common/http/conn_manager_impl.cc:947] [C1][S15918730273159564041] ':status':'200'
[2018-03-19 15:11:35.108][263][271][debug][http] source/common/http/conn_manager_impl.cc:947] [C1][S15918730273159564041] 'x-envoy-upstream-service-time':'9'
[2018-03-19 15:11:35.108][263][271][trace][connection] source/common/network/connection_impl.cc:323] [C1] writing 158 bytes, end_stream false
I verified that the client doesn't see any headers that aren't mentioned above, so it's not a logging issue. It's another millisecond before the downstream writes complete, so the whole request lasts 18ms.
[C1] write returns: 330
...and that's it. Envoy seems to think it's already done all it needs to, because it merrily handles 98 more requests on PID 263 `[C1]`, none longer than 23 ms, before finally deciding that it's been long enough, shutting the connection down, and exiting.
So I can certainly increase the drain time, but to what end? Envoy figured out that it needed to drain the connection - and claimed that it would do exactly that, even - within 150ms from when I sent the HUP. What would extra drain time accomplish?
@kflynn I'm sorry but I don't have time to go through this in detail. Envoy uses an incremental drain sequence to slowly drain connections. It's quite likely that your connection is not getting selected for drain, and then the parent shutdown time is too short to allow for proper drain before the entire process shuts down.
As I've already asked, please try longer drain times and see if that helps the issue. Like potentially 5s drain, 10s parent shutdown.
@kflynn I'm going to go ahead and close this out for now. Please reopen if the longer drain times don't solve the problem or if after debugging further it looks like some type of Envoy bug.
A client using HTTP/1.1 to send multiple successive requests to an Envoy can see its connection drop in the middle of a request when the Envoy is hot-restarted.
My reproduction case is on Kubernetes, running with a custom Ambassador Envoy image that includes PIDs in the logs:
That image includes a stripped debug-built Envoy running on Alpine, with `logger.cc` changed to include `logger_->set_pattern("[%Y-%m-%d %T.%e][%P][%t][%l][%n] %v");`
Ambassador's `extauth` filter is not in use. In fact Ambassador isn't running; I use `kubectl exec` to get a shell on my pod and then run `python3 hot-restart.py start-envoy.sh`, where `hot-restart.py` and `start-envoy.sh` are both included in the attached ZIP archive.

Envoy ends up running as:
and I then run - on my laptop - a single instance of `loop.py`, also included in the ZIP file, but reproduced here too:

With the single instance of `loop.py` running, send a SIGHUP to `hot-restarter.py`. It won't take many restarts before `loop.py` crashes, complaining that `Remote end closed connection without response`.
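The text of `loop.py` wasn't preserved in this extraction. Assuming it issues repeated GETs over one persistent connection (consistent with the `python-requests/2.18.4` User-Agent in the logs above), a stdlib `http.client` stand-in might look like:

```python
import http.client

def loop(host, port, path="/qotm/", limit=None):
    """Hammer one persistent HTTP/1.1 connection with GETs.

    If the server summarily closes the connection mid-stream, the next
    getresponse() raises http.client.RemoteDisconnected, whose message is
    exactly "Remote end closed connection without response".
    """
    conn = http.client.HTTPConnection(host, port)
    done = 0
    while limit is None or done < limit:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()  # drain the body so the socket can be reused
        assert resp.status == 200, resp.status
        done += 1
    return done
```

Running `loop("localhost", 3000, path="/")` against the `test-close.py` server reproduces the two outcomes described above: with `Connection: close` it runs indefinitely; without it, it eventually dies in `getresponse()`.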
Note that digging into Envoy's trace logs, nothing looks obviously wrong: I see the new Envoy process picking up connections (and the old process not picking them up) after the restart, and I see `write(2)` returning values that look correct and don't look like things are getting dropped. But this doesn't seem like expected behavior, either.

envoy-report.zip