JuliaLang / MbedTLS.jl

Wrapper around mbedtls
Other
41 stars 50 forks source link

AWS API calls through AWSCore.jl/HTTP.jl/MbedTLS.jl hanging #174

Closed samoconnor closed 6 years ago

samoconnor commented 6 years ago

We have had an issue of intermittent hang-ups when doing large volumes of AWS API calls through AWSCore and HTTP.jl.

One manifestation of the problem occurred after several HTTP Transactions with non-trivial size Response Body had already occurred on a connection. The server would send a final 5-byte 0\r\n\r\n end-of-chunked-body marker. A tiny packet compared to the previous steady stream of body packets. Somehow or other this last packet was never seen by the TLS layer. They HTTP layer was waiting in eof(::TCPSocket) but never waking up. It is possible that a Delayed ACK / Naggle interaction caused this last backed to be delayed a bit (Linux client).

Somehow this timing difference seems to have resulted in a race-condition whereby eof() goes to sleep even though there is data waiting. Setting up an async task to trigger when the hang-up had gone on for 2 seconds allowed us to see that the data was, at that point available in the TLS buffer, event though right before going into eof() it was not there.

A while, back we added this Base.start_reading band-aid to try to keep libuv watching the socket so that we don't miss TLS close_notify messages when no one is reading from the connection: https://github.com/JuliaWeb/MbedTLS.jl/blob/7ee7901442c5d9d3b80088244977033a5f7a4ff9/src/ssl.jl#L186-L202

It seems likely that some interaction between the isopen(ctx) call above and the going-into-eof code is causing the problem.

There are a few bad smells around. First, the original design of the library didn't really account for OOB messages like TLS close_notify (and other internal TLS messages that client and server may send to each other). This is why we ended up with the @async while isopen(ctx) loop above, and is also the reason for some convoluted code to do zero-length reads (which pump the TLS library's internal state) before doing bytes available queries. Another aspect of this that seems not quite is the fact that eof(::SSLContext) calls eof(::TCPSocket) directly. It all feels at this point like a bunch of work-arounds (many of them implemented by me in haste) for an initial assumption that there would be more of a 1-1 correspondence between encrypted data arriving and decrypted data being ready to read.

PRs to follow...

samoconnor commented 6 years ago

Thankyou @quinnj for reviewing #175 and #167 so quickly! 🙏

quinnj commented 6 years ago

Yeah, I started to look over #177 and #178 , but I'll do a more in-depth review tonight (6 hours or so)

quinnj commented 6 years ago

I've been running MbedTLS + HTTP master in a dev environment for a couple hours now and it's been looking good. I did see this error/backtrace on (I believe) the first https request that was made:

2018-10-16T18:21:34.672023325Z ┌ Warning: Error in @async writebody task
2018-10-16T18:21:34.672084004Z │   exception =
2018-10-16T18:21:34.672112144Z │    ArgumentError: `unsafe_write` requires `iswritable(::SSLContext)`
2018-10-16T18:21:34.672123265Z │    Stacktrace:
2018-10-16T18:21:34.672132706Z │     [1] ssl_unsafe_write(::MbedTLS.SSLContext, ::Ptr{UInt8}, ::UInt64) at /home/domorun/.julia/packages/MbedTLS/cXN47/src/ssl.jl:194
2018-10-16T18:21:34.672142684Z │     [2] unsafe_write at /home/domorun/.julia/packages/MbedTLS/cXN47/src/ssl.jl:371 [inlined]
2018-10-16T18:21:34.672161296Z │     [3] unsafe_write(::HTTP.Streams.Stream{HTTP.Messages.Response,HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}}, ::Ptr{UInt8}, ::UInt64) at /home/domorun/.julia/packages/HTTP/nV2Zw/src/Streams.jl:96
2018-10-16T18:21:34.672171419Z │     [4] unsafe_write at ./io.jl:509 [inlined]
2018-10-16T18:21:34.672180571Z │     [5] macro expansion at ./gcutils.jl:87 [inlined]
2018-10-16T18:21:34.672196736Z │     [6] write at ./io.jl:532 [inlined]
2018-10-16T18:21:34.672206384Z │     [7] writebody(::HTTP.Streams.Stream{HTTP.Messages.Response,HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}}, ::HTTP.Messages.Request, ::Array{UInt8,1}) at /home/domorun/.julia/packages/HTTP/nV2Zw/src/StreamRequest.jl:100
2018-10-16T18:21:34.672220741Z │     [8] (::getfield(HTTP.StreamRequest, Symbol("##2#3")){HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext},HTTP.Messages.Request,Array{UInt8,1}})() at ./task.jl:259
2018-10-16T18:21:34.672230712Z └ @ HTTP.StreamRequest ~/.julia/packages/HTTP/nV2Zw/src/StreamRequest.jl:60

Not sure if we're maybe running into some kind of first-run-compilation race condition or something? Or somehow compiling the "handshake" stack of functions somehow affects the actual handshake differently than it used to?

samoconnor commented 6 years ago

If we got up to writebody then we must have been through startwrite (headers) without issue. Can you try adding this to HTTP.jl/src/IOExtras.jl ...

isioerror(e::ArgumentError) = e.msg == "stream is closed or unusable"
isioerror(e::ArgumentError) = e.msg == "`unsafe_write` requires `iswritable(::SSLContext)`"
samoconnor commented 6 years ago

... or perhaps this would be better:

--- a/src/ssl.jl
+++ b/src/ssl.jl
@@ -191,6 +191,8 @@ See `f_send` and `set_bio!` below.
 """
 function ssl_unsafe_write(ctx::SSLContext, buf::Ptr{UInt8}, nbytes::UInt)

+    Base.check_open(ctx.bio)
+
     iswritable(ctx) ||
     throw(ArgumentError("`unsafe_write` requires `iswritable(::SSLContext)`"))
samoconnor commented 6 years ago

That way we should get the old stream is closed or unusable message (and also be in sync with https://github.com/JuliaLang/julia/pull/29274).

samoconnor commented 6 years ago

The the "requires iswritable(::SSLContext)" is still useful in the case where check_open finds that the connection is open. In this case iswritable = false means that someone explicitly called close on the SSLContext.

samoconnor commented 6 years ago

I've been running MbedTLS + HTTP master in a dev environment for a couple hours now and it's been looking good.

That is good news ! 🤖😬✅ 🎉

samoconnor commented 6 years ago

Base.check_open(ctx.bio) added ... master: https://github.com/JuliaWeb/MbedTLS.jl/commit/12e0100084d0bf0e59d78562dd8409ee2efbaa68 0.6: https://github.com/JuliaWeb/MbedTLS.jl/commit/76036da91ac03b7429c55f4bb1d8e5ba468b35cc

samoconnor commented 6 years ago

Fixed 1.0 compat on master: https://github.com/JuliaWeb/MbedTLS.jl/commit/0904bde166b5675a93a148d32b52ce2331f65b4e

samoconnor commented 6 years ago

@janfrancu reports that this resolves an issue with AWSS3.jl: https://github.com/JuliaWeb/HTTP.jl/issues/307#issuecomment-430987933

samoconnor commented 6 years ago

Ready to tag @quinnj ?

quinnj commented 6 years ago

Yeah; with the testing I've been doing, I seem to see more IOErrors than before, but I believe these were cases that were probably getting lost/swallowed before.