Open alyst opened 2 years ago
Could have been introduced by https://github.com/JuliaLang/MbedTLS.jl/pull/245 perhaps, given the locking in the backtrace. What do you think, @quinnj?
Nah, that's a different lock. More likely due to #243 which changed around some of the ssl.jl code; cc: @vtjnash. Note however that the version of HTTP.jl being used here has known multithreading issues, but that doesn't seem to be a factor in this case.
Is it spinlooping at 100% CPU? The code in HTTP.jl looked like it might just livelock if it encounters any errors, since it wraps most operations in while true; try; success() && return; catch; #= fail to handle errors =# end; end
Is it spinlooping at 100% CPU?
I haven't done dedicated testing, and from my normal interaction with Julia is a bit hard to say, because Julia VSCode extension also has some problems at the moment. But I have an impression that it doesn't spinloop, the CPU usage is zero.
since it wraps most operations in ...
I'm not sure what this comment is referring to in the HTTP.jl code? Can you point out specifically where this is the pattern and causing problems?
okay, sorry, it does look like most of the try
blocks should be outside of while loop https://github.com/JuliaWeb/HTTP.jl/pull/888
@alyst , can you check that this issue is fixed after upgrading to v1.1.2?
can you check that this issue is fixed after upgrading to v1.1.2?
Hmm, it looks like it is still there:
ERROR: InterruptException:
Stacktrace:
[1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
[2] wait()
@ Base ./task.jl:872
[3] wait(c::Base.GenericCondition{Base.Threads.SpinLock})
@ Base ./condition.jl:123
[4] lock(rl::ReentrantLock)
@ Base ./lock.jl:100
[5] ssl_check_pending(ctx::MbedTLS.SSLContext)
@ MbedTLS ~/.julia/packages/MbedTLS/W4GKm/src/ssl.jl:606
[6] wait_for_decrypted_data(ctx::MbedTLS.SSLContext)
@ MbedTLS ~/.julia/packages/MbedTLS/W4GKm/src/ssl.jl:305
[7] eof
@ ~/.julia/packages/MbedTLS/W4GKm/src/ssl.jl:177 [inlined]
[8] read_to_buffer(t::HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}, sizehint::Int64)
@ HTTP.ConnectionPool ~/.julia/packages/HTTP/aTjcj/src/ConnectionPool.jl:274
[9] readuntil(t::HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}, f::Function, sizehint::Int64)
@ HTTP.ConnectionPool ~/.julia/packages/HTTP/aTjcj/src/ConnectionPool.jl:294
[10] readuntil
@ ~/.julia/packages/HTTP/aTjcj/src/ConnectionPool.jl:292 [inlined]
[11] readheaders(io::HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}, message::HTTP.Messages.Response)
@ HTTP.Messages ~/.julia/packages/HTTP/aTjcj/src/Messages.jl:537
[12] startread(http::HTTP.Streams.Stream{HTTP.Messages.Response, HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}})
@ HTTP.Streams ~/.julia/packages/HTTP/aTjcj/src/Streams.jl:170
[13] macro expansion
@ ~/.julia/packages/HTTP/aTjcj/src/StreamRequest.jl:67 [inlined]
[14] macro expansion
@ ./task.jl:399 [inlined]
[15] request(::Type{HTTP.StreamRequest.StreamLayer{Union{}}}, io::HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}, req::HTTP.Messages.Request, body::Vector{UInt8}; reached_redirect_limit::Bool, response_stream::Base.BufferStream, iofunction::Nothing, verbose::Int64, kw::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:require_ssl_verification,), Tuple{Bool}}})
@ HTTP.StreamRequest ~/.julia/packages/HTTP/aTjcj/src/StreamRequest.jl:57
[16] request(::Type{HTTP.ConnectionRequest.ConnectionPoolLayer{HTTP.StreamRequest.StreamLayer{Union{}}}}, url::URIs.URI, req::HTTP.Messages.Request, body::Vector{UInt8}; proxy::Nothing, socket_type::Type, reuse_limit::Int64, kw::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:iofunction, :require_ssl_verification, :response_stream), Tuple{Nothing, Bool, Base.BufferStream}}})
@ HTTP.ConnectionRequest ~/.julia/packages/HTTP/aTjcj/src/ConnectionRequest.jl:108
[17] request(::Type{HTTP.ExceptionRequest.ExceptionLayer{HTTP.ConnectionRequest.ConnectionPoolLayer{HTTP.StreamRequest.StreamLayer{Union{}}}}}, ::URIs.URI, ::Vararg{Any}; kw::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:iofunction, :require_ssl_verification, :response_stream), Tuple{Nothing, Bool, Base.BufferStream}}})
@ HTTP.ExceptionRequest ~/.julia/packages/HTTP/aTjcj/src/ExceptionRequest.jl:19
[18] request(::Type{HTTP.MessageRequest.MessageLayer{HTTP.ExceptionRequest.ExceptionLayer{HTTP.ConnectionRequest.ConnectionPoolLayer{HTTP.StreamRequest.StreamLayer{Union{}}}}}}, method::String, url::URIs.URI, headers::Vector{Pair{SubString{String}, SubString{String}}}, body::Vector{UInt8}; http_version::VersionNumber, target::String, parent::Nothing, iofunction::Nothing, kw::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:require_ssl_verification, :response_stream), Tuple{Bool, Base.BufferStream}}})
@ HTTP.MessageRequest ~/.julia/packages/HTTP/aTjcj/src/MessageRequest.jl:66
[19] request(::Type{HTTP.BasicAuthRequest.BasicAuthLayer{HTTP.MessageRequest.MessageLayer{HTTP.ExceptionRequest.ExceptionLayer{HTTP.ConnectionRequest.ConnectionPoolLayer{HTTP.StreamRequest.StreamLayer{Union{}}}}}}}, method::String, url::URIs.URI, headers::Vector{Pair{SubString{String}, SubString{String}}}, body::Vector{UInt8}; kw::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:require_ssl_verification, :response_stream), Tuple{Bool, Base.BufferStream}}})
@ HTTP.BasicAuthRequest ~/.julia/packages/HTTP/aTjcj/src/BasicAuthRequest.jl:28
[20] #request#1
@ ~/.julia/packages/HTTP/aTjcj/src/TopRequest.jl:15 [inlined]
[21] macro expansion
@ ~/.julia/packages/Mocking/MsKoy/src/mock.jl:29 [inlined]
[22] (::AWS.var"#40#42"{Request, DataType, OrderedCollections.LittleDict{Symbol, Any, Vector{Symbol}, Vector{Any}}})()
@ AWS ~/.julia/packages/AWS/E9zQ4/src/utilities/request.jl:221
[23] (::Base.var"#76#78"{Base.var"#76#77#79"{AWS.AWSExponentialBackoff, AWS.var"#41#43", AWS.var"#40#42"{Request, DataType, OrderedCollections.LittleDict{Symbol, Any, Vector{Symbol}, Vector{Any}}}}})(; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Base ./error.jl:294
[24] (::Base.var"#76#78"{Base.var"#76#77#79"{AWS.AWSExponentialBackoff, AWS.var"#41#43", AWS.var"#40#42"{Request, DataType, OrderedCollections.LittleDict{Symbol, Any, Vector{Symbol}, Vector{Any}}}}})()
@ Base ./error.jl:290
[25] _http_request(http_backend::AWS.HTTPBackend, request::Request, response_stream::IOBuffer)
@ AWS ~/.julia/packages/AWS/E9zQ4/src/utilities/request.jl:251
[26] macro expansion
@ ~/.julia/packages/Mocking/MsKoy/src/mock.jl:29 [inlined]
[27] (::AWS.var"#33#36"{AWSConfig, Request, IOBuffer, Vector{Int64}})()
@ AWS ~/.julia/packages/AWS/E9zQ4/src/utilities/request.jl:125
[28] (::AWS.var"#34#38"{AWS.var"#33#36"{AWSConfig, Request, IOBuffer, Vector{Int64}}, IOBuffer})()
@ AWS ~/.julia/packages/AWS/E9zQ4/src/utilities/request.jl:140
[29] (::Base.var"#76#78"{Base.var"#76#77#79"{AWS.AWSExponentialBackoff, AWS.var"#35#39"{AWSConfig, Vector{String}, Vector{String}, Int64}, AWS.var"#34#38"{AWS.var"#33#36"{AWSConfig, Request, IOBuffer, Vector{Int64}}, IOBuffer}}})(; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Base ./error.jl:294
[30] (::Base.var"#76#78"{Base.var"#76#77#79"{AWS.AWSExponentialBackoff, AWS.var"#35#39"{AWSConfig, Vector{String}, Vector{String}, Int64}, AWS.var"#34#38"{AWS.var"#33#36"{AWSConfig, Request, IOBuffer, Vector{Int64}}, IOBuffer}}})()
@ Base ./error.jl:290
[31] submit_request(aws::AWSConfig, request::Request; return_headers::Nothing)
@ AWS ~/.julia/packages/AWS/E9zQ4/src/utilities/request.jl:191
[32] (::RestXMLService)(request_method::String, request_uri::String, args::Dict{String, Any}; aws_config::AWSConfig, feature_set::AWS.FeatureSet)
@ AWS ~/.julia/packages/AWS/E9zQ4/src/AWS.jl:284
[33] put_object(Bucket::String, Key::String, params::Dict{String, Any}; aws_config::AWSConfig)
@ Main.S3 ~/.julia/packages/AWS/E9zQ4/src/services/s3.jl:5285
[34] put_object(Bucket::String, Key::String, params::Dict{String, Any})
@ Main.S3 ~/.julia/packages/AWS/E9zQ4/src/services/s3.jl:5285
ok, pinging @vtjnash here; I'm currently on vacation, but can help dig in and try to figure out what's going on once I get back (next week).
@alyst, we've had both MbedTLS.jl and HTTP.jl releases w/ additional bugfixes; can you confirm if your issue is still present?
@quinnj Thanks a lot for your work! Unfortunately, at the moment I don't run the workloads that expose these problems, but I should return to it in a few days. I'll let you know ASAP.
@quinnj I can confirm that I ran into this bug on MbedTLS 1.1.3, so it's not resolved quite yet. I saw the same RequestTimeTooSkewed
error on S3.put_object()
. I was able to resolve by restricting to <= 1.1.0.
@ianfiske, were you also upgraded to HTTP.jl 1.2.1?
No, I was on HTTP 0.9.17 -- I didn't realize that we were on such an old HTTP.jl!
Unfortunately, the issue was difficult to reproduce outside of our AWS environment, even locally with the exact same docker container. And because of details I can't get in to, I can't make changes to that environment right now to test again with latest MbedTLS and HTTP. When I have an opportunity to do such a test, I'll post an update here, unless @alyst gets a chance first.
any update @alyst? @ianfiske?
FWIW, I'm working on an updated package that does a tighter integration with HTTP.jl to do header signing for cloud providers. At least using my AWS code I have there, I can't reproduce this issue, either by doing a large download, then upload, or many small downloads and uploads.
It'd be great if we could confirm if this is still an issue and the clearest steps possible to reproduce so we can track down whatever is going on.
Just a quick update here, I have a bit of a version dependency knot that is making it rather difficult to upgrade HTTP past 0.9.17 to test if that works within our context. It's too complicated and tangential to get into details, but it's related to version bounds that we have for JuMP and Cbc. When we can untie this knot, and try this out, I'll post an update. Just letting you know that I didn't forget about this.
With the recent update,
S3.put_object()
of data 80000 bytes and more fails after the previousS3.get_object()
of the large (>=1GB) file (at least I believe that's the pattern).The request
is frozen for ~15mins, then fails with
I've checked, that is not because of the local clock skew: the
<RequestTime>
is the time it was actually submitted, and<ServerTime>
is the time the execution ofS3.put_object()
was unfrozen with the exception.When I interrupt the request, it reveals that it's sitting in MbedTLS:
Downgrading to MbedTLS 1.1.0 seems to solve the issue.