JuliaLang / Downloads.jl


High contention use of Downloads.curl silently errors #245

Open haberdashPI opened 1 month ago

haberdashPI commented 1 month ago

In a high-contention situation (96 threads, many small files), I am getting the following output when downloading files via AWS.jl/AWSS3.jl:

┌ Error: curl_multi_socket_action: 8
└ @ Downloads.Curl ~/.julia/juliaup/julia-1.9.4+0.x64.linux.gnu/share/julia/stdlib/v1.9/Downloads/src/Curl/utils.jl:57
┌ Error: curl_multi_socket_action: 8
└ @ Downloads.Curl ~/.julia/juliaup/julia-1.9.4+0.x64.linux.gnu/share/julia/stdlib/v1.9/Downloads/src/Curl/utils.jl:57
[...the same error/location pair repeated many more times...]
ERROR: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:349 [inlined]
  [2] fetch
    @ ./task.jl:369 [inlined]
  [3] fetch
    @ ~/.julia/packages/StableTasks/3CrzR/src/internals.jl:9 [inlined]
  [4] macro expansion
    @ ./reduce.jl:260 [inlined]
  [5] macro expansion
    @ ./simdloop.jl:77 [inlined]
  [6] mapreduce_impl(f::typeof(fetch), op::OhMyThreads.Implementation.var"#99#100", A::Vector{StableTasks.StableTask{Nothing}}, ifirst::Int64, ilast::Int64, blksize::Int64)
    @ Base ./reduce.jl:258
  [7] mapreduce_impl
    @ ./reduce.jl:272 [inlined]
  [8] _mapreduce(f::typeof(fetch), op::OhMyThreads.Implementation.var"#99#100", #unused#::IndexLinear, A::Vector{StableTasks.StableTask{Nothing}})
    @ Base ./reduce.jl:442
  [9] _mapreduce_dim(f::Function, op::Function, #unused#::Base._InitialValue, A::Vector{StableTasks.StableTask{Nothing}}, #unused#::Colon)
    @ Base ./reducedim.jl:365
 [10] #mapreduce#801
    @ ./reducedim.jl:357 [inlined]
 [11] mapreduce(f::Function, op::Function, A::Vector{StableTasks.StableTask{Nothing}})
    @ Base ./reducedim.jl:357
 [12] _tmapreduce(f::Function, op::Function, Arrs::Tuple{Vector{AWSS3.S3Path{Nothing}}, Vector{DataFrames.AbstractDataFrame}}, #unused#::Type{Nothing}, scheduler::OhMyThreads.Schedulers.DynamicScheduler{OhMyThreads.Schedulers.FixedCount}, mapreduce_kwargs::NamedTuple{(:init,), Tuple{Nothing}})
    @ OhMyThreads.Implementation ~/.julia/packages/OhMyThreads/V13wc/src/implementation.jl:96
 [13] tmapreduce(::Function, ::Function, ::Vector{AWSS3.S3Path{Nothing}}, ::Vararg{Any}; scheduler::OhMyThreads.Schedulers.NotGiven, outputtype::Type, init::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ OhMyThreads.Implementation ~/.julia/packages/OhMyThreads/V13wc/src/implementation.jl:68
 [14] tforeach(::Function, ::Vector{AWSS3.S3Path{Nothing}}, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ OhMyThreads.Implementation ~/.julia/packages/OhMyThreads/V13wc/src/implementation.jl:294
 [15] tforeach
    @ ~/.julia/packages/OhMyThreads/V13wc/src/implementation.jl:293 [inlined]

...more stacktrace...

nested task error: AWS.AWSExceptions.AWSException: RequestTimeout -- Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.

    HTTP.Exceptions.StatusError(400, "PUT", "...[elided]...", HTTP.Messages.Response:
    """
    HTTP/1.1 400 Bad Request
    x-amz-request-id: N1J1KZKTENGH5DNK
    x-amz-id-2: l3Wg7BNYvxe4fLCWkK12DYVBaFK1USQHL6rGjzTdGbNnU7LnhF2TWL/XDjpjUjZvJAXVKvPnc+4=
    content-type: application/xml
    transfer-encoding: chunked
    date: Wed, 29 May 2024 15:34:22 GMT
    server: AmazonS3
    connection: close

    [Message Body was streamed]""")

    <?xml version="1.0" encoding="UTF-8"?>
    <Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>N1J1KZKTENGH5DNK</RequestId><HostId>l3Wg7BNYvxe4fLCWkK12DYVBaFK1USQHL6rGjzTdGbNnU7LnhF2TWL/XDjpjUjZvJAXVKvPnc+4=</HostId></Error>

Unfortunately, the code and some parts of the stack trace include proprietary information that I cannot post publicly. At a guess, though, a simple use of Threads.@spawn to upload/download many files via AWS.jl/AWSS3.jl (or even via Downloads.jl directly) with many threads running concurrently would also trigger this, since the basic step where the failure occurs is not complicated. I will try to create an MWE when I have the bandwidth.
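For reference, here is a hypothetical sketch of the kind of MWE I have in mind. The URLs are placeholders, and I have not confirmed that this exact snippet reproduces the error:

```julia
using Downloads

# Placeholder list of many small objects; any set of URLs should do.
urls = ["https://example.com/file$(i).bin" for i in 1:1000]

# One task per download on a many-threaded session (e.g. `julia -t 96`),
# so that many curl handles are driven concurrently.
tasks = [Threads.@spawn Downloads.download(url, tempname()) for url in urls]
foreach(wait, tasks)
```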

Of course, I quickly learned that one can easily work around this with asyncmap and the like, but it would be nice if that weren't necessary.
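For anyone hitting the same thing, the workaround is just to keep the downloads on cooperatively scheduled async tasks rather than spawning one threaded task per file. A minimal sketch, reusing the placeholder `urls` from the snippet above:

```julia
# Workaround sketch: bounded concurrency on async (non-threaded) tasks;
# `ntasks` caps how many downloads are in flight at once.
asyncmap(url -> Downloads.download(url, tempname()), urls; ntasks=16)
```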

StefanKarpinski commented 1 month ago

Agree, using asyncmap should not be necessary.