JuliaLang / Downloads.jl

MIT License
88 stars 35 forks source link

Too much lock contention? #247

Open IanButterworth opened 3 months ago

IanButterworth commented 3 months ago

Trying to investigate this at the JuliaCon hackathon. Opening for discussion.

$ julia -t4

julia> function foo(url, n; downloader=()->Downloads.Downloader(;grace=0))
       Threads.@threads for _ in 1:n
           Downloads.download(url; downloader=downloader())
       end
       end
foo (generic function with 3 methods)

julia> @time foo("https://julialang-s3.julialang.org/bin/mac/aarch64/1.10/julia-1.10.4-macaarch64.dmg", 4, downloader=Returns(nothing))
 28.483014 seconds (916.99 k allocations: 659.244 MiB, 0.70% gc time, 22905 lock conflicts, 1.15% compilation time)

julia> @time foo("https://julialang-s3.julialang.org/bin/mac/aarch64/1.10/julia-1.10.4-macaarch64.dmg", 4)
 26.974909 seconds (1.23 M allocations: 675.870 MiB, 0.82% gc time, 1801 lock conflicts, 1.32% compilation time)

Note 1801 lock conflicts even on the task local Downloader version

StefanKarpinski commented 3 months ago

Can you figure out which locks have contention?

StefanKarpinski commented 3 months ago

I suspect it's not the download start lock but probably the IO locks.

IanButterworth commented 3 months ago

This is the most common lock from @profile https://github.com/JuliaLang/Downloads.jl/blob/971840a9da72ba3743b837e1217c697db7bab298/src/Curl/Multi.jl#L197-L201

I've also moved to hosting a file locally on a http server and downloading from tha, to test theoretical limit because the conference and hotel wifi isn't great, and with that I see a lot fewer locks.

one download per thread

julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 6, downloader=Returns(nothing))
  0.958359 seconds (493.32 k allocations: 945.356 MiB, 7.87% gc time, 2380 lock conflicts)

julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 6)
  0.949370 seconds (460.36 k allocations: 790.393 MiB, 11.77% gc time, 31 lock conflicts)

many downloads per thread

julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 60, downloader=Returns(nothing))
 21.439162 seconds (4.94 M allocations: 9.233 GiB, 4.32% gc time, 24624 lock conflicts)

julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 60)
 12.926270 seconds (5.43 M allocations: 9.108 GiB, 3.51% gc time, 567 lock conflicts)

I'm not sure whether a locally http server is a fair test of this.. but the speed is a lot faster.

And if I set https://curl.se/libcurl/c/CURLOPT_BUFFERSIZE.html

setopt(easy, CURLOPT_BUFFERSIZE, 10_000_000)

we get a good speed up (both here with task local Downloaders)

julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 6)
  0.568647 seconds (709.51 k allocations: 956.762 MiB, 30.29% gc time, 19 lock conflicts, 102.74% compilation time: <1% of which was recompilation)

julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 60)
  3.849536 seconds (4.77 M allocations: 9.075 GiB, 21.38% gc time, 14 lock conflicts, 0.53% compilation time)

Apparently the default buffersize is 16kB and the max is 10MB

IanButterworth commented 3 months ago

Take all the numbers here with a pinch of salt, I just tried default settings again and saw similar speeds to the latter results..

Note that the buffersize is increased in this example https://github.com/curl/curl/blob/3765d75ce47b66289f946382b649d0e99389dc77/docs/examples/http2-download.c#L168-L169

and curl itself uses 100kB https://github.com/curl/curl/blob/3765d75ce47b66289f946382b649d0e99389dc77/src/tool_operate.c#L271

Some comparisons:

Downloads.jl - normal default settings from master

julia> for _ in 1:5 @time Downloads.download("http://localhost:8000/julia-1.10.4-macaarch64.dmg"); end
  0.104941 seconds (92.07 k allocations: 158.067 MiB, 7.03% gc time)
  0.090938 seconds (92.08 k allocations: 158.068 MiB, 6.12% gc time)
  0.096240 seconds (92.07 k allocations: 158.067 MiB, 5.23% gc time)
  0.090895 seconds (92.08 k allocations: 158.067 MiB, 5.55% gc time)
  0.088865 seconds (92.08 k allocations: 158.068 MiB, 5.88% gc time)

julia> 163 / 0.09
1811.111111111111

curl cli - 1 GB/s

% url=http://localhost:8000/julia-1.10.4-macaarch64.dmg
% curl -O -s -w "%{speed_download}" "$url" | awk '{printf "Download speed: %.2f GB/s\n", $1/1024/1024/1024}'
Download speed: 1.02 GB/s
...
Download speed: 1.17 GB/s
Download speed: 1.10 GB/s
Download speed: 1.04 GB/s

wget cli - 2.6 GB/s

And wget gives 2.6 GB/s

% wget --output-document=/dev/null http://localhost:8000/julia-1.10.4-macaarch64.dmg
--2024-07-13 23:33:14--  http://localhost:8000/julia-1.10.4-macaarch64.dmg
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 163644235 (156M) [application/x-apple-diskimage]
Saving to: '/dev/null'

/dev/null                                     100%[==============================================================================================>] 156.06M  --.-KB/s    in 0.06s   

2024-07-13 23:33:15 (2.50 GB/s) - '/dev/null' saved [163644235/163644235]
...
2024-07-13 23:41:44 (2.72 GB/s) - ‘/dev/null’ saved [163644235/163644235]
2024-07-13 23:41:45 (2.67 GB/s) - ‘/dev/null’ saved [163644235/163644235]
2024-07-13 23:41:46 (2.82 GB/s) - ‘/dev/null’ saved [163644235/163644235]

Thanks to @giordano for pointing some of this out.

IanButterworth commented 3 months ago

Using https://github.com/JuliaLang/julia/pull/55120 it turns out the lock that is mainly hitting the slow lock is https://github.com/JuliaLang/Downloads.jl/blob/d363a02bac50ddff422e14a5aa89464dc7fdad0b/src/Curl/Multi.jl#L144

and that's happening with timeout_ms == 0


One thing I'm noticing is that for

Downloads.download(url; downloader=Downloads.Downloader(;grace=0))

we never see lock conflicts.

Nor for

fetch(@spawn Downloads.download(url; downloader=Downloads.Downloader(;grace=0)))

But we do for

Threads.@threads for _ in 1:10
    Downloads.download(url; downloader=Downloads.Downloader(;grace=0))
end

A small number can be attributed to this lock when generating the Downloaders (which could be more efficiently managed to be task-local in the @thread for) https://github.com/JuliaLang/Downloads.jl/blob/d363a02bac50ddff422e14a5aa89464dc7fdad0b/src/Curl/Multi.jl#L11

IanButterworth commented 3 months ago

I think a few things we might want to consider doing are:

1) By default setting setopt(easy, CURLOPT_BUFFERSIZE, 10_000_000) - increase from the default of 16 kB to 10 MB 2) Increase CONNECT_SEMAPHORE https://github.com/JuliaLang/Downloads.jl/blob/d363a02bac50ddff422e14a5aa89464dc7fdad0b/src/Curl/Easy.jl#L56 3) Document that when using concurrent tasks it's best to use a task-local Downloader

I'll benchmark the parameter space when on a strong internet connection.

In the meantime I've spun up 4 local http servers on different ports from a ramdisk, and accessing them individually from each task, but I'm not sure it's a fair test.

This is 4 sequential downloads on each of 4 concurrent tasks, 10 times

julia> foo(4, 4, 10)
1532.902446759521 MB/s
2447.040027683379 MB/s
1884.3813202392432 MB/s
2017.8544918394593 MB/s
1973.309438735145 MB/s
1632.805491498009 MB/s
2153.163424235369 MB/s
1935.19328559164 MB/s
1728.2429605092427 MB/s
1970.3497493574216 MB/s

Added results from locally hosted servers here https://github.com/JuliaLang/Downloads.jl/compare/master...ib/increase_buffersize

IanButterworth commented 3 months ago

I think I'm reaching a conclusion here.

1) The current defaults for buffersize & CONNECT_SEMAPHORE are actually optimal already.

I'll close.

StefanKarpinski commented 3 months ago

I'd really like for the defaults to be fast and not require jumping through any hoops. We've put so much work into making this concurrent under the hood with a simple blocking API, seems lame to not make it actually fast. So I'd like to actually fix this and make the warning unnecessary.

I have a few paths I'd like to explore. There's an option for multithreaded libcurl usage—there might be a way to make that work and avoid needing a lock. Another option would be to create a Downloader object per thread. The only way to observe that is the easy hook stuff which is an unofficial and not great API anyway, so we could reasonably do that. The other approach would be something with channels or other ways to avoid locks.