Open IanButterworth opened 3 months ago
Can you figure out which locks have contention?
I suspect it's not the download start lock but probably the IO locks.
This is the most common lock from @profile
https://github.com/JuliaLang/Downloads.jl/blob/971840a9da72ba3743b837e1217c697db7bab298/src/Curl/Multi.jl#L197-L201
I've also moved to hosting a file locally on a http server and downloading from tha, to test theoretical limit because the conference and hotel wifi isn't great, and with that I see a lot fewer locks.
julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 6, downloader=Returns(nothing))
0.958359 seconds (493.32 k allocations: 945.356 MiB, 7.87% gc time, 2380 lock conflicts)
julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 6)
0.949370 seconds (460.36 k allocations: 790.393 MiB, 11.77% gc time, 31 lock conflicts)
julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 60, downloader=Returns(nothing))
21.439162 seconds (4.94 M allocations: 9.233 GiB, 4.32% gc time, 24624 lock conflicts)
julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 60)
12.926270 seconds (5.43 M allocations: 9.108 GiB, 3.51% gc time, 567 lock conflicts)
I'm not sure whether a locally http server is a fair test of this.. but the speed is a lot faster.
And if I set https://curl.se/libcurl/c/CURLOPT_BUFFERSIZE.html
setopt(easy, CURLOPT_BUFFERSIZE, 10_000_000)
we get a good speed up (both here with task local Downloader
s)
julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 6)
0.568647 seconds (709.51 k allocations: 956.762 MiB, 30.29% gc time, 19 lock conflicts, 102.74% compilation time: <1% of which was recompilation)
julia> @time foo("http://localhost:8000/julia-1.10.4-macaarch64.dmg", 60)
3.849536 seconds (4.77 M allocations: 9.075 GiB, 21.38% gc time, 14 lock conflicts, 0.53% compilation time)
Apparently the default buffersize is 16kB and the max is 10MB
Take all the numbers here with a pinch of salt, I just tried default settings again and saw similar speeds to the latter results..
Note that the buffersize is increased in this example https://github.com/curl/curl/blob/3765d75ce47b66289f946382b649d0e99389dc77/docs/examples/http2-download.c#L168-L169
and curl itself uses 100kB https://github.com/curl/curl/blob/3765d75ce47b66289f946382b649d0e99389dc77/src/tool_operate.c#L271
Some comparisons:
julia> for _ in 1:5 @time Downloads.download("http://localhost:8000/julia-1.10.4-macaarch64.dmg"); end
0.104941 seconds (92.07 k allocations: 158.067 MiB, 7.03% gc time)
0.090938 seconds (92.08 k allocations: 158.068 MiB, 6.12% gc time)
0.096240 seconds (92.07 k allocations: 158.067 MiB, 5.23% gc time)
0.090895 seconds (92.08 k allocations: 158.067 MiB, 5.55% gc time)
0.088865 seconds (92.08 k allocations: 158.068 MiB, 5.88% gc time)
julia> 163 / 0.09
1811.111111111111
% url=http://localhost:8000/julia-1.10.4-macaarch64.dmg
% curl -O -s -w "%{speed_download}" "$url" | awk '{printf "Download speed: %.2f GB/s\n", $1/1024/1024/1024}'
Download speed: 1.02 GB/s
...
Download speed: 1.17 GB/s
Download speed: 1.10 GB/s
Download speed: 1.04 GB/s
And wget gives 2.6 GB/s
% wget --output-document=/dev/null http://localhost:8000/julia-1.10.4-macaarch64.dmg
--2024-07-13 23:33:14-- http://localhost:8000/julia-1.10.4-macaarch64.dmg
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 163644235 (156M) [application/x-apple-diskimage]
Saving to: '/dev/null'
/dev/null 100%[==============================================================================================>] 156.06M --.-KB/s in 0.06s
2024-07-13 23:33:15 (2.50 GB/s) - '/dev/null' saved [163644235/163644235]
...
2024-07-13 23:41:44 (2.72 GB/s) - ‘/dev/null’ saved [163644235/163644235]
2024-07-13 23:41:45 (2.67 GB/s) - ‘/dev/null’ saved [163644235/163644235]
2024-07-13 23:41:46 (2.82 GB/s) - ‘/dev/null’ saved [163644235/163644235]
Thanks to @giordano for pointing some of this out.
Using https://github.com/JuliaLang/julia/pull/55120 it turns out the lock that is mainly hitting the slow lock is https://github.com/JuliaLang/Downloads.jl/blob/d363a02bac50ddff422e14a5aa89464dc7fdad0b/src/Curl/Multi.jl#L144
and that's happening with timeout_ms == 0
One thing I'm noticing is that for
Downloads.download(url; downloader=Downloads.Downloader(;grace=0))
we never see lock conflicts.
Nor for
fetch(@spawn Downloads.download(url; downloader=Downloads.Downloader(;grace=0)))
But we do for
Threads.@threads for _ in 1:10
Downloads.download(url; downloader=Downloads.Downloader(;grace=0))
end
A small number can be attributed to this lock when generating the Downloaders (which could be more efficiently managed to be task-local in the @thread
for)
https://github.com/JuliaLang/Downloads.jl/blob/d363a02bac50ddff422e14a5aa89464dc7fdad0b/src/Curl/Multi.jl#L11
I think a few things we might want to consider doing are:
1) By default setting setopt(easy, CURLOPT_BUFFERSIZE, 10_000_000)
- increase from the default of 16 kB to 10 MB
2) Increase CONNECT_SEMAPHORE
https://github.com/JuliaLang/Downloads.jl/blob/d363a02bac50ddff422e14a5aa89464dc7fdad0b/src/Curl/Easy.jl#L56
3) Document that when using concurrent tasks it's best to use a task-local Downloader
I'll benchmark the parameter space when on a strong internet connection.
In the meantime I've spun up 4 local http servers on different ports from a ramdisk, and accessing them individually from each task, but I'm not sure it's a fair test.
This is 4 sequential downloads on each of 4 concurrent tasks, 10 times
julia> foo(4, 4, 10)
1532.902446759521 MB/s
2447.040027683379 MB/s
1884.3813202392432 MB/s
2017.8544918394593 MB/s
1973.309438735145 MB/s
1632.805491498009 MB/s
2153.163424235369 MB/s
1935.19328559164 MB/s
1728.2429605092427 MB/s
1970.3497493574216 MB/s
Added results from locally hosted servers here https://github.com/JuliaLang/Downloads.jl/compare/master...ib/increase_buffersize
I think I'm reaching a conclusion here.
1) The current defaults for buffersize & CONNECT_SEMAPHORE
are actually optimal already.
Downloader
from different threads does seem to hit enough lock contention to slow things down. I'll add a note in the docs to highlight that.I'll close.
I'd really like for the defaults to be fast and not require jumping through any hoops. We've put so much work into making this concurrent under the hood with a simple blocking API, seems lame to not make it actually fast. So I'd like to actually fix this and make the warning unnecessary.
I have a few paths I'd like to explore. There's an option for multithreaded libcurl usage—there might be a way to make that work and avoid needing a lock. Another option would be to create a Downloader object per thread. The only way to observe that is the easy hook stuff which is an unofficial and not great API anyway, so we could reasonably do that. The other approach would be something with channels or other ways to avoid locks.
Trying to investigate this at the JuliaCon hackathon. Opening for discussion.
Note
1801 lock conflicts
even on the task localDownloader
version