LilithHafner / InterLanguageSortingComparisons


`ThreadsX.sort!` #12

Open · jariji opened this issue 6 months ago

jariji commented 6 months ago

I don't know if parallel libraries are allowed in this repo, but this one seems to do a good job.

using ThreadsX, BenchmarkTools

julia> @btime sort!(xs) setup=(xs=rand(10^6));
  32.477 ms (3 allocations: 7.64 MiB)

julia> @btime ThreadsX.sort!(xs) setup=(xs=rand(10^6));
  7.693 ms (14183 allocations: 18.55 MiB)

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900XT 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
  Threads: 17 on 24 virtual cores
Environment:
  JULIA_NUM_THREADS = 12
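
(A quick sanity check of my own, not from the original comment: since ThreadsX mirrors the Base sorting API, the parallel result should match the serial one.)

using ThreadsX

# Sort two copies of the same data, one with the parallel sort and one with
# Base's serial sort, and check that they agree.
xs = rand(10^6)
@assert ThreadsX.sort!(copy(xs)) == sort!(copy(xs))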
LilithHafner commented 6 months ago

Parallel libraries are okay. I'd make a note in the legend that it's parallel. However, ThreadsX.sort! seems to have a 2.5-3x overhead vs Base.sort!, and my benchmarking machine only has 4 physical cores, so it would not be a particularly good portrayal of the algorithm. I try to only include non-default algorithms if they are substantially better or more popular than the defaults.

I'll leave this open in case I come across a benchmarking machine with more cores or there are enough other threaded algorithms proposed to make a separate chart.

[Screenshot from 2024-02-18 12-44-13]

Julia Version 1.10.1
Commit 7790d6f0641 (2024-02-13 20:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: 8 × unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, generic)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_EDITOR = code
jariji commented 6 months ago

Sounds good.

I'm curious: how is the overhead measured?

LilithHafner commented 6 months ago

I calculated overhead = (runtime of ThreadsX.sort!) * threads / (runtime of Base.sort!)
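
For example, taking the 4-thread measurement below (my arithmetic, not from the original comment): 6.778 ms × 4 / 9.025 ms ≈ 3.0, i.e. the parallel sort spends roughly 3× as much total CPU time as the serial sort to achieve a ~1.33× wall-clock speedup, consistent with the 2.5-3x overhead figure mentioned above.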

# Measured in ms using -t1, -t2, ..., -t8
# @btime ThreadsX.sort!(xs) setup=(xs=rand(10^6));
threadsx = [24.551
            12.859
            8.818
            6.778
            6.369
            6.107
            5.795
            5.776]

# Measured once using -t1
# @btime sort!(xs) setup=(xs=rand(10^6));
default = fill(9.025, 8)
threads = 1:8
speedup = default ./ threadsx
overhead_factor = threadsx .* threads ./ default

using GLMakie

f = Figure()
ax = Axis(f[1, 1], ylabel="runtime in ms", xticks=1:8, xlabel="threads", yticks=0:5:25)
plot!(ax, threads, threadsx, label="threadsx")  # ThreadsX.sort! runtime vs. thread count
lines!(ax, threads, default, label="default")   # single-threaded Base.sort! baseline
plot!(ax, [1], [0], marker=:plus)                # lone point at (1, 0), presumably to force the y-axis down to zero
axislegend(ax)

ax2 = Axis(f[1, 2], ylabel="ratio vs. default", xticks=1:8, xlabel="threads", yticks=0:5)
plot!(ax2, threads, speedup, label="runtime speedup")
plot!(ax2, threads, overhead_factor, label="overhead factor")
plot!(ax2, [1], [0], marker=:plus)  # again a point at (1, 0), presumably to force the y-axis to include zero
axislegend(ax2, position = :ct)

f

This assumes 100% utilization of every thread the whole time, which is probably close to true but not exact.
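
(A small follow-up of my own, using the threadsx, default, and threads vectors from the snippet above: the reciprocal of the overhead factor is the parallel efficiency, i.e. the fraction of the ideal linear speedup actually achieved.)

# Parallel efficiency: speedup per thread, equivalently 1 ./ overhead_factor.
efficiency = default ./ (threadsx .* threads)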