Open kkocdko opened 5 months ago
For anyone who might stumble upon this: building for an x86-64 microarchitecture level can squeeze out some more performance, depending on the compression level and hardware capabilities.
Benchmark 1 = plain build
Benchmark 2 = the binary linked above
Benchmark 3 = LTO, PGO and x86-64-v3 build
Benchmark 4 = LTO, PGO and x86-64-v4 build
Benchmark 1: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect /tmp/tst; rm -rf /tmp/tst
Time (mean ± σ): 3.139 s ± 0.014 s [User: 3.102 s, System: 0.031 s]
Range (min … max): 3.120 s … 3.159 s 5 runs
Benchmark 2: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert /tmp/tst; rm -rf /tmp/tst
Time (mean ± σ): 2.911 s ± 0.013 s [User: 2.878 s, System: 0.026 s]
Range (min … max): 2.895 s … 2.926 s 5 runs
Benchmark 3: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 /tmp/tst; rm -rf /tmp/tst
Time (mean ± σ): 2.865 s ± 0.005 s [User: 2.828 s, System: 0.030 s]
Range (min … max): 2.858 s … 2.871 s 5 runs
Benchmark 4: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 /tmp/tst; rm -rf /tmp/tst
Time (mean ± σ): 2.880 s ± 0.002 s [User: 2.843 s, System: 0.030 s]
Range (min … max): 2.878 s … 2.882 s 5 runs
Summary
mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 /tmp/tst; rm -rf /tmp/tst ran
1.01 ± 0.00 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 /tmp/tst; rm -rf /tmp/tst
1.02 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert /tmp/tst; rm -rf /tmp/tst
1.10 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect /tmp/tst; rm -rf /tmp/tst
At default settings the difference is negligible; if that is all you use, don't bother.
Benchmark 1: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -5 /tmp/tst; rm -rf /tmp/tst
Time (mean ± σ): 6.439 s ± 0.096 s [User: 6.389 s, System: 0.037 s]
Range (min … max): 6.334 s … 6.548 s 5 runs
Benchmark 2: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -5 /tmp/tst; rm -rf /tmp/tst
Time (mean ± σ): 5.213 s ± 0.017 s [User: 5.164 s, System: 0.037 s]
Range (min … max): 5.193 s … 5.230 s 5 runs
Benchmark 3: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -5 /tmp/tst; rm -rf /tmp/tst
Time (mean ± σ): 4.358 s ± 0.016 s [User: 4.307 s, System: 0.040 s]
Range (min … max): 4.340 s … 4.379 s 5 runs
Benchmark 4: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -5 /tmp/tst; rm -rf /tmp/tst
Time (mean ± σ): 4.258 s ± 0.010 s [User: 4.208 s, System: 0.040 s]
Range (min … max): 4.251 s … 4.276 s 5 runs
Summary
mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -5 /tmp/tst; rm -rf /tmp/tst ran
1.02 ± 0.00 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -5 /tmp/tst; rm -rf /tmp/tst
1.22 ± 0.00 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -5 /tmp/tst; rm -rf /tmp/tst
1.51 ± 0.02 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -5 /tmp/tst; rm -rf /tmp/tst
At -5, the microarchitecture level does almost as much as adding PGO did.
Benchmark 1: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -9 /tmp/tst; rm -rf /tmp/tst
Time (mean ± σ): 65.767 s ± 0.164 s [User: 65.578 s, System: 0.052 s]
Range (min … max): 65.602 s … 66.035 s 5 runs
Benchmark 2: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -9 /tmp/tst; rm -rf /tmp/tst
Time (mean ± σ): 43.676 s ± 0.030 s [User: 43.521 s, System: 0.052 s]
Range (min … max): 43.637 s … 43.711 s 5 runs
Benchmark 3: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -9 /tmp/tst; rm -rf /tmp/tst
Time (mean ± σ): 27.658 s ± 0.162 s [User: 27.531 s, System: 0.056 s]
Range (min … max): 27.488 s … 27.927 s 5 runs
Benchmark 4: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -9 /tmp/tst; rm -rf /tmp/tst
Time (mean ± σ): 25.154 s ± 0.079 s [User: 25.034 s, System: 0.054 s]
Range (min … max): 25.058 s … 25.256 s 5 runs
Summary
mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -9 /tmp/tst; rm -rf /tmp/tst ran
1.10 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -9 /tmp/tst; rm -rf /tmp/tst
1.74 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -9 /tmp/tst; rm -rf /tmp/tst
2.61 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -9 /tmp/tst; rm -rf /tmp/tst
Quite a bump: at -9 the v3 build shaves off at least 16 seconds compared to the linked binary and more than halves the time compared to the plain build.
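For anyone who wants to double-check the -9 summary, here is a small sketch that recomputes the ratios from the mean times reported above. Only the four means come from the hyperfine output; the function name `speedups` is made up for illustration.

```python
# Mean wall-clock times in seconds, copied from the -9 benchmark above.
MEANS_L9 = {
    "ect": 65.767,          # plain build
    "ect_clevert": 43.676,  # the binary linked above
    "ect_v3": 27.658,       # LTO + PGO + x86-64-v3
    "ect_v4": 25.154,       # LTO + PGO + x86-64-v4
}

def speedups(means):
    """Return how many times slower each build is than the fastest one."""
    fastest = min(means.values())
    return {name: round(t / fastest, 2) for name, t in means.items()}

if __name__ == "__main__":
    for name, ratio in sorted(speedups(MEANS_L9).items(), key=lambda kv: kv[1]):
        print(f"{name}: {ratio:.2f}x the time of the fastest build")
    # Seconds saved by v3 over the linked binary, and plain vs. v3 ratio.
    print(round(MEANS_L9["ect_clevert"] - MEANS_L9["ect_v3"], 3))
    print(round(MEANS_L9["ect"] / MEANS_L9["ect_v3"], 2))
```

The ratios match the 1.10 / 1.74 / 2.61 figures in the summary, and the last two lines give the roughly 16 seconds saved and the better-than-2x gain over the plain build.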
@ghtm2 Could you provide your binary? The other day I tested avx256 and avx512 builds, but they ran even slower on my machine (AMD R5 5600U, Zen 3). If enabling AVX makes it faster, that's quite a big bump! Also, which CPU did you use for your benchmark?
@ghtm2 Hi, did you have nasm installed while building the binary?
@ghtm2 Could you provide your binary? The other day I tested avx256 and avx512 builds, but they ran even slower on my machine (AMD R5 5600U, Zen 3). If enabling AVX makes it faster, that's quite a big bump! Also, which CPU did you use for your benchmark?
Sure, here are the v3 and v4 binaries: ect.tar.gz. You'll need at least glibc 2.38 installed, though. The CPU used is an AMD Ryzen 7 7840U, so Zen 4.
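A v3 or v4 binary only runs on a CPU that implements the corresponding level, so it can be worth checking before downloading. Below is a minimal sketch, assuming Linux and the flag names printed in /proc/cpuinfo. The grouping follows the x86-64 psABI levels, with `abm` standing in for LZCNT and `xsave` for OSXSAVE, as is commonly done in such checks. The function names are hypothetical.

```python
# Feature flags as named in /proc/cpuinfo on Linux, grouped by x86-64 level.
# Each level also requires all flags of the levels below it.
LEVEL_FLAGS = [
    (1, {"lm", "cmov", "cx8", "fpu", "fxsr", "mmx", "syscall", "sse2"}),
    (2, {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}),
    (3, {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}),
    (4, {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]

def x86_64_level(flags):
    """Return the highest level (0 to 4) whose flags are all present."""
    present = set(flags)
    level = 0
    for n, required in LEVEL_FLAGS:
        if not required <= present:
            break  # a missing flag rules out this and every higher level
        level = n
    return level

def read_cpu_flags(path="/proc/cpuinfo"):
    """Read the flag set of the first CPU listed in /proc/cpuinfo."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

if __name__ == "__main__":
    print("x86-64-v%d" % x86_64_level(read_cpu_flags()))
```

If this prints x86-64-v3, the ect_v3 binary should start but ect_v4 will most likely die with an illegal instruction.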
@ghtm2 Hi, did you have nasm installed while building the binary?
Yes.
@ghtm2 Awesome! Your binary is much faster; how did you do that? I appended -march=x86-64-v3 -mavx2 here, but the result is even slower: my benchmark went from 48s to 1m27s, while your ect_v3 binary takes 26s.
My whole build script is here; I built with llvm-19. Did you use GCC?
https://github.com/clevert-app/clevert/blob/main/.github/workflows/asset_zcodecs.yml#L171
I really, really want to replicate your success.
I ran objdump on your binary. Was it built with GCC 14.2.1?
I reproduced your benchmark. It's faster using GCC instead of Clang. I will try to tweak it more. Thank you!
Sorry for the glacial response times, I'm quite busy at the moment.
Yes, I've built it with GCC 14.2.1, as that is what's currently shipped on Arch. I can also confirm that Clang produces noticeably slower ect binaries, no matter the flags.
I've made a small howto to reproduce the build on Arch and derivatives: howto.tar.gz
I'm pretty sure that there is still some performance to be had with the appropriate flags and better input for PGO. One might also want to try to optimize further with BOLT, but I currently don't have the time to try.
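For readers who cannot open the archive, here is a rough sketch of what a two-stage GCC build with LTO, PGO and a microarchitecture level generally looks like. This is not the content of howto.tar.gz. The `-O3` level, the profile directory and the helper name `pgo_flags` are assumptions; only `-flto`, `-march=`, `-fprofile-generate`, `-fprofile-use` and `-fprofile-correction` are standard GCC options.

```python
def pgo_flags(stage, march="x86-64-v3", profile_dir="/tmp/ect-pgo"):
    """Return GCC flags for one stage of a two-stage PGO + LTO build.

    stage is "generate" (instrumented build) or "use" (final build).
    march and profile_dir are illustrative defaults, not project settings.
    """
    common = ["-O3", "-flto", f"-march={march}"]
    if stage == "generate":
        # Instrumented binary: writes profile data into profile_dir when run.
        return common + [f"-fprofile-generate={profile_dir}"]
    if stage == "use":
        # Final binary: optimized using the profiles collected in stage one.
        return common + [f"-fprofile-use={profile_dir}", "-fprofile-correction"]
    raise ValueError(f"unknown stage: {stage!r}")

if __name__ == "__main__":
    # Print flag strings to pass to the build system, e.g. via CFLAGS/CXXFLAGS.
    print("1. build with: ", " ".join(pgo_flags("generate")))
    print("2. run the instrumented binary on representative PNGs")
    print("3. rebuild with:", " ".join(pgo_flags("use")))
```

Swapping `march="x86-64-v4"` gives the flags for a v4 build. The quality of step 2 matters: the profile only helps for code paths the training run actually exercises.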
Profile-Guided Optimizations enabled.
The real result depends on your workload.