fhanau / Efficient-Compression-Tool

Fast and effective C++ file optimizer
Apache License 2.0

Prebuilt binary with PGO here #141

Open kkocdko opened 3 months ago

kkocdko commented 3 months ago

Update 2024-09-02: use this newer version, then run ./zcodecs ect xxx.

Profile-Guided Optimizations enabled.

[kkocdko@klf misc]$ ./hyperfine -w 1 -r 5 './ect_flto -5 1.png 2.png 3.png '
Benchmark 1: ./ect_flto -5 1.png 2.png 3.png 
  Time (mean ± σ):      5.400 s ±  0.011 s    [User: 5.351 s, System: 0.035 s]
  Range (min … max):    5.389 s …  5.411 s    5 runs

[kkocdko@klf misc]$ ./hyperfine -w 1 -r 5 './ect_flto_pgo -5 1.png 2.png 3.png '
Benchmark 1: ./ect_flto_pgo -5 1.png 2.png 3.png 
  Time (mean ± σ):      4.481 s ±  0.014 s    [User: 4.428 s, System: 0.042 s]
  Range (min … max):    4.469 s …  4.503 s    5 runs


The real result depends on your workload.
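To check this on your own files, hyperfine can write its measurements to a file with --export-json, and the speedup can then be computed from the means. A minimal sketch, assuming the export has a top-level "results" list whose entries carry "command" and "mean" fields; the function name is illustrative and the sample data simply reuses the two means reported above:

```python
import json

def speedups(export_text, baseline_index=0):
    """Return {command: baseline_mean / mean} from hyperfine --export-json text."""
    results = json.loads(export_text)["results"]
    base = results[baseline_index]["mean"]
    return {r["command"]: base / r["mean"] for r in results}

# Sample data: the two means from the runs above, in the assumed export shape.
sample = json.dumps({"results": [
    {"command": "./ect_flto -5 1.png 2.png 3.png", "mean": 5.400},
    {"command": "./ect_flto_pgo -5 1.png 2.png 3.png", "mean": 4.481},
]})

for command, factor in speedups(sample).items():
    print(f"{factor:.2f}x  {command}")
```

With these two means the PGO build comes out roughly 1.2 times faster than the LTO-only build. For real measurements, replace the sample string with the contents of the file written by hyperfine.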


ghtm2 commented 2 weeks ago

For anyone who might stumble upon this: adding an x86-64 microarchitecture level can squeeze out some more performance, depending on the compression level and hardware capabilities.

Benchmark 1 = plain build
Benchmark 2 = the binary linked above
Benchmark 3 = build with LTO, PGO and the x86-64-v3 level
Benchmark 4 = build with LTO, PGO and the x86-64-v4 level

Benchmark 1: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      3.139 s ±  0.014 s    [User: 3.102 s, System: 0.031 s]
  Range (min … max):    3.120 s …  3.159 s    5 runs

Benchmark 2: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      2.911 s ±  0.013 s    [User: 2.878 s, System: 0.026 s]
  Range (min … max):    2.895 s …  2.926 s    5 runs

Benchmark 3: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      2.865 s ±  0.005 s    [User: 2.828 s, System: 0.030 s]
  Range (min … max):    2.858 s …  2.871 s    5 runs

Benchmark 4: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      2.880 s ±  0.002 s    [User: 2.843 s, System: 0.030 s]
  Range (min … max):    2.878 s …  2.882 s    5 runs

Summary
  mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 /tmp/tst; rm -rf /tmp/tst ran
    1.01 ± 0.00 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 /tmp/tst; rm -rf /tmp/tst
    1.02 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert /tmp/tst; rm -rf /tmp/tst
    1.10 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect /tmp/tst; rm -rf /tmp/tst

At default settings the difference is negligible; if that is all you use, don't bother.

Benchmark 1: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -5 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      6.439 s ±  0.096 s    [User: 6.389 s, System: 0.037 s]
  Range (min … max):    6.334 s …  6.548 s    5 runs

Benchmark 2: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -5 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      5.213 s ±  0.017 s    [User: 5.164 s, System: 0.037 s]
  Range (min … max):    5.193 s …  5.230 s    5 runs

Benchmark 3: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -5 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      4.358 s ±  0.016 s    [User: 4.307 s, System: 0.040 s]
  Range (min … max):    4.340 s …  4.379 s    5 runs

Benchmark 4: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -5 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      4.258 s ±  0.010 s    [User: 4.208 s, System: 0.040 s]
  Range (min … max):    4.251 s …  4.276 s    5 runs

Summary
  mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -5 /tmp/tst; rm -rf /tmp/tst ran
    1.02 ± 0.00 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -5 /tmp/tst; rm -rf /tmp/tst
    1.22 ± 0.00 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -5 /tmp/tst; rm -rf /tmp/tst
    1.51 ± 0.02 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -5 /tmp/tst; rm -rf /tmp/tst

At -5, the microarchitecture level does almost as much as adding PGO did.

Benchmark 1: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -9 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):     65.767 s ±  0.164 s    [User: 65.578 s, System: 0.052 s]
  Range (min … max):   65.602 s … 66.035 s    5 runs

Benchmark 2: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -9 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):     43.676 s ±  0.030 s    [User: 43.521 s, System: 0.052 s]
  Range (min … max):   43.637 s … 43.711 s    5 runs

Benchmark 3: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -9 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):     27.658 s ±  0.162 s    [User: 27.531 s, System: 0.056 s]
  Range (min … max):   27.488 s … 27.927 s    5 runs

Benchmark 4: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -9 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):     25.154 s ±  0.079 s    [User: 25.034 s, System: 0.054 s]
  Range (min … max):   25.058 s … 25.256 s    5 runs

Summary
  mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -9 /tmp/tst; rm -rf /tmp/tst ran
    1.10 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -9 /tmp/tst; rm -rf /tmp/tst
    1.74 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -9 /tmp/tst; rm -rf /tmp/tst
    2.61 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -9 /tmp/tst; rm -rf /tmp/tst

Quite a bump: at -9 it shaves at least 16 seconds off the time of the binary linked above and more than halves the time compared to the plain build.
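The three benchmark sets above can be summarized by splitting the total speedup into two steps. A small sketch using the reported means; the labels ("plain", "linked", "v3", "v4") and the function name are illustrative, and the step from the linked binary to the v3/v4 builds may also include other build differences, such as the compiler, and not only the microarchitecture level:

```python
# Mean wall-clock times in seconds, copied from the hyperfine output above.
MEANS = {
    "default": {"plain": 3.139, "linked": 2.911, "v3": 2.865, "v4": 2.880},
    "-5": {"plain": 6.439, "linked": 5.213, "v3": 4.358, "v4": 4.258},
    "-9": {"plain": 65.767, "linked": 43.676, "v3": 27.658, "v4": 25.154},
}

def gains(times):
    """Split the total speedup into plain -> linked and linked -> best of v3/v4."""
    best = min(times["v3"], times["v4"])
    return {
        "plain_to_linked": times["plain"] / times["linked"],
        "linked_to_best": times["linked"] / best,
        "total": times["plain"] / best,
        "seconds_saved_vs_linked": times["linked"] - best,
    }

for level, times in MEANS.items():
    summary = {name: round(value, 2) for name, value in gains(times).items()}
    print(level, summary)
```

The factors agree with the hyperfine summaries: the higher the compression level, the larger the share of the gain that comes from the second step.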

kkocdko commented 2 weeks ago

@ghtm2 Could you provide your binary? Back then, I tested 256-bit AVX and AVX-512 builds, but they ran even slower on my machine (AMD R5 5600U, Zen 3). If enabling AVX makes it faster, that's quite a big bump! Also, which CPU did you use in your benchmark?

kkocdko commented 2 weeks ago

@ghtm2 Hi, did you have nasm installed while building the binary?

ghtm2 commented 1 week ago

@ghtm2 Could you provide your binary? Back then, I tested 256-bit AVX and AVX-512 builds, but they ran even slower on my machine (AMD R5 5600U, Zen 3). If enabling AVX makes it faster, that's quite a big bump! Also, which CPU did you use in your benchmark?

Sure, here are the v3 and v4 binaries: ect.tar.gz. You'll need at least glibc 2.38 installed, though. The CPU used is an AMD Ryzen 7 7840U, so Zen 4.
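Binaries built for a given x86-64 microarchitecture level only run on CPUs that implement that level, so it is worth checking before trying the v3 or v4 build. A minimal sketch for Linux, assuming the flag spellings that /proc/cpuinfo uses (for example pni for SSE3, abm for LZCNT, lahf_lm for LAHF/SAHF); the function names are illustrative, and the flag sets are meant to mirror the psABI level definitions rather than any check that ect itself performs:

```python
# Flags required by each level, in /proc/cpuinfo spelling.
LEVEL_FLAGS = [
    (1, {"lm", "cmov", "cx8", "fpu", "fxsr", "mmx", "syscall", "sse2"}),
    (2, {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}),
    (3, {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}),
    (4, {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]

def x86_64_level(flags):
    """Return the highest level whose flags, and all lower levels' flags, are present."""
    available = set(flags)
    level = 0
    for number, required in LEVEL_FLAGS:
        if not required <= available:
            break
        level = number
    return level

def cpu_flags(path="/proc/cpuinfo"):
    """Return the flag list of the first CPU, or [] if the file is unavailable."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return line.split(":", 1)[1].split()
    except OSError:
        pass
    return []

if __name__ == "__main__":
    # Level 0 means the flags could not be read or the baseline is not met.
    print(f"x86-64-v{x86_64_level(cpu_flags())}")
```

A result of 3 means the v3 binary should start but the v4 one will not; a result of 4 allows both.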

@ghtm2 Hi, did you have nasm installed while building the binary?

Yes.

kkocdko commented 1 week ago

@ghtm2 Awesome! Your binary is much faster. How did you do that? I appended -march=x86-64-v3 -mavx2 here, but the result is even slower: my benchmark went from 48s to 1m27s, while your ect_v3 binary takes 26s.

https://github.com/fhanau/Efficient-Compression-Tool/blob/9aabc23d73899ae55c1de292592fed6eb6217f66/src/CMakeLists.txt#L110-L114

Here is my whole build script. I built with llvm-19; did you use GCC?

https://github.com/clevert-app/clevert/blob/main/.github/workflows/asset_zcodecs.yml#L171

I really, really want to replicate your success.

kkocdko commented 1 week ago

I ran objdump on your binary. Was it built with GCC 14.2.1?

kkocdko commented 1 week ago

I reproduced your benchmark. It's faster using GCC instead of Clang. I will try to tweak it more. Thank you!

ghtm2 commented 1 week ago

Sorry for the glacial response times, I'm quite busy at the moment.

Yes, I built it with GCC 14.2.1, as that is what currently ships on Arch. I can also confirm that Clang produces noticeably slower ect binaries, no matter the flags.

I've made a small howto for reproducing the build on Arch and derivatives: howto.tar.gz

I'm pretty sure that there is still some performance to be had with the appropriate flags and better input for PGO. One might also want to try optimizing further with BOLT, but I currently don't have the time to try.
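For readers without the howto at hand, the general shape of a two-pass GCC build with LTO, PGO and a microarchitecture level can be sketched as follows. This is not the content of howto.tar.gz. It only prints command lines built from standard GCC and CMake options (-flto, -march, -fprofile-generate, -fprofile-use, CMAKE_C_FLAGS, CMAKE_CXX_FLAGS). The build directory, the profile directory, the build/ect output path and the training command are all hypothetical, and the source directory src is assumed from the CMakeLists.txt link earlier in the thread:

```python
import shlex

def pgo_build_commands(march="x86-64-v3", profile_dir="/tmp/ect-pgo",
                       training_cmd=("build/ect", "-9", "samples/")):
    """Return command lines for a two-pass GCC PGO + LTO build (nothing is executed)."""
    common = f"-flto -march={march}"

    def configure(extra):
        flags = f"{common} {extra}"
        return ["cmake", "-S", "src", "-B", "build",
                "-DCMAKE_BUILD_TYPE=Release",
                "-DCMAKE_C_COMPILER=gcc", "-DCMAKE_CXX_COMPILER=g++",
                f"-DCMAKE_C_FLAGS={flags}", f"-DCMAKE_CXX_FLAGS={flags}"]

    build = ["cmake", "--build", "build", "--parallel"]
    return [
        configure(f"-fprofile-generate={profile_dir}"),  # pass 1: instrumented build
        build,
        list(training_cmd),                              # run on sample files to record a profile
        configure(f"-fprofile-use={profile_dir}"),       # pass 2: rebuild using the profile
        build,
    ]

for argv in pgo_build_commands():
    print(shlex.join(argv))
```

Both passes reuse the same build directory so that the object paths recorded in the profile match those of the second pass. The training files should be copies that resemble the real workload, as in the benchmark commands above, and should be run at the compression levels that matter to you.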