Open danog opened 1 month ago
Hi,
Thanks for benchmarking this. I'll also check this locally. Are you using the default config provided by Cachy? Also, which CPU do you have?
I can only retest on a 9950X currently.
I also checked the compiled binary with bin-cpuflags-x86:
znver4:
bin-cpuflags-x86 /usr/bin/php
Format: Elf
Architecture: X86_64
Features: INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512_VBMI AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE
Warning: CPUID usage detected. The program can switch instruction sets in runtime.
v4:
Format: Elf
Architecture: X86_64
Features: INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE
Warning: CPUID usage detected. The program can switch instruction sets in runtime.
AVX512_VBMI appears to be additionally applied, according to bin-cpuflags-x86. I'm not sure, though, whether it shows all applied flags.
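To see exactly which features differ between the two builds, the two `Features:` lines above can be compared as sets. A minimal Python sketch, with the feature strings copied from the output above; `feature_diff` is a made-up helper name, and the result only reflects what bin-cpuflags-x86 reported, not necessarily every instruction set the binary uses:

```python
# Feature lists copied verbatim from the bin-cpuflags-x86 output above.
ZNVER4 = (
    "INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512_VBMI "
    "AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU "
    "FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE"
)
V4 = (
    "INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 "
    "AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU "
    "FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE"
)


def feature_diff(a: str, b: str) -> tuple[set[str], set[str]]:
    """Return (features only in a, features only in b)."""
    sa, sb = set(a.split()), set(b.split())
    return sa - sb, sb - sa


only_znver4, only_v4 = feature_diff(ZNVER4, V4)
print("only in znver4:", sorted(only_znver4))  # ['AVX512_VBMI']
print("only in v4:", sorted(only_v4))          # []
```

For these two outputs the only difference is AVX512_VBMI on the znver4 side.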
Well, what matters for us is whether LTO really introduces a regression with our php PKGBUILD.
The znver4 vs v4 difference could be within the margin of error.
Sure, LTO is the real regression, and the margin between znver4 and v4 is small, but it is still significant (and reproducible). I'll publish the scripts and config used for the benchmarks in the coming days. In the meantime: I tested on a Ryzen 9 7950X.
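One rough way to tell "small but reproducible" apart from "within the margin of error" is to compare the gap between the mean timings of two builds against the run-to-run spread. The sketch below is only an illustration: `differs_beyond_noise` and the threshold `k` are hypothetical, the timings are placeholders rather than results from these benchmarks, and a proper analysis would use a real statistical test.

```python
from statistics import mean, stdev


def differs_beyond_noise(a, b, k=2.0):
    """Crude check: is the gap between the means larger than
    k times the larger of the two sample standard deviations?"""
    gap = abs(mean(a) - mean(b))
    noise = max(stdev(a), stdev(b))
    return gap > k * noise


# Placeholder timings in seconds, NOT measurements from this thread.
runs_a = [10.0, 10.1, 9.9, 10.0]
runs_b = [10.5, 10.6, 10.4, 10.5]

print(differs_beyond_noise(runs_a, runs_b))
```

Repeating each benchmark several times per build is what makes this kind of comparison possible at all; a single run per configuration cannot separate a real difference from noise.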
https://github.com/CachyOS/CachyOS-PKGBUILDS/commit/206cdf0a226856266b4e82910edae1a2a2f1bd26
I also verified the LTO regression and disabled LTO for now, as Arch Linux does.
@ptr1337 I've published the set of scripts used to make the benchmarks: https://github.com/nicelocal/microarch-benchmarks
From https://gitlab.archlinux.org/archlinux/packaging/packages/php/-/merge_requests/3: as can be seen by the benchmarks, the new znver4 repos actually have worse performance than the x86-64-v4 repos (both OOTB with packages from the repo, and when self-building php with or without LTO).
This seems quite strange to me, as I've looked through GCC's source code, specifically the flag selection logic for the various arches, and I've verified znver4 is a strict superset of x86-64-v4:
x86-64-v4:
znver4:
And same goes for the processor info flags:
{"x86-64-v4", PROCESSOR_K8, CPU_GENERIC, PTA_X86_64_V4 | PTA_NO_TUNE, 0, P_NONE}
{"znver4", PROCESSOR_ZNVER4, CPU_ZNVER4, PTA_ZNVER4, M_CPU_SUBTYPE (AMDFAM19H_ZNVER4), P_PROC_AVX512F}
So I can't explain the weird performance hit of znver4...
Note that all tests were fully automated using Docker. In fact, the exact same Dockerfile was used, switching out just the architecture in makepkg.conf and in the repos (and re-installing all packages after doing that).
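As an illustration of the "switch out just the architecture" step, here is a minimal Python sketch that rewrites the `-march=` value in makepkg.conf-style text. The `set_march` helper and the sample `CFLAGS` line are hypothetical; the published scripts may do this differently, and switching the repos is not covered here.

```python
import re


def set_march(conf_text: str, arch: str) -> str:
    """Replace every -march=<value> token in makepkg.conf-style text."""
    # Stop at whitespace or a quote so a trailing '"' is left intact.
    return re.sub(r"-march=[^\s\"']+", f"-march={arch}", conf_text)


# Hypothetical CFLAGS line, for illustration only.
sample = 'CFLAGS="-march=x86-64-v4 -O3 -pipe"'
print(set_march(sample, "znver4"))  # CFLAGS="-march=znver4 -O3 -pipe"
```

Keeping everything else in the file identical is the point: the only variable between the two builds should be the target architecture.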