CachyOS / CachyOS-PKGBUILDS

PKGBUILDs for CachyOS
https://cachyos.org

Weird znver4 performance hit compared to x86-64-v4 #359

Open danog opened 1 month ago

danog commented 1 month ago

From https://gitlab.archlinux.org/archlinux/packaging/packages/php/-/merge_requests/3: as the benchmarks there show, the new znver4 repos actually perform worse than the x86-64-v4 repos (both out of the box with packages from the repo, and when self-building php with or without LTO).

This seems quite strange to me, as I've looked through GCC's source code, specifically the flag selection logic for the various arches, and I've verified znver4 is a strict superset of x86-64-v4:

x86-64-v4:

PTA_64BIT | PTA_MMX | PTA_SSE
  | PTA_SSE2 | PTA_FXSR
  | PTA_CX16 | PTA_POPCNT | PTA_SSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_SSSE3
  | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2 | PTA_F16C | PTA_FMA | PTA_LZCNT
  | PTA_MOVBE | PTA_XSAVE
  | PTA_AVX512F | PTA_AVX512BW | PTA_AVX512CD | PTA_AVX512DQ | PTA_AVX512VL

znver4:

PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2
  | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
  | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2
  | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT
  | PTA_FSGSBASE | PTA_RDRND | PTA_MOVBE | PTA_MWAITX | PTA_ADX | PTA_RDSEED
  | PTA_CLZERO | PTA_CLFLUSHOPT | PTA_XSAVEC | PTA_XSAVES | PTA_SHA | PTA_LZCNT
  | PTA_POPCNT| PTA_CLWB | PTA_RDPID
  | PTA_WBNOINVD | PTA_VAES | PTA_VPCLMULQDQ
  | PTA_PKU | PTA_ZNVER3 | PTA_AVX512F | PTA_AVX512DQ
  | PTA_AVX512IFMA | PTA_AVX512CD | PTA_AVX512BW | PTA_AVX512VL
  | PTA_AVX512BF16 | PTA_AVX512VBMI | PTA_AVX512VBMI2 | PTA_GFNI
  | PTA_AVX512VNNI | PTA_AVX512BITALG | PTA_AVX512VPOPCNTDQ | PTA_EVEX512
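
The superset claim can be checked mechanically from the two lists as quoted above. This is only a sketch over the text pasted in this issue, not over GCC's headers; note that the znver4 list still contains the `PTA_ZNVER3` token, which is treated here as an opaque name rather than expanded.

```python
import re

# Flag lists copied verbatim from the issue text above.
V4_TEXT = """
PTA_64BIT | PTA_MMX | PTA_SSE
  | PTA_SSE2 | PTA_FXSR
  | PTA_CX16 | PTA_POPCNT | PTA_SSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_SSSE3
  | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2 | PTA_F16C | PTA_FMA | PTA_LZCNT
  | PTA_MOVBE | PTA_XSAVE
  | PTA_AVX512F | PTA_AVX512BW | PTA_AVX512CD | PTA_AVX512DQ | PTA_AVX512VL
"""

ZNVER4_TEXT = """
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2
  | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
  | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2
  | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT
  | PTA_FSGSBASE | PTA_RDRND | PTA_MOVBE | PTA_MWAITX | PTA_ADX | PTA_RDSEED
  | PTA_CLZERO | PTA_CLFLUSHOPT | PTA_XSAVEC | PTA_XSAVES | PTA_SHA | PTA_LZCNT
  | PTA_POPCNT| PTA_CLWB | PTA_RDPID
  | PTA_WBNOINVD | PTA_VAES | PTA_VPCLMULQDQ
  | PTA_PKU | PTA_ZNVER3 | PTA_AVX512F | PTA_AVX512DQ
  | PTA_AVX512IFMA | PTA_AVX512CD | PTA_AVX512BW | PTA_AVX512VL
  | PTA_AVX512BF16 | PTA_AVX512VBMI | PTA_AVX512VBMI2 | PTA_GFNI
  | PTA_AVX512VNNI | PTA_AVX512BITALG | PTA_AVX512VPOPCNTDQ | PTA_EVEX512
"""


def parse_flags(text):
    """Extract every PTA_* token from an OR-ed flag expression."""
    return set(re.findall(r"PTA_\w+", text))


V4_FLAGS = parse_flags(V4_TEXT)
ZNVER4_FLAGS = parse_flags(ZNVER4_TEXT)

# Flags required by x86-64-v4 that znver4 lacks (empty if znver4 is a superset).
missing_in_znver4 = V4_FLAGS - ZNVER4_FLAGS
# Flags that only znver4 enables.
extra_in_znver4 = ZNVER4_FLAGS - V4_FLAGS

print("missing in znver4:", sorted(missing_in_znver4))
print("extra in znver4:", sorted(extra_in_znver4))
```

This only compares ISA feature bits; it says nothing about tuning or cost-model differences between the two targets.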

And same goes for the processor info flags:

{"x86-64-v4", PROCESSOR_K8, CPU_GENERIC, PTA_X86_64_V4 | PTA_NO_TUNE, 0, P_NONE}

{"znver4", PROCESSOR_ZNVER4, CPU_ZNVER4, PTA_ZNVER4, M_CPU_SUBTYPE (AMDFAM19H_ZNVER4), P_PROC_AVX512F}

So I can't explain the weird performance hit of znver4...
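
The two table rows do differ outside the ISA bits: the x86-64-v4 entry uses `PROCESSOR_K8` / `CPU_GENERIC` with `PTA_NO_TUNE`, while the znver4 entry uses `PROCESSOR_ZNVER4` / `CPU_ZNVER4`. One way to see what the installed compiler actually resolves for each target is to diff the output of `gcc -Q --help=target`. The sketch below is an assumption-laden helper, not part of the benchmark scripts: the function names are made up for illustration, and the parser assumes option lines of the form `-moption <value>`.

```python
import subprocess


def parse_target_options(text):
    """Parse `gcc -Q --help=target` output into {option: value}."""
    opts = {}
    for line in text.splitlines():
        parts = line.split()
        # Option lines look like "-mavx2  [enabled]" or "-mtune=  generic".
        if len(parts) >= 2 and parts[0].startswith("-m"):
            opts[parts[0]] = " ".join(parts[1:])
    return opts


def target_options(march, gcc="gcc"):
    """Run the compiler and return its resolved target options for -march=<march>."""
    result = subprocess.run(
        [gcc, f"-march={march}", "-Q", "--help=target"],
        capture_output=True, text=True, check=True,
    )
    return parse_target_options(result.stdout)


def diff_options(a, b):
    """Return {option: (value_in_a, value_in_b)} for every option that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Example use (requires a gcc that knows both targets):
#   for opt, (v4, zn4) in diff_options(target_options("x86-64-v4"),
#                                      target_options("znver4")).items():
#       print(f"{opt:32} {v4!s:20} {zn4!s:20}")
```

Any row that differs besides the extra feature flags (for example `-mtune=` or vector-width preferences) would be a candidate explanation worth benchmarking in isolation.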

Note that all tests were fully automated using Docker. The exact same Dockerfile was used for every run; only the architecture in makepkg.conf and in the repos was switched (with all packages re-installed afterwards).
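
As an illustration of the "switch only the architecture" step, the sketch below rewrites the `-march=` value inside makepkg.conf-style text. It is not the script used for these benchmarks; `set_march` is a hypothetical helper, and it deliberately leaves every other flag untouched.

```python
import re


def set_march(conf_text, arch):
    """Replace the value of every -march=... flag in makepkg.conf-style text."""
    # Stop at whitespace or a quote so a closing quote after the flag survives.
    return re.sub(r"-march=[^\s\"']+", f"-march={arch}", conf_text)


# Example use, assuming the file lives at the usual location:
#   with open("/etc/makepkg.conf") as fh:
#       text = fh.read()
#   with open("/etc/makepkg.conf", "w") as fh:
#       fh.write(set_march(text, "znver4"))
```

If the config also carries an explicit `-mtune=` flag, that flag stays as it was, which would affect what the comparison measures.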

ptr1337 commented 1 month ago

Hi,

Thanks for benchmarking this. I would also check this locally. Do you use the default provided config from Cachy? Also, which CPU do you have?

I can only retest on a 9950X currently.

I also checked the compiled binary with bin-cpuflags-x86:

znver4:

bin-cpuflags-x86 /usr/bin/php
Format: Elf
Architecture: X86_64
Features: INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512_VBMI AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE 
Warning: CPUID usage detected. The program can switch instruction sets in runtime.

v4:

Format: Elf
Architecture: X86_64
Features: INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU FPU287 LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE 
Warning: CPUID usage detected. The program can switch instruction sets in runtime.

According to bin-cpuflags-x86, AVX512_VBMI is additionally used in the znver4 build. I'm not sure, though, whether it shows all applied flags.
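
The difference between the two `Features:` lines above can be confirmed with a plain set comparison. This is a small sketch over the output pasted in this comment; it only reflects what bin-cpuflags-x86 reported, not everything the compiler was allowed to emit.

```python
# Feature lines copied from the bin-cpuflags-x86 output above.
ZNVER4_FEATURES = (
    "INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 AVX512_VBMI "
    "AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU FPU287 "
    "LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE"
)
V4_FEATURES = (
    "INTEL8086 INTEL186 INTEL286 INTEL386 INTEL486 X64 AVX AVX2 "
    "AVX512BW AVX512DQ AVX512F AVX512VL BMI1 BMI2 CET_IBT CMOV CPUID FPU FPU287 "
    "LZCNT MOVBE MULTIBYTENOP PCLMULQDQ POPCNT SSE XSAVE"
)

znver4 = set(ZNVER4_FEATURES.split())
v4 = set(V4_FEATURES.split())

only_znver4 = znver4 - v4  # instruction sets seen only in the znver4 binary
only_v4 = v4 - znver4      # instruction sets seen only in the v4 binary

print("only in znver4 build:", sorted(only_znver4))
print("only in v4 build:", sorted(only_v4))
```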

vnepogodin commented 1 month ago

Well, what matters for us is whether LTO really introduces a regression with our php PKGBUILD.

The znver4 vs v4 difference could be within the margin of error.

danog commented 1 month ago

Sure, LTO is the real regression, and the margin between znver4 and v4 is small, but it is still significant (and reproducible). I'll publish the scripts and config used for the benchmarks in the coming days. In the meantime: I tested on a Ryzen 9 7950X.
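
Whether a small gap is "within the margin of error" can be judged from repeated runs. A minimal sketch, assuming each configuration was benchmarked several times and the per-run timings are available as lists: it computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. `welch_t` is a hypothetical helper, not something taken from the benchmark scripts, and turning the statistic into a p-value would need a t-distribution table or a library such as SciPy.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, degrees_of_freedom) for Welch's unequal-variance t-test."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("need at least two runs per configuration")
    # Squared standard error of each sample mean (sample variance / n).
    se_a = statistics.variance(a) / len(a)
    se_b = statistics.variance(b) / len(b)
    if se_a + se_b == 0:
        raise ValueError("both samples have zero variance")
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


# Example use with per-run timings for the two builds:
#   t, df = welch_t(v4_timings, znver4_timings)
```

A large absolute t relative to the critical value for the computed degrees of freedom suggests the difference is not just run-to-run noise.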

ptr1337 commented 1 month ago

https://github.com/CachyOS/CachyOS-PKGBUILDS/commit/206cdf0a226856266b4e82910edae1a2a2f1bd26

I've also verified the LTO regression and disabled LTO for now, as Arch Linux does.

danog commented 1 month ago

@ptr1337 I've published the set of scripts used to make the benchmarks: https://github.com/nicelocal/microarch-benchmarks