ermig1979 / Simd

C++ image processing and machine learning library using SIMD: SSE, AVX, AVX-512, AMX for x86/x64, VMX(Altivec) and VSX(Power7) for PowerPC, NEON for ARM.
http://ermig1979.github.io/Simd
MIT License

No neon instructions used after compiling for ARM #270

Closed TonyCongqianWang closed 6 months ago

TonyCongqianWang commented 6 months ago

I was wondering how to enable NEON instructions, since there is no SIMD_NEON option for the CMake build. I cloned the current repo (as of 2024-04-02) and built it on my Raspberry Pi with cmake and make:

cmake ../prj/cmake -DSIMD_TOOLCHAIN="" -DSIMD_TARGET="" && make

The output was:

Simd Library:
Build type: 'Release'
Target: aarch64
Library type: STATIC
Toolchain: /usr/bin/c++
Compiler ID: GNU
Compiler Version: 12.2.0
Test framework: ON
Performance statistic: OFF
Synet: ON
Debug INT8: OFF
Hide internal: OFF
AMX emulation: OFF
Runtime algorithm choise: ON
OpenCV tests: OFF
Install target: ON
Uninstall target: ON
Python wrapper: ON
Binutils Version: 2.40
Extract project version:
Last project version '6.1.136.master-70225b50' is equal to current version '6.1.136.master-70225b50'.

I checked whether any NEON instructions were used with:

objdump -d libSimd.a > simd.asm
awk '/[ \t](vmov|vld|vst|vadd|vsub|vmul|vdiv|vceq|vcge|vcgt|vbsl|vrecpe|vrsqrte|vneg|vabs|vext|vtbl|vtrn|vld1|vst1)[ \t]/' simd.asm

but none were found! When I just searched for "Neon" there were lots of matches, but when I looked at one Neon function, it was the following:

0000000000000000 <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm>:
   0:   927cecc8        and     x8, x6, #0xfffffffffffffff0
   4:   f2400c1f        tst     x0, #0xf
   8:   54000420        b.eq    8c <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm+0x8c>  // b.none
   c:   b40003e7        cbz     x7, 88 <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm+0x88>
  10:   92400ccd        and     x13, x6, #0xf
  14:   d10040cc        sub     x12, x6, #0x10
  18:   9105c0c6        add     x6, x6, #0x170
  1c:   d280000b        mov     x11, #0x0                       // #0
  20:   9106000a        add     x10, x0, #0x180
  24:   91060049        add     x9, x2, #0x180
  28:   d2800005        mov     x5, #0x0                        // #0
  2c:   b4000148        cbz     x8, 54 <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm+0x54>
  30:   3ce56800        ldr     q0, [x0, x5]
  34:   3ce56841        ldr     q1, [x2, x5]
  38:   f8a56940        prfm    pldl1keep, [x10, x5]
  3c:   f8a56920        prfm    pldl1keep, [x9, x5]
  40:   6e217400        uabd    v0.16b, v0.16b, v1.16b
  44:   3ca56880        str     q0, [x4, x5]
  48:   910040a5        add     x5, x5, #0x10
  4c:   eb05011f        cmp     x8, x5
  50:   54ffff08        b.hi    30 <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm+0x30>  // b.pmore
  54:   b40000ed        cbz     x13, 70 <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm+0x70>
  58:   3cec6800        ldr     q0, [x0, x12]
  5c:   3cec6841        ldr     q1, [x2, x12]
  60:   f8a66800        prfm    pldl1keep, [x0, x6]
  64:   f8a66840        prfm    pldl1keep, [x2, x6]
  68:   6e217400        uabd    v0.16b, v0.16b, v1.16b
  6c:   3cac6880        str     q0, [x4, x12]
  70:   9100056b        add     x11, x11, #0x1
  74:   8b010000        add     x0, x0, x1
  78:   8b030042        add     x2, x2, x3
  7c:   8b030084        add     x4, x4, x3
  80:   eb0b00ff        cmp     x7, x11
  84:   54fffce1        b.ne    20 <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm+0x20>  // b.any
  88:   d65f03c0        ret
  8c:   f2400c5f        tst     x2, #0xf
  90:   54fffbe1        b.ne    c <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm+0xc>  // b.any
  94:   aa030025        orr     x5, x1, x3
  98:   f2400cbf        tst     x5, #0xf
  9c:   54fffb81        b.ne    c <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm+0xc>  // b.any
  a0:   b4ffff47        cbz     x7, 88 <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm+0x88>
  a4:   92400cce        and     x14, x6, #0xf
  a8:   d10040cc        sub     x12, x6, #0x10
  ac:   9105c0cd        add     x13, x6, #0x170
  b0:   d280000b        mov     x11, #0x0                       // #0
  b4:   9106000a        add     x10, x0, #0x180
  b8:   91060049        add     x9, x2, #0x180
  bc:   d2800005        mov     x5, #0x0                        // #0
  c0:   f1003cdf        cmp     x6, #0xf
  c4:   54000149        b.ls    ec <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm+0xec>  // b.plast
  c8:   3ce068a0        ldr     q0, [x5, x0]
  cc:   3ce268a1        ldr     q1, [x5, x2]
  d0:   f8a56940        prfm    pldl1keep, [x10, x5]
  d4:   f8a56920        prfm    pldl1keep, [x9, x5]
  d8:   6e217400        uabd    v0.16b, v0.16b, v1.16b
  dc:   3ca56880        str     q0, [x4, x5]
  e0:   910040a5        add     x5, x5, #0x10
  e4:   eb05011f        cmp     x8, x5
  e8:   54ffff08        b.hi    c8 <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm+0xc8>  // b.pmore
  ec:   b40000ee        cbz     x14, 108 <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm+0x108>
  f0:   3cec6800        ldr     q0, [x0, x12]
  f4:   3cec6841        ldr     q1, [x2, x12]
  f8:   f8ad6800        prfm    pldl1keep, [x0, x13]
  fc:   f8ad6840        prfm    pldl1keep, [x2, x13]
 100:   6e217400        uabd    v0.16b, v0.16b, v1.16b
 104:   3cac6880        str     q0, [x4, x12]
 108:   9100056b        add     x11, x11, #0x1
 10c:   8b010000        add     x0, x0, x1
 110:   8b030042        add     x2, x2, x3
 114:   8b030084        add     x4, x4, x3
 118:   eb0b00ff        cmp     x7, x11
 11c:   54fffcc1        b.ne    b4 <_ZN4Simd4Neon13AbsDifferenceEPKhmS2_mPhmmm+0xb4>  // b.any
 120:   d65f03c0        ret

It did use the SIMD instruction uabd, but none of the NEON instructions I searched for. Did I make a mistake during compilation, or are there actually no NEON instructions in the current build?

TonyCongqianWang commented 6 months ago

It seems that I had an outdated list of NEON instructions (ARMv7 only). When searching with awk '/[ \t](addv|fsqrt|fmulx|fdiv|saddlv|uaddlv)[ \t]/' simd.asm, I found NEON instructions.
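
For anyone hitting the same confusion: on AArch64 the NEON (Advanced SIMD) instructions do not carry the `v` prefix used on ARMv7, so a mnemonic list is easy to get wrong. A more robust check may be to look at the operands instead, since vector instructions reference registers such as `v0.16b` or `q0`. The following is only a rough sketch under that assumption; the function name `simd_mnemonics` and the regular expressions are my own illustration, not part of Simd or binutils, and it assumes the `objdump -d` line layout shown in the listing above.

```python
import re

# Assumption: AArch64 vector operands look like v0.16b, v3.4s, ... (register + arrangement).
VECTOR_OPERAND = re.compile(r"\bv([0-9]|[12][0-9]|3[01])\.(8b|16b|4h|8h|2s|4s|1d|2d)\b")
# Assumption: 128-bit SIMD loads/stores name a q register (ldr q0, [x0, x5]).
Q_OPERAND = re.compile(r"\bq([0-9]|[12][0-9]|3[01])\b")


def simd_mnemonics(disassembly):
    """Count mnemonics on lines whose operands touch SIMD registers.

    Expects text in the `objdump -d` layout: "offset: encoding mnemonic operands".
    """
    counts = {}
    for line in disassembly.splitlines():
        parts = line.split(":", 1)
        if len(parts) != 2:
            continue
        # Drop trailing comments and <symbol+offset> annotations before matching.
        body = parts[1].split("//", 1)[0]
        body = re.sub(r"<[^>]*>", "", body)
        fields = body.split()
        # fields[0] is the 32-bit encoding, fields[1] the mnemonic, the rest operands.
        if len(fields) < 3 or not re.fullmatch(r"[0-9a-f]{8}", fields[0]):
            continue
        mnemonic, operands = fields[1], " ".join(fields[2:])
        if VECTOR_OPERAND.search(operands) or Q_OPERAND.search(operands):
            counts[mnemonic] = counts.get(mnemonic, 0) + 1
    return counts


# Four lines copied from the AbsDifference listing above.
sample = """\
  3c:   f8a56920        prfm    pldl1keep, [x9, x5]
  40:   6e217400        uabd    v0.16b, v0.16b, v1.16b
  44:   3ca56880        str     q0, [x4, x5]
  48:   910040a5        add     x5, x5, #0x10
"""
print(simd_mnemonics(sample))  # {'uabd': 1, 'str': 1}
```

To run it on the whole library, one could read `simd.asm` into a string and pass it to the function. Scalar floating-point code that uses `s` or `d` registers is deliberately not counted here, so the result is a lower bound rather than an exact inventory.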