ARM-software / optimized-routines

Optimized implementations of various library functions for ARM architecture processors

Use vcagtq_f32 to replace vabsq_f32 and > #70

Open helloguo opened 5 months ago

helloguo commented 5 months ago

A vector greater-than comparison (>) compiles to fcmgt with clang-15 and the flags -O3 -ffast-math -fno-finite-math-only -ffp-contract=off -fno-unsafe-math-optimizations (for example, https://godbolt.org/z/9nhPhzc1n). clang-16, however, generates a different code sequence that compares each lane separately, as shown below (for example, https://godbolt.org/z/s4ba1xd74):

        mov     s2, v1.s[1]
        mov     s3, v0.s[1]
        fcmpe   s0, s1
        mov     s4, v1.s[2]
        mov     s5, v0.s[2]
        mov     s1, v1.s[3]
        mov     s0, v0.s[3]
        csetm   w8, gt
        fcmpe   s3, s2
        fmov    s2, w8
        csetm   w8, gt
        fcmpe   s5, s4
        mov     v2.s[1], w8
        csetm   w8, gt
        fcmpe   s0, s1
        mov     v2.s[2], w8
        csetm   w8, gt
        mov     v2.s[3], w8

This PR replaces vabsq_f32 followed by > with vcagtq_f32, which avoids the long code sequence.
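As a rough illustration of the substitution (not the actual diff in this PR), the sketch below puts the two forms side by side. The helper names abs_gt_two_step, abs_gt_fused, lane_abs_gt and count_abs_gt are hypothetical and only serve the example. Note that vcagtq_f32 takes the absolute value of both operands, so the two forms should only be expected to agree when the bound is non-negative. The NEON path is guarded so that the per-lane scalar model still builds on other targets.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* Original pattern: absolute value, then the vector > operator. */
static inline uint32x4_t abs_gt_two_step(float32x4_t x, float32x4_t bound)
{
    return (uint32x4_t)(vabsq_f32(x) > bound);
}

/* Replacement: a single absolute compare, testing |x| > |bound|. */
static inline uint32x4_t abs_gt_fused(float32x4_t x, float32x4_t bound)
{
    return vcagtq_f32(x, bound);
}
#endif

/* Portable per-lane model of vcagtq_f32: all-ones if |a| > |b|, else 0. */
static inline uint32_t lane_abs_gt(float a, float b)
{
    return fabsf(a) > fabsf(b) ? UINT32_MAX : 0u;
}

/* Count the lanes where |x[i]| > |bound|. On AArch64, also cross-check
   the NEON results against the scalar model. */
static int count_abs_gt(const float x[4], float bound)
{
    int count = 0;
    for (int i = 0; i < 4; i++)
        count += lane_abs_gt(x[i], bound) != 0;

#if defined(__aarch64__)
    uint32_t fused[4], two_step[4];
    float32x4_t vx = vld1q_f32(x);
    float32x4_t vb = vdupq_n_f32(bound);
    vst1q_u32(fused, abs_gt_fused(vx, vb));
    vst1q_u32(two_step, abs_gt_two_step(vx, vb));
    for (int i = 0; i < 4; i++) {
        assert(fused[i] == lane_abs_gt(x[i], bound));
        /* The two forms are only expected to match for bound >= 0. */
        if (bound >= 0.0f)
            assert(two_step[i] == fused[i]);
    }
#endif
    return count;
}
```

For example, calling count_abs_gt on the lanes {0.5f, -3.0f, 2.0f, -1.0f} with a bound of 1.0f counts the two lanes whose magnitude exceeds 1.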

Test with clang-16:

helloguo commented 5 months ago

@Wilco1 @nsz-arm @joeramsay can you take a look?