SIMD is unnormal when I compute non power of 2 size.

FFTW / fftw3

DO NOT CHECK OUT THESE FILES FROM GITHUB UNLESS YOU KNOW WHAT YOU ARE DOING. (See below.)

GNU General Public License v2.0

2.67k stars 652 forks source link

SIMD is unnormal when I compute non power of 2 size. #257

Closed pidanself closed 2 years ago

pidanself commented 2 years ago

I'm trying to port fftw to fit RISC SIMD instructions.

When I use tests/bench to test 18225 1D FFT by using one thread, it won't use simd codelets but only use scalar codelets. But simd is normal for size of 12000.

The situation puzzles me. I want to use simd codelets to compute 18225 1D FFT and look for why simd is unnormal.

How do I force tests/bench to only use simd codelets? One way which I can think of is that delete all scalar codelets, but it's too violent.

Lqlsoftware commented 2 years ago

You can try fftw-wisdom and fftw-wisdom for generating pure C code of a specific plan of execution.

stevengj commented 2 years ago

Works for me with AVX2 SIMD:

stevenj@macbook-pro-123 fftw3 % tests/bench -v2 -o patient 18225
planner time: 4.55153 s
(dft-ct-dit/9
  (dftw-direct-9/32 "t1fuv_9_avx")
  (dft-ct-dit/9
    (dftw-direct-9/32-x9 "t1fuv_9_avx")
    (dft-vrank>=1-x9/1
      (dft-ct-dit/15
        (dftw-direct-15/28-x9 "t1fv_15_avx2_128")
        (dft-vrank>=1-x9/-1
          (dft-direct-15-x15 "n1fv_15_avx2"))))))
flops: 207818 add, 106142 mul, 57268 fma
estimated cost: 428527.415900, pcost = 323015.000000
Problem: 18225, setup: 4.55 s, time: 126.21 us, ``mflops'': 10219.001
Took 8 measurements for at least 10.00 ms each.
Time: min 126.21 us, max 151.33 us, avg 137.93 us, median 138.53 us

(The t1f*v and n1f*v codelets are SIMD.)

pidanself commented 2 years ago

Works for me with AVX2 SIMD:

stevenj@macbook-pro-123 fftw3 % tests/bench -v2 -o patient 18225
planner time: 4.55153 s
(dft-ct-dit/9
  (dftw-direct-9/32 "t1fuv_9_avx")
  (dft-ct-dit/9
    (dftw-direct-9/32-x9 "t1fuv_9_avx")
    (dft-vrank>=1-x9/1
      (dft-ct-dit/15
        (dftw-direct-15/28-x9 "t1fv_15_avx2_128")
        (dft-vrank>=1-x9/-1
          (dft-direct-15-x15 "n1fv_15_avx2"))))))
flops: 207818 add, 106142 mul, 57268 fma
estimated cost: 428527.415900, pcost = 323015.000000
Problem: 18225, setup: 4.55 s, time: 126.21 us, ``mflops'': 10219.001
Took 8 measurements for at least 10.00 ms each.
Time: min 126.21 us, max 151.33 us, avg 137.93 us, median 138.53 us

(The t1f*v and n1f*v codelets are SIMD.)

what t1fuv_* means? Its input data is unaligned? Or its stride could cause unaligned? :)

pidanself commented 2 years ago

Works for me with AVX2 SIMD:

stevenj@macbook-pro-123 fftw3 % tests/bench -v2 -o patient 18225
planner time: 4.55153 s
(dft-ct-dit/9
  (dftw-direct-9/32 "t1fuv_9_avx")
  (dft-ct-dit/9
    (dftw-direct-9/32-x9 "t1fuv_9_avx")
    (dft-vrank>=1-x9/1
      (dft-ct-dit/15
        (dftw-direct-15/28-x9 "t1fv_15_avx2_128")
        (dft-vrank>=1-x9/-1
          (dft-direct-15-x15 "n1fv_15_avx2"))))))
flops: 207818 add, 106142 mul, 57268 fma
estimated cost: 428527.415900, pcost = 323015.000000
Problem: 18225, setup: 4.55 s, time: 126.21 us, ``mflops'': 10219.001
Took 8 measurements for at least 10.00 ms each.
Time: min 126.21 us, max 151.33 us, avg 137.93 us, median 138.53 us

(The t1f*v and n1f*v codelets are SIMD.)

what t1fuv_* means? Its input data is unaligned? Or its stride could cause unaligned? :)

Does it mean that if we want to compute 18225 by pure simd-codelets, we must deal some data that are unaligned? If so, fftw will select pure scalar-codelets to compute 18225 in some machines that deal unaligned very slowly.

stevengj commented 2 years ago

what t1fuv_* means?

It's not unaligned, but it sometimes has lower alignment requirements than the t1fv* codelets.

The "u" refers to an alternative storage scheme for the "twiddle factors" ωₖ = exp(2πink/N) used in the Cooley–Tukey steps. In particular, the only difference from the t1fv codelets is that they include t1fu.h rather than t1f.h. In particular, the alignment checks for t1fu.h codelets are implemented by t_okp_t1fu in dft/common/genus.c which calls t_okp_commonu, which calls SIMD_STRIDE_OK on the strides. Whereas the t1f.h codelets use t_okp_t1f, which calls t_okp_common, which calls SIMD_STRIDE_OKA. These SIMD_STRIDE_OK and SIMD_STRIDE_OKA checks, in turn, are implemented in simd-common.h, and the latter imposes a more stringent alignment check for some architectures and precisions.

pidanself commented 2 years ago

what t1fuv_* means?

It's not unaligned, but it sometimes has lower alignment requirements than the t1fv* codelets.

The "u" refers to an alternative storage scheme for the "twiddle factors" ωₖ = exp(2πink/N) used in the Cooley–Tukey steps. In particular, the only difference from the t1fv codelets is that they include t1fu.h rather than t1f.h. In particular, the alignment checks for t1fu.h codelets are implemented by t_okp_t1fu in dft/common/genus.c which calls t_okp_commonu, which calls SIMD_STRIDE_OK on the strides. Whereas the t1f.h codelets use t_okp_t1f, which calls t_okp_common, which calls SIMD_STRIDE_OKA. These SIMD_STRIDE_OK and SIMD_STRIDE_OKA checks, in turn, are implemented in simd-common.h, and the latter imposes a more stringent alignment check for some architectures and precisions.

I learned a lot about FFTW from your commets. Thanks so much!

pidanself commented 2 years ago

Works for me with AVX2 SIMD:

stevenj@macbook-pro-123 fftw3 % tests/bench -v2 -o patient 18225
planner time: 4.55153 s
(dft-ct-dit/9
  (dftw-direct-9/32 "t1fuv_9_avx")
  (dft-ct-dit/9
    (dftw-direct-9/32-x9 "t1fuv_9_avx")
    (dft-vrank>=1-x9/1
      (dft-ct-dit/15
        (dftw-direct-15/28-x9 "t1fv_15_avx2_128")
        (dft-vrank>=1-x9/-1
          (dft-direct-15-x15 "n1fv_15_avx2"))))))
flops: 207818 add, 106142 mul, 57268 fma
estimated cost: 428527.415900, pcost = 323015.000000
Problem: 18225, setup: 4.55 s, time: 126.21 us, ``mflops'': 10219.001
Took 8 measurements for at least 10.00 ms each.
Time: min 126.21 us, max 151.33 us, avg 137.93 us, median 138.53 us

(The t1f*v and n1f*v codelets are SIMD.)

If I use generic256 and set ALIGNMENT as 32 in simd-common.h, fftw will compute 18225 by scalar only. If ALIGNMENT is set 16, fftw will compute 18225 by simd and get better performance. Meanwhile, above computation is about double precision. How to explain this situation? Thank you very much for your patience.

(base) ➜ tests ./bench -r 1 -v2 -o patient 18225 planner time: 2.1084 s (dft-ct-dit/15 (dftw-direct-15/28 "t1_15") (dft-ct-dit/9 (dftw-direct-9/16-x15 "t1_9") (dft-vrank>=1-x9/1 (dft-ct-dit/9 (dftw-direct-9/16-x15 "t1_9") (dft-vrank>=1-x9/1 (dft-direct-15-x15 "n1_15")))))) flops: 554040 add, 247860 mul, 247860 fma estimated cost: 1297651.415900, pcost = 490152.000000 Problem: 18225, setup: 2.11 s, time: 220.23 us, ``mflops'': 5856.2595 Took 1 measurements for at least 10.00 ms each. Time: min 220.23 us, max 220.23 us, avg 220.23 us, median 220.23 us

stevengj commented 2 years ago

If I use generic256 and set ALIGNMENT as 32 in simd-common.h, fftw will compute 18225 by scalar only.

Because 18225 has all odd factors, the subtransforms have odd strides. e.g. if you do a radix-9 Cooley–Tukey step, it does 9 sub-transforms of size N/9=2025, each of which operates on inputs in[i*9] (every 9th element). Hence, even if in is 32-byte aligned, the 16-byte complex number in[i*9] is generally only 16-byte aligned.