FFTW / fftw3

DO NOT CHECK OUT THESE FILES FROM GITHUB UNLESS YOU KNOW WHAT YOU ARE DOING. (See below.)
GNU General Public License v2.0

Question about simd code #278

Closed sh-zheng closed 2 years ago

sh-zheng commented 2 years ago

Dear developers,

In the SIMD header files, such as simd-generic128.h and simd-generic256.h, STM2, STN2, STM4 and STN4 are declared. What operations do they perform in FFTW? I found that some of them lack implementations, such as STN2 and STM4 in simd-generic128.h, and STM2 and STM4 in simd-generic256.h. Does their absence cause problems? When I implement SIMD support for a new architecture, which of them do I need to implement?

Lqlsoftware commented 2 years ago

Storing a SIMD vector with stride "ovs" (the stride of the SIMD store) in codelets may perform poorly on some platforms. There is an optimization for the case where "os" (the stride of the output data) is 2: STN2/STN4 combine 2/4 store operations by shuffling the vectors so that they can be written as 2/4 contiguous (ovs=2) SIMD vector stores. STM2/STM4 are the counterparts of STN2/STN4: they do the same thing as the original store (ST).

The example in avx2.h:

#define STM2 ST
#define STN2(x, v0, v1, ovs) /* nop */

#define STM4(x, v, ovs, aligned_like) /* no-op */
#define STN4(x, v0, v1, v2, v3, ovs) \
    { \
        V xxx0, xxx1, xxx2, xxx3; \
        xxx0 = _mm256_unpacklo_pd(v0, v1); \
        xxx1 = _mm256_unpackhi_pd(v0, v1); \
        xxx2 = _mm256_unpacklo_pd(v2, v3); \
        xxx3 = _mm256_unpackhi_pd(v2, v3); \
        STA(x, _mm256_permute2f128_pd(xxx0, xxx2, 0x20), 0, 0);  \
        STA(x + ovs, _mm256_permute2f128_pd(xxx1, xxx3, 0x20), 0, 0); \
        STA(x + 2 * ovs, _mm256_permute2f128_pd(xxx0, xxx2, 0x31), 0, 0); \
        STA(x + 3 * ovs, _mm256_permute2f128_pd(xxx1, xxx3, 0x31), 0, 0); \
    }
#endif

So for each pair you only need to implement one of the two, e.g. one of STN2/STM2; the other can be defined as a no-op.
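
To make the "implement one, leave the other empty" idea concrete, here is a minimal sketch in plain C. It is a toy model, not FFTW code: the types `cpx` and `V`, the helper `st`, the functions `stm2_a`/`stn2_a`/`stm2_b`/`stn2_b`, and `codelet_tail` are all hypothetical names, and the calling pattern (one STM2 call per vector followed by a single STN2 call on the same vectors) is an assumption based on the explanation above. It only illustrates that both choices leave the same data in memory.

```c
#include <string.h>

#define VL 2  /* toy vector length: complex values per vector (assumption) */

typedef struct { float re, im; } cpx;   /* hypothetical complex element */
typedef struct { cpx lane[VL]; } V;     /* hypothetical SIMD vector */

/* Plain strided store: lane k goes to x + k*ovs (ovs counted in floats). */
static void st(float *x, V v, int ovs) {
    for (int k = 0; k < VL; ++k) {
        x[k * ovs]     = v.lane[k].re;
        x[k * ovs + 1] = v.lane[k].im;
    }
}

/* Choice A: STM2 is the ordinary store, STN2 is a no-op. */
static void stm2_a(float *x, V v, int ovs) { st(x, v, ovs); }
static void stn2_a(float *x, V v0, V v1, int ovs) {
    (void)x; (void)v0; (void)v1; (void)ovs;
}

/* Choice B: STM2 is a no-op, STN2 writes both vectors at once.
 * Each output row receives lane k of v0 followed by lane k of v1,
 * i.e. one contiguous run of four floats per row. */
static void stm2_b(float *x, V v, int ovs) { (void)x; (void)v; (void)ovs; }
static void stn2_b(float *x, V v0, V v1, int ovs) {
    for (int k = 0; k < VL; ++k) {
        float *row = x + k * ovs;
        row[0] = v0.lane[k].re; row[1] = v0.lane[k].im;
        row[2] = v1.lane[k].re; row[3] = v1.lane[k].im;
    }
}

typedef void (*stm2_fn)(float *, V, int);
typedef void (*stn2_fn)(float *, V, V, int);

/* Assumed shape of the end of a codelet: both macros are always
 * emitted, and the platform header decides which one does the work. */
static void codelet_tail(float *xo, V t0, V t1, int ovs,
                         stm2_fn stm2, stn2_fn stn2) {
    stm2(xo, t0, ovs);
    stm2(xo + 2, t1, ovs);
    stn2(xo, t0, t1, ovs);
}

/* Returns 1 if both choices leave identical bytes in the output buffer. */
int stores_agree(void) {
    enum { OVS = 8, N = OVS * VL };
    V t0 = {{{1.0f, 2.0f}, {3.0f, 4.0f}}};
    V t1 = {{{5.0f, 6.0f}, {7.0f, 8.0f}}};
    float a[N], b[N];
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    codelet_tail(a, t0, t1, OVS, stm2_a, stn2_a);
    codelet_tail(b, t0, t1, OVS, stm2_b, stn2_b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling `stores_agree()` from a small `main` compares the two variants. On a real architecture the choice between them would be made by measuring which store pattern is faster, as in the avx2.h excerpt above, where STN4 does the shuffling and STM4 is empty.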

sh-zheng commented 2 years ago

Thank you so much for your explanation. Now I understand that STN2 and STM2 always appear together in the code and accomplish the same store, so only one of them needs to be implemented, chosen according to performance.