FFTW / fftw3


Threaded wisdom failure when avx512 is enabled and plan rigor > FFTW_ESTIMATE #156

Open dfarns opened 6 years ago

dfarns commented 6 years ago

When planning for multiple threads with AVX512 enabled (e.g., skylake-avx512 or knl), fftw_export_wisdom_to_string returns only the wisdom header and footer. This occurs when FFTW_MEASURE or FFTW_PATIENT is used; FFTW_ESTIMATE returns wisdom as expected.

This behavior is not observed for single-threaded planning with AVX512, nor for threaded planning on skylake-avx512 CPUs when AVX512 is omitted from the build (e.g., configured using --enable-avx2 and --enable-openmp only).

The library build was configured with minimal options (e.g., --enable-avx512 --enable-openmp) and built with gcc 6.1.0 and 8.1.0. Adding the recommended --enable-avx2 does not help. This occurs with both the libfftw3_omp and libfftw3_threads (built with --enable-threads) libs. Built and tested on centos7 and sles12 with the same behavior.

** NOTE: The plans returned with FFTW_MEASURE and FFTW_PATIENT are the same as the plan returned with FFTW_ESTIMATE, suggesting that the planner finds no applicable wisdom (or no wisdom at all) and falls back to FFTW_ESTIMATE behavior.
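One way to make the "same plan" comparison mechanical is to compare the printed plans as text. The sketch below is only an illustration: same_plan_text is a hypothetical helper, not part of FFTW, and it assumes the two plan printouts have already been captured as strings (for example from fftw_sprint_plan, if the FFTW build provides it). It ignores all whitespace, so differences in indentation or line breaks do not count as a different plan.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not an FFTW function): returns true when two
 * printed plans are identical once all whitespace is ignored.
 * A NULL argument is treated as "not equal". */
static bool same_plan_text(const char *a, const char *b)
{
    if (a == NULL || b == NULL) {
        return false;
    }
    for (;;) {
        /* Skip any run of whitespace in each string. */
        while (isspace((unsigned char)*a)) ++a;
        while (isspace((unsigned char)*b)) ++b;
        if (*a != *b) {
            return false;   /* first non-space difference */
        }
        if (*a == '\0') {
            return true;    /* both strings ended together */
        }
        ++a;
        ++b;
    }
}
```

If the FFTW_MEASURE and FFTW_ESTIMATE printouts compare equal for every size tried, that is consistent with (though not proof of) the planner falling back to FFTW_ESTIMATE.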

Sample output for an N=1024 C2C in-place transform (using FFTW_MEASURE):

./a.omp.out -nthreads 2
Planning with 2 threads.

plan =
(dft-thr-ct-dit-x2/32
  (dftw-direct-32/32 "t3fv_32_avx512")
  (dftw-direct-32/32 "t3fv_32_avx512")
  (dft-buffered-32-x32/32-6
    (dft-thr-vrank>=1-x2/1
      (dft-direct-32-x16 "n2fv_32_avx512")
      (dft-direct-32-x16 "n2fv_32_avx512"))
    (dft-r2hc-1
      (rdft-thr-vrank>=1-x2/1
        (rdft-rank0-iter-ci/64-x16)
        (rdft-rank0-iter-ci/64-x16)))
    (dft-nop)))

wisdom =
(fftw-3.3.8 fftw_wisdom #x563d20a7 #x21166e89 #x51240a27 #x4128fc72
)
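To flag the header-only case automatically, a test harness could count the entries in the exported string. The sketch below is a rough illustration only: wisdom_entry_count is a hypothetical helper, not part of FFTW, and it assumes the textual layout seen in the output above (one outer parenthesized list whose first line is the header, with one nested parenthesized list per entry). That layout is an observation from this output, not a documented format guarantee.

```c
#include <stddef.h>

/* Hypothetical helper (not an FFTW function): counts the entries in an
 * exported wisdom string by counting opening parentheses.
 * Assumption: the outer "(fftw-... fftw_wisdom ..." list contributes one
 * '(' and each entry contributes exactly one more.
 * Returns -1 for a NULL string, 0 for header and footer only. */
static int wisdom_entry_count(const char *wisdom)
{
    int parens = 0;
    const char *p;

    if (wisdom == NULL) {
        return -1;
    }
    for (p = wisdom; *p != '\0'; ++p) {
        if (*p == '(') {
            ++parens;
        }
    }
    /* Subtract the outer list; an empty string yields 0. */
    return parens > 0 ? parens - 1 : 0;
}
```

In the reproducer, calling wisdom_entry_count(wisstring) after fftw_export_wisdom_to_string() and treating a result of 0 under FFTW_MEASURE as a failure would turn the visual check into a pass/fail test.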

Build script:

#!/bin/bash

ARCH="x86_skylake"
enable_optimizations='--enable-avx512'
GCC_VERSION="8.1.0"

BINPATH=/opt/gcc/${GCC_VERSION}/bin
CONFIG_CC=${BINPATH}/gcc
CONFIG_CPP=${BINPATH}/cpp
CONFIG_CXX=${BINPATH}/g++
CONFIG_F77=${BINPATH}/gfortran

PREFIX=$(pwd)/fftw_install/${ARCH}

make distclean 2>&1 | tee log.make-distclean.txt

./configure \
    --prefix="$PREFIX"              \
    $enable_optimizations           \
    --enable-openmp                 \
    --enable-threads                \
    CC=$CONFIG_CC                   \
    CPP=$CONFIG_CPP                 \
    CXX=$CONFIG_CXX                 \
    F77=$CONFIG_F77                 \
    2>&1 | tee log.configure.txt

make -j 16 2>&1 | tee log.make.txt
make install 2>&1 | tee log.make-install.txt

Reproducer:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fftw3.h>
#ifdef _USEOMP
#  include <omp.h>
#endif

int main(int argc, char *argv[]) {

  char *wisstring;
  const int N = 1024;
  fftw_plan plan;
  fftw_complex *z1=NULL;
  int NthreadsMax, Nthreads = 1;
  int ti, iArg;

  for ( iArg = 1; iArg < argc; iArg++ ) {
    if ( strcmp(argv[iArg],"-nthreads") == 0 && iArg < argc - 1) {
      Nthreads = (int) atoi(argv[++iArg]);
    }
  }

# ifdef _USEOMP
  NthreadsMax = omp_get_max_threads();
  if ( Nthreads > NthreadsMax ) {
    printf("%d threads requested but only %d threads available.  Setting Nthreads = %d\n", Nthreads, NthreadsMax, NthreadsMax);
    Nthreads = NthreadsMax;
  }
# endif

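  if ( Nthreads > 1 ) {
    // fftw_init_threads() returns zero on failure
    ti = fftw_init_threads();
    if ( ti == 0 ) {
      printf("fftw_init_threads failed.\n");
      return 1;
    }
    fftw_plan_with_nthreads(Nthreads);
  }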

  // Create a simple plan, then print it out
  printf("Planning with %d threads.\n", Nthreads);
  z1 = (fftw_complex*) fftw_malloc(N*sizeof(fftw_complex));
  plan = fftw_plan_dft_1d(N, z1, z1, FFTW_FORWARD, FFTW_MEASURE);
  printf("\n");
  printf("plan = \n");
  fftw_print_plan(plan);
  printf("\n");

  // Print out the wisdom, including the wisdom header
  wisstring = fftw_export_wisdom_to_string();
  printf("\n");
  if ( wisstring != NULL ) {
    printf("wisdom = \n%s\n", wisstring);
  } else {
    printf("wisstring is NULL.\n");
  }

  // Cleanup
  if ( plan != NULL ) {
    fftw_destroy_plan(plan);
  }
  if ( wisstring != NULL ) {
    free(wisstring);
  }
  if ( z1 != NULL ) {
    fftw_free(z1);
  }

  if ( Nthreads > 1 ) {
    fftw_cleanup_threads();
  } else {
    fftw_cleanup();
  }

  return 0;
}

Here is the same test on a skylake-avx512 CPU with the library configured with avx2 only and run with FFTW_MEASURE (demonstrating that it's unlikely to be a CPU issue):

./a.omp.out -nthreads 2
Planning with 2 threads.

plan =
(dft-thr-ct-dit-x2/16
  (dftw-direct-16/16 "t3fv_16_avx2")
  (dftw-direct-16/16 "t3fv_16_avx2")
  (dft-directbuf/66-64-x16 "n2fv_64_avx2"))

wisdom =
(fftw-3.3.8 fftw_wisdom #x44b8b2bb #xf6415fbb #x528d8d0a #x01cf6e58
  (fftw_codelet_n2fv_64_avx2 1 #x11bdd #x11bdd #x0 #x1922ec23 #xa3fc4eb9 #xce8a9d6e #xac6cc7f0)
  (fftw_codelet_t3fv_16_avx2 1 #x11bdd #x11bdd #x0 #xa737e14d #xd2ed5d58 #x701e154a #x4351c9d2)
)

Ideas?

dfarns commented 5 years ago

Some printf statements added to mkapiplan, mkplan, etc. suggest that the wisdom generated by the planner is considered "bogus" in mkplan() [i.e., (plnr->wisdom_state == WISDOM_IS_BOGUS) is true] and is thus rejected. This would explain the empty wisdom string returned by fftw_export_wisdom_to_string as well as the planner falling back to FFTW_ESTIMATE rigor.

Any reason why this might happen only when avx512 codelets are enabled and nthreads>1? Possible race condition / wisdom table corruption?

matteo-frigo commented 5 years ago

Probably fixed in ebde7c4e4607afb6bbba7e6609fae56ff0fda01b, can you verify?

dfarns commented 5 years ago

I can verify that the simple reproducer I posted no longer exhibits the deviant behavior. Full validation will take some time.