FFTW / fftw3


Planner created more threads than expected #252

Open hominhquan opened 3 years ago

hominhquan commented 3 years ago

Config:

$ ./configure --enable-threads CFLAGS="-O2 -g -pthread" && make

Problem: With the default benchmark/testing program ./tests/bench, I run into a case where more threads are created than requested with -onthreads=<N>: 18 threads are created instead of the 16 asked for.

$ cd tests
$ ./bench -oestimate -onthreads=16 -v1 --verify orf256x256 & pid=$!; ps -T -p $pid
[1] 8987
  PID  SPID TTY          TIME CMD
 8987  8987 pts/21   00:00:00 bench
 8987  8989 pts/21   00:00:00 bench
 8987  8990 pts/21   00:00:00 bench
 8987  8991 pts/21   00:00:00 bench
 8987  8992 pts/21   00:00:00 bench
 8987  8993 pts/21   00:00:00 bench
 8987  8994 pts/21   00:00:00 bench
 8987  8995 pts/21   00:00:00 bench
 8987  8996 pts/21   00:00:00 bench
 8987  8997 pts/21   00:00:00 bench
 8987  8998 pts/21   00:00:00 bench
 8987  8999 pts/21   00:00:00 bench
 8987  9000 pts/21   00:00:00 bench
 8987  9001 pts/21   00:00:00 bench
 8987  9002 pts/21   00:00:00 bench
 8987  9003 pts/21   00:00:00 bench
 8987  9004 pts/21   00:00:00 bench
 8987  9005 pts/21   00:00:00 bench

$ ./bench -oestimate -onthreads=16 -v1 --verify orf256x256 & pid=$!; ps -T -p $pid | grep $pid | wc -l 
[1] 9092
18

Another trace, this time with perf record:

$ perf record -s ./bench -oestimate -onthreads=16 -v1 --verify orf256x256
orf256x256 4.12719e-16 8.88178e-16 7.27368e-16
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.502 MB perf.data (12679 samples) ]

$ perf report -n -T
...
#  PID   TID  cycles:u
  9140  9148  74134939
  9140  9155  75067779
  9140  9154  73309548
  9140  9153  69819795
  9140  9147  69004074
  9140  9150  68586053
  9140  9156  68502414
  9140  9141  69636821
  9140  9144  62440256
  9140  9143  60928481
  9140  9152  63249541
  9140  9151  66337967
  9140  9145  61185208
  9140  9142  61414181
  9140  9157  61440558
  9140  9149  62655185
  9140  9146  61406279

And with strace:

$ strace -tt -ff -T -o strace-dump -- ./bench -oestimate -onthreads=16 -v1 --verify orf256x256
$ ls strace-dump* | wc -l
18
hominhquan commented 3 years ago

I observed that with orf256x256, thread 0 spawns 15 other threads in the first X(spawn_loop); then some of the new threads (I don't know why) also spawn other threads themselves. This can be checked by setting a breakpoint on fftw_spawn_loop:

$ gdb -- ./bench -oestimate -onthreads=16 -v1 --verify orf256x256
Excess command line arguments ignored. (-onthreads=16 ...)
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./bench...done.
(gdb) #r -oestimate -onthreads=16 -v1 --verify orf256x256
(gdb) b fftw_spawn_loop 
Breakpoint 1 at 0xc4c0: file threads.c, line 481.
(gdb) r -oestimate -onthreads=16 -v1 --verify orf256x256
Starting program: /work1/mqho/BU_EMB/KAF_libraries/fftw_bug/fftw-3.3.9/tests/bench -oestimate -onthreads=16 -v1 --verify orf256x256
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, fftw_spawn_loop (loopmax=16, nthr=16, proc=proc@entry=0x555555562690 <spawn_apply>, data=data@entry=0x7fffffffddc0) at threads.c:481
481 {
(gdb) i th 
  Id   Target Id         Frame 
* 1    Thread 0x7ffff7fb4b80 (LWP 20542) "bench" fftw_spawn_loop (loopmax=16, nthr=16, proc=proc@entry=0x555555562690 <spawn_apply>, data=data@entry=0x7fffffffddc0)
    at threads.c:481
(gdb) c
Continuing.
[New Thread 0x7ffff6d1f700 (LWP 20550)]
Core -134526080 created 1 threads
[New Thread 0x7ffff651e700 (LWP 20551)]
Core -134526080 created 2 threads
[New Thread 0x7ffff5d1d700 (LWP 20552)]
Core -134526080 created 3 threads
[New Thread 0x7ffff551c700 (LWP 20553)]
Core -134526080 created 4 threads
[New Thread 0x7ffff4d1b700 (LWP 20554)]
Core -134526080 created 5 threads
[New Thread 0x7ffff451a700 (LWP 20555)]
Core -134526080 created 6 threads
[New Thread 0x7ffff3d19700 (LWP 20556)]
Core -134526080 created 7 threads
[New Thread 0x7ffff3518700 (LWP 20557)]
Core -134526080 created 8 threads
[New Thread 0x7ffff2d17700 (LWP 20558)]
Core -134526080 created 9 threads
[New Thread 0x7ffff2516700 (LWP 20559)]
Core -134526080 created 10 threads
[New Thread 0x7ffff1d15700 (LWP 20560)]
Core -134526080 created 11 threads
[New Thread 0x7ffff1514700 (LWP 20561)]
Core -134526080 created 12 threads
[New Thread 0x7ffff0d13700 (LWP 20562)]
Core -134526080 created 13 threads
[New Thread 0x7ffff0512700 (LWP 20563)]
Core -134526080 created 14 threads
[New Thread 0x7fffefd11700 (LWP 20564)]
Core -134526080 created 15 threads

Thread 1 "bench" hit Breakpoint 1, fftw_spawn_loop (loopmax=3, nthr=3, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7fffffffdd50) at threads.c:481
481 {
(gdb) i th
  Id   Target Id         Frame 
* 1    Thread 0x7ffff7fb4b80 (LWP 20542) "bench" fftw_spawn_loop (loopmax=3, nthr=3, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7fffffffdd50)
    at threads.c:481
  2    Thread 0x7ffff6d1f700 (LWP 20550) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bff0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  3    Thread 0x7ffff651e700 (LWP 20551) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c0e0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  4    Thread 0x7ffff5d1d700 (LWP 20552) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c4b0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  5    Thread 0x7ffff551c700 (LWP 20553) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c250)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  6    Thread 0x7ffff4d1b700 (LWP 20554) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586be40)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  7    Thread 0x7ffff451a700 (LWP 20555) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x555555883560)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  8    Thread 0x7ffff3d19700 (LWP 20556) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x555555883500)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  9    Thread 0x7ffff3518700 (LWP 20557) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bca0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  10   Thread 0x7ffff2d17700 (LWP 20558) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x5555558a0d20)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  11   Thread 0x7ffff2516700 (LWP 20559) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c320)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  12   Thread 0x7ffff1d15700 (LWP 20560) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bea0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  13   Thread 0x7ffff1514700 (LWP 20561) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555587b920)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  14   Thread 0x7ffff0d13700 (LWP 20562) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555588fd20)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  15   Thread 0x7ffff0512700 (LWP 20563) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555588ada0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  16   Thread 0x7fffefd11700 (LWP 20564) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bc20)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
(gdb) c
Continuing.

Thread 1 "bench" hit Breakpoint 1, fftw_spawn_loop (loopmax=6, nthr=6, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7fffffffdb90) at threads.c:481
481 {
(gdb) i th 
  Id   Target Id         Frame 
* 1    Thread 0x7ffff7fb4b80 (LWP 20542) "bench" fftw_spawn_loop (loopmax=6, nthr=6, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7fffffffdb90)
    at threads.c:481
  2    Thread 0x7ffff6d1f700 (LWP 20550) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bff0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  3    Thread 0x7ffff651e700 (LWP 20551) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c0e0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  4    Thread 0x7ffff5d1d700 (LWP 20552) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c4b0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  5    Thread 0x7ffff551c700 (LWP 20553) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c250)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  6    Thread 0x7ffff4d1b700 (LWP 20554) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586be40)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  7    Thread 0x7ffff451a700 (LWP 20555) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x555555883560)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  8    Thread 0x7ffff3d19700 (LWP 20556) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x555555883500)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  9    Thread 0x7ffff3518700 (LWP 20557) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bca0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  10   Thread 0x7ffff2d17700 (LWP 20558) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x5555558a0d20)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  11   Thread 0x7ffff2516700 (LWP 20559) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c320)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  12   Thread 0x7ffff1d15700 (LWP 20560) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bea0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  13   Thread 0x7ffff1514700 (LWP 20561) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555587b920)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  14   Thread 0x7ffff0d13700 (LWP 20562) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555588fd20)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  15   Thread 0x7ffff0512700 (LWP 20563) "bench" fftw_spawn_loop (loopmax=6, nthr=6, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7ffff0511e40)
    at threads.c:481
  16   Thread 0x7fffefd11700 (LWP 20564) "bench" fftw_spawn_loop (loopmax=6, nthr=6, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7fffefd10e40)
    at threads.c:481

Here in the latest gdb log, we see that threads 15 and 16 are themselves trying to spawn other threads (2 x 6 = 12 threads => total 16 + 12 = 28 threads):

  15   Thread 0x7ffff0512700 (LWP 20563) "bench" fftw_spawn_loop (loopmax=6, nthr=6, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7ffff0511e40)
    at threads.c:481
  16   Thread 0x7fffefd11700 (LWP 20564) "bench" fftw_spawn_loop (loopmax=6, nthr=6, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7fffefd10e40)
    at threads.c:481

The point is: if the threads in the worker pool (static struct worker *worker_queue;) are still busy with the work handed out by the main thread, the later spawn_loop() calls from threads 15 and 16 will try dequeue(), find no worker available in the pool, and then create entirely new threads: https://github.com/FFTW/fftw3/blob/master/threads/threads.c#L341-L345. That may explain why profiling with perf record and strace showed up to 18 threads spawned in total.
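To make the failure mode concrete, here is a condensed, self-contained paraphrase of that grab-or-create path. The names mirror threads.c, but the semaphores, error checking and the real worker loop are elided; treat it as an illustration, not the actual FFTW source:

#include <pthread.h>
#include <stdlib.h>

/* Condensed paraphrase of the worker-pool logic in threads/threads.c.
   Illustration only: the synchronization details of the real code are
   elided. */

struct worker {
     pthread_t thread;
     struct worker *cdr;             /* next idle worker in the free list */
};

static struct worker *worker_queue;  /* idle workers */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker_body(void *arg) { return arg; }  /* stub for this sketch */

static struct worker *grab_worker(void)
{
     struct worker *q;

     pthread_mutex_lock(&queue_lock);
     if ((q = worker_queue))
          worker_queue = q->cdr;     /* reuse an idle worker */
     pthread_mutex_unlock(&queue_lock);

     if (!q) {
          /* Pool exhausted: create a brand-new thread on the spot. This is
             the path threads 15 and 16 hit above, which is how the process
             ends up with more threads than requested. */
          q = malloc(sizeof *q);
          pthread_create(&q->thread, NULL, worker_body, q);
     }
     return q;
}

int main(void)
{
     struct worker *w = grab_worker();  /* empty pool -> a new thread */
     pthread_join(w->thread, NULL);
     free(w);
     return 0;
}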

matteo-frigo commented 3 years ago

Hi Minh Quan (I hope I am guessing your first name correctly).

Your report is correct: there are indeed cases where the FFTW planner creates more threads than you ask for. It is hard to explain exactly when this happens, but the behavior is ultimately caused by this logic around line 160 in threads/dft-vrank-geq1.c:

 block_size = (d->n + plnr->nthr - 1) / plnr->nthr;
 nthr = (int)((d->n + block_size - 1) / block_size);
 plnr->nthr = (plnr->nthr + nthr - 1) / nthr;

This logic solves the following problem. Assume I want to execute a loop of length D->N using PLNR->NTHR threads (the original value of PLNR->NTHR, before the third line reassigns it). In the normal case we have D->N >= PLNR->NTHR, so the problem is to split the D->N iterations into chunks of size BLOCK_SIZE, with each chunk executing in one thread.

The hard case is when D->N < PLNR->NTHR. In this case, each iteration of the loop handles one of the D->N values and runs in more than one thread, specifically in PLNR->NTHR threads as set by the last line. As you can see, that value is rounded up, which causes the issue you are reporting.
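To make the arithmetic concrete, here is a small standalone program replaying the three quoted lines with D->N = 3 and PLNR->NTHR = 16 (illustrative values chosen to match the loopmax/nthr pairs in the gdb log above; this is a sketch, not FFTW code):

#include <stdio.h>

/* Replays the rounding logic quoted from threads/dft-vrank-geq1.c
   with the illustrative values n = 3, nthr = 16. */
int main(void)
{
     int n = 3;                                       /* d->n */
     int nthr = 16;                                   /* requested plnr->nthr */
     int block_size = (n + nthr - 1) / nthr;          /* = 1 */
     int iters = (n + block_size - 1) / block_size;   /* = 3 loop iterations */
     int per_iter = (nthr + iters - 1) / iters;       /* = 6, rounded UP */

     printf("%d iterations x %d threads = %d (requested %d)\n",
            iters, per_iter, iters * per_iter, nthr); /* 3 x 6 = 18 > 16 */
     return 0;
}

Three loop iterations each plan for 6 threads, 18 in total, which is consistent with the 18 threads observed in the report.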

There is no easy way out of this problem. You can round down if you want (here and in a few other places in the threads/ directory), leaving some performance on the table. There is probably no point in attempting a nonuniform number of threads (e.g., partitioning NTHR=10 and N=3 into three loop iterations using 4, 3, and 3 threads respectively), because this blows up the planning time for no gain: we would have to plan twice, for 4 and for 3 threads, in this example, instead of just planning once for 4 threads, accepting that 12 threads will be created, and calling it a day.
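(In the worked example above, rounding down would instead give 16 / 3 = 5 threads per iteration, i.e. 3 x 5 = 15 <= 16: never more threads than requested, at the price of one thread's worth of parallelism per iteration.)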

How much of a problem is this for you? You are the first one to complain in 20 years, and given that this has performance implications, I am reluctant to just change the code to round down.

hominhquan commented 3 years ago

Hi Matteo,

Nice to meet you, and yes, you guessed my first name correctly.

Thanks for your reply. Let me sum up my understanding: the number of threads for each dimension of a multi-dimensional problem is nudged (rounded up) toward a common value as much as possible, so that the same plan can be reused for all of them, saving planning time, and we accept that this may spawn more threads. If that is correct, then yes, I agree that nudging the thread count is worthwhile in those cases.

Now for my context: I'm working on porting FFTW to the Kalray MPPA, an embedded DSP-like many-core processor. Each of our compute units is like a 16-core SMP CPU, and the underlying embedded OS supports at most 16 threads (one per core). So the fact that the FFTW planner spawns more than 16 threads in some cases raises execution errors on my architecture. Generally speaking, determinism of resources and footprint is my most important concern.

I'm thinking about slightly changing the current protocol of the POSIX threads.c backend: the main thread(s) would no longer enqueue/dequeue workers (i.e. threads) from the worker_queue. Instead, jobs (currently work's) would be pushed into a new shared job_queue, and workers (all created at initialization, e.g. in fftw_init_threads(int nthreads)) would pop jobs from the job_queue and process them. In effect, we switch from "moving workers around" to "moving jobs around". This would allow better load balancing and support possibly "hierarchical" spawn_loop's, like the orf256x256 case, on embedded platforms with a non-Linux OS. Can you give your opinion on that? (A minimal sketch of what I mean follows.)
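To make the idea concrete, here is a minimal sketch of such a fixed-pool job queue. All names here (job, job_queue, job_push, ...) are hypothetical illustrations, not FFTW API, and completion signaling is deliberately elided:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NWORKERS 4                    /* fixed pool size, e.g. 16 on MPPA */

struct job {
     void (*proc)(void *data);
     void *data;
     struct job *next;
};

static struct job *job_queue;         /* shared list of pending jobs */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qcond = PTHREAD_COND_INITIALIZER;

static void job_push(struct job *j)
{
     pthread_mutex_lock(&qlock);
     j->next = job_queue;
     job_queue = j;
     pthread_cond_signal(&qcond);     /* wake one idle worker */
     pthread_mutex_unlock(&qlock);
}

static struct job *job_pop(void)
{
     struct job *j;
     pthread_mutex_lock(&qlock);
     while (!job_queue)
          pthread_cond_wait(&qcond, &qlock);
     j = job_queue;
     job_queue = j->next;
     pthread_mutex_unlock(&qlock);
     return j;
}

static void *worker_main(void *arg)
{
     (void) arg;
     for (;;) {                       /* workers never create threads... */
          struct job *j = job_pop();  /* ...they only pop and run jobs */
          j->proc(j->data);
     }
     return NULL;
}

static void say(void *data) { printf("job %d\n", *(int *) data); }

int main(void)
{
     pthread_t pool[NWORKERS];
     struct job jobs[8];
     int ids[8], i;

     for (i = 0; i < NWORKERS; ++i)   /* the whole pool exists up front */
          pthread_create(&pool[i], NULL, worker_main, NULL);

     for (i = 0; i < 8; ++i) {        /* a nested spawn_loop would push here */
          ids[i] = i;
          jobs[i].proc = say;
          jobs[i].data = &ids[i];
          job_push(&jobs[i]);
     }

     sleep(1);                        /* crude stand-in for completion signaling */
     return 0;
}

The point of the sketch: a nested spawn_loop only adds entries to job_queue, so the thread count can never exceed the pool created at initialization.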

Anyway, I was pleasantly surprised to be the first one to notice this in 20 years :)