hominhquan opened 3 years ago
I saw that with `orf256x256`, thread 0 spawned 15 other threads in the first `X(spawn_loop)`, and then the new threads (I don't know why) also spawned other threads. This can be checked by putting a breakpoint at `fftw_spawn_loop`:
```
$ gdb -- ./bench -oestimate -onthreads=16 -v1 --verify orf256x256
Excess command line arguments ignored. (-onthreads=16 ...)
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./bench...done.
(gdb) #r -oestimate -onthreads=16 -v1 --verify orf256x256
(gdb) b fftw_spawn_loop
Breakpoint 1 at 0xc4c0: file threads.c, line 481.
(gdb) r -oestimate -onthreads=16 -v1 --verify orf256x256
Starting program: /work1/mqho/BU_EMB/KAF_libraries/fftw_bug/fftw-3.3.9/tests/bench -oestimate -onthreads=16 -v1 --verify orf256x256
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, fftw_spawn_loop (loopmax=16, nthr=16, proc=proc@entry=0x555555562690 <spawn_apply>, data=data@entry=0x7fffffffddc0) at threads.c:481
481 {
(gdb) i th
Id Target Id Frame
* 1 Thread 0x7ffff7fb4b80 (LWP 20542) "bench" fftw_spawn_loop (loopmax=16, nthr=16, proc=proc@entry=0x555555562690 <spawn_apply>, data=data@entry=0x7fffffffddc0)
at threads.c:481
(gdb) c
Continuing.
[New Thread 0x7ffff6d1f700 (LWP 20550)]
Core -134526080 created 1 threads
[New Thread 0x7ffff651e700 (LWP 20551)]
Core -134526080 created 2 threads
[New Thread 0x7ffff5d1d700 (LWP 20552)]
Core -134526080 created 3 threads
[New Thread 0x7ffff551c700 (LWP 20553)]
Core -134526080 created 4 threads
[New Thread 0x7ffff4d1b700 (LWP 20554)]
Core -134526080 created 5 threads
[New Thread 0x7ffff451a700 (LWP 20555)]
Core -134526080 created 6 threads
[New Thread 0x7ffff3d19700 (LWP 20556)]
Core -134526080 created 7 threads
[New Thread 0x7ffff3518700 (LWP 20557)]
Core -134526080 created 8 threads
[New Thread 0x7ffff2d17700 (LWP 20558)]
Core -134526080 created 9 threads
[New Thread 0x7ffff2516700 (LWP 20559)]
Core -134526080 created 10 threads
[New Thread 0x7ffff1d15700 (LWP 20560)]
Core -134526080 created 11 threads
[New Thread 0x7ffff1514700 (LWP 20561)]
Core -134526080 created 12 threads
[New Thread 0x7ffff0d13700 (LWP 20562)]
Core -134526080 created 13 threads
[New Thread 0x7ffff0512700 (LWP 20563)]
Core -134526080 created 14 threads
[New Thread 0x7fffefd11700 (LWP 20564)]
Core -134526080 created 15 threads
Thread 1 "bench" hit Breakpoint 1, fftw_spawn_loop (loopmax=3, nthr=3, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7fffffffdd50) at threads.c:481
481 {
(gdb) i th
Id Target Id Frame
* 1 Thread 0x7ffff7fb4b80 (LWP 20542) "bench" fftw_spawn_loop (loopmax=3, nthr=3, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7fffffffdd50)
at threads.c:481
2 Thread 0x7ffff6d1f700 (LWP 20550) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bff0)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
3 Thread 0x7ffff651e700 (LWP 20551) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c0e0)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
4 Thread 0x7ffff5d1d700 (LWP 20552) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c4b0)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
5 Thread 0x7ffff551c700 (LWP 20553) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c250)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
6 Thread 0x7ffff4d1b700 (LWP 20554) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586be40)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
7 Thread 0x7ffff451a700 (LWP 20555) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x555555883560)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
8 Thread 0x7ffff3d19700 (LWP 20556) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x555555883500)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
9 Thread 0x7ffff3518700 (LWP 20557) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bca0)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
10 Thread 0x7ffff2d17700 (LWP 20558) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x5555558a0d20)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
11 Thread 0x7ffff2516700 (LWP 20559) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c320)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
12 Thread 0x7ffff1d15700 (LWP 20560) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bea0)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
13 Thread 0x7ffff1514700 (LWP 20561) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555587b920)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
14 Thread 0x7ffff0d13700 (LWP 20562) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555588fd20)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
15 Thread 0x7ffff0512700 (LWP 20563) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555588ada0)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
16 Thread 0x7fffefd11700 (LWP 20564) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bc20)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
(gdb) c
Continuing.
Thread 1 "bench" hit Breakpoint 1, fftw_spawn_loop (loopmax=6, nthr=6, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7fffffffdb90) at threads.c:481
481 {
(gdb) i th
Id Target Id Frame
* 1 Thread 0x7ffff7fb4b80 (LWP 20542) "bench" fftw_spawn_loop (loopmax=6, nthr=6, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7fffffffdb90)
at threads.c:481
2 Thread 0x7ffff6d1f700 (LWP 20550) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bff0)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
3 Thread 0x7ffff651e700 (LWP 20551) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c0e0)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
4 Thread 0x7ffff5d1d700 (LWP 20552) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c4b0)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
5 Thread 0x7ffff551c700 (LWP 20553) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c250)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
6 Thread 0x7ffff4d1b700 (LWP 20554) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586be40)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
7 Thread 0x7ffff451a700 (LWP 20555) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x555555883560)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
8 Thread 0x7ffff3d19700 (LWP 20556) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x555555883500)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
9 Thread 0x7ffff3518700 (LWP 20557) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bca0)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
10 Thread 0x7ffff2d17700 (LWP 20558) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x5555558a0d20)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
11 Thread 0x7ffff2516700 (LWP 20559) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586c320)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
12 Thread 0x7ffff1d15700 (LWP 20560) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555586bea0)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
13 Thread 0x7ffff1514700 (LWP 20561) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555587b920)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
14 Thread 0x7ffff0d13700 (LWP 20562) "bench" 0x00007ffff78286e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555588fd20)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
15 Thread 0x7ffff0512700 (LWP 20563) "bench" fftw_spawn_loop (loopmax=6, nthr=6, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7ffff0511e40)
at threads.c:481
16 Thread 0x7fffefd11700 (LWP 20564) "bench" fftw_spawn_loop (loopmax=6, nthr=6, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7fffefd10e40)
at threads.c:481
```
Here, at the end of the GDB log, we see that threads 15 and 16 are themselves trying to spawn other threads (2 × 6 = 12 threads, i.e., 16 + 12 = 28 threads in total):
```
 15   Thread 0x7ffff0512700 (LWP 20563) "bench" fftw_spawn_loop (loopmax=6, nthr=6, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7ffff0511e40)
    at threads.c:481
 16   Thread 0x7fffefd11700 (LWP 20564) "bench" fftw_spawn_loop (loopmax=6, nthr=6, proc=proc@entry=0x555555560950 <spawn_apply>, data=data@entry=0x7fffefd10e40)
    at threads.c:481
```
The point is: if the threads in the worker pool (`static struct worker *worker_queue;`) are still busy with the work given to them by the main thread, a later `spawn_loop()` by thread 15 or thread 16 will try to `dequeue()`, find no worker available in the pool, and then create brand-new threads: https://github.com/FFTW/fftw3/blob/master/threads/threads.c#L341-L345. That may explain why profiling with `perf record` and `strace` showed up to 18 threads spawned in total.
Hi Minh Quan (I hope I am guessing your first name correctly).
Your report is correct: there are indeed cases where the FFTW planner creates more threads than you ask for. It is hard to explain exactly when this happens, but the behavior is ultimately caused by this logic around line 160 in `threads/dft-vrank-geq1.c`:
```c
block_size = (d->n + plnr->nthr - 1) / plnr->nthr;
nthr = (int)((d->n + block_size - 1) / block_size);
plnr->nthr = (plnr->nthr + nthr - 1) / nthr;
```
This logic solves the following problem. Assume I want to execute a loop of length D->N using PLNR->NTHR threads (using the original value of PLNR->NTHR, before setting it in the third line). In the normal case, we have D->N >= PLNR->NTHR, so the problem is to split N into chunks of size BLOCK_SIZE, each chunk being executed in one thread.
The hard case is when D->N < PLNR->NTHR. In this case, each iteration of the loop consumes one N and runs in more than one thread, specifically in PLNR->NTHR threads as set by the last line. As you can see, that value is rounded up, which causes the issue you are reporting.
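To make this concrete, plug in the numbers from your GDB log (D->N = 3, PLNR->NTHR = 16); the following standalone snippet (an illustration, not FFTW code) reproduces the arithmetic:

```c
#include <stdio.h>

/* The three lines from threads/dft-vrank-geq1.c, applied to the
   values observed in the GDB session: n = 3, 16 threads requested. */
int main(void)
{
    int n = 3, requested = 16;

    int block_size = (n + requested - 1) / requested;  /* ceil(3/16) = 1 */
    int nthr = (n + block_size - 1) / block_size;      /* 3 loop threads */
    int per_iter = (requested + nthr - 1) / nthr;      /* ceil(16/3) = 6 */

    /* 3 iterations x 6 threads each = 18 threads for 16 requested,
       matching spawn_loop(loopmax=3, nthr=3) followed by
       spawn_loop(loopmax=6, nthr=6) in the log, and the 18 threads
       seen under perf/strace. Rounding down would give 16/3 = 5,
       i.e., 3 x 5 = 15 <= 16, at the cost of one idle thread. */
    printf("%d x %d = %d threads (requested %d)\n",
           nthr, per_iter, nthr * per_iter, requested);
    return 0;
}
```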
There is no easy way out of this problem. You can round down if you want (here and in a few other places in the threads/ directory), leaving some performance on the table. There is probably no point in attempting a nonuniform number of threads (e.g., partitioning NTHR=10 and N=3 into three loop iterations using 4, 3, and 3 threads respectively), because this blows up the planning time for no gain (we would have to plan twice, for 4 and for 3 threads, in this example, instead of just planning for 4 threads, accepting that 12 threads will be created, and calling it a day).
How much of a problem is this for you? You are the first one to complain in 20 years, and given that this has performance implications, I am reluctant to just change the code to round down.
Hi Matteo,
Nice to meet you, and yes, you guessed my first name correctly.
Thanks for your reply. Let me sum up my understanding: the thread count for each dimension of a "multi-dimensional" problem is nudged (rounded up) toward a common value as much as possible, so that the same plan can be reused for all of them, saving planning time, at the cost of spawning a few extra threads. If that is correct, then yes, I agree that nudging the thread count is worthwhile in those cases.
Now for my context: I'm working on porting FFTW to the Kalray MPPA, an embedded DSP-like many-core processor. Each of our compute units is like a 16-core SMP CPU, and the underlying embedded OS supports at most 16 threads (one per core). So the fact that the FFTW planner spawns more than 16 threads in some cases causes execution errors on my architecture. Generally speaking, resource and footprint determinism is my most important concern.
I'm thinking about changing the current protocol of the POSIX `threads.c` backend a bit: the main thread(s) would no longer enqueue/dequeue workers (i.e., threads) from the `worker_queue`. Instead, jobs (currently `work` structs) would be pushed into a new shared `job_queue`, and the workers (created at initialization, e.g., in `fftw_init_threads(int nthreads)`) would pop jobs from the `job_queue` and process them. In effect, we would switch from "moving workers around" to "moving jobs around". This would allow better load balancing and would support possible "hierarchical" `spawn_loop`s, like the `orf256x256` case, on embedded platforms with a non-Linux OS. Can you give your opinion on that?
Anyway, I was pleasantly surprised to be the first one to notice this in 20 years :)
Config:
Problem: With the default benchmark/testing program `./tests/bench`, I fall into a case where more threads are created than given by `-onthreads=<N>`: there are 18 threads created over 16 given. Another trace with `perf record`, and with `strace`, shows the same count.