Implement a time delay estimator based on FFTW

coryan commented 9 years ago

At the end of this task we will have an implementation of a time delay estimator that uses cross-correlation to estimate the delay, and uses the FFTW library to compute the FFT and FFT inverse.

If possible, we will try to vectorize the computation of the maximum value.

The objective is to get some empirical numbers about computational costs.

coryan commented 9 years ago

For a cross-correlation with 2048 points I am getting (all percentile numbers represent microseconds):

test case	min	p25	p50	p75	p90	p99	p99.9	max	N
double:aligned	39	39	39	39	39	43	50	659	1000000
float:aligned	27	27	27	27	27	30	31	109	1000000
double:unaligned	48	49	49	49	49	53	54	971	1000000
float:unaligned	45	45	45	45	45	48	49	182	1000000

It is clear that using SIMD instructions is key (which depend on alignment), and floating precision matters a lot too. A full dump of the test run follows:

# ... set the CPU power management to optimize for performance ...
sudo cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
# ... make sure we reset the 
trap "sudo cpupower frequency-set -g ondemand" 0
# ... run different iterations of the benchmark ...
for test in double:aligned float:aligned double:unaligned float:unaligned; do
    echo ---
    ./jb/fftw/bm_time_delay_estimator --test-case=${test} --iterations=1000000
done
---
Configuration for test
iterations: 1000000
size: 0
warmup-iterations: 100
verbose: false
test-case: double:aligned
double:aligned summary min=39, p25=39, p50=39, p75=39, p90=39, p99=43, p99.9=50, max=659, N=1000000
---
Configuration for test
iterations: 1000000
size: 0
warmup-iterations: 100
verbose: false
test-case: float:aligned
float:aligned summary min=27, p25=27, p50=27, p75=27, p90=27, p99=30, p99.9=31, max=109, N=1000000
---
Configuration for test
iterations: 1000000
size: 0
warmup-iterations: 100
verbose: false
test-case: double:unaligned
double:unaligned summary min=49, p25=49, p50=49, p75=49, p90=49, p99=53, p99.9=54, max=971, N=1000000
---
Configuration for test
iterations: 1000000
size: 0
warmup-iterations: 100
verbose: false
test-case: float:unaligned
float:unaligned summary min=45, p25=45, p50=45, p75=45, p90=45, p99=48, p99.9=49, max=182, N=1000000
# ... print out compilation details ...
cat jb/testing/compile_info.cpp
#include 
char const jb::testing::uname_a[] = R"""(
Linux glamdring.home 4.1.3-100.fc21.x86_64 #1 SMP Wed Jul 29 18:59:46 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
)""";
char const jb::testing::compiler[] = R"""(
clang version 3.5.0 (tags/RELEASE_350/final)
Target: x86_64-redhat-linux-gnu
Thread model: posix
)""";
char const jb::testing::compiler_flags[] = R"""(
-O3 -std=c++11 -Wall -Qunused-arguments
)""";
char const jb::testing::linker[] = R"""(
GNU ld version 2.24
Copyright 2013 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.
)""";
# ... print out the available memory ...
free
              total        used        free      shared  buff/cache   available
Mem:       15893640     8232960      120676      211168     7540004     7287676
Swap:       8003580     1517012     6486568
# ... print out how many CPUs (or cores or whatevers) ...
egrep ^processor /proc/cpuinfo 
processor   : 0
processor   : 1
processor   : 2
processor   : 3
# ... print out the details of the first CPU ...
cat /proc/cpuinfo  | awk '{print $0; if ($0 == "") { exit; }}'
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 18
model       : 1
model name  : AMD A8-3870 APU with Radeon(tm) HD Graphics
stepping    : 0
microcode   : 0x3000027
cpu MHz     : 3000.000
cache size  : 1024 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 6
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt arat hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall
bugs        : fxsave_leak sysret_ss_attrs
bogomips    : 6000.50
TLB size    : 1536 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
exit 0
sudo cpupower frequency-set -g ondemand
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3

coryan commented 9 years ago

Also obtained results for g++:

test case	min	p25	p50	p75	p90	p99	p99.9	max	N
double:aligned	41	41	41	41	42	44	47	967	1000000
float:aligned	57	57	57	57	58	60	67	1083	1000000
double:unaligned	52	53	53	54	54	57	58	1120	1000000
float:unaligned	74	75	75	75	75	77	82	3456	1000000

# ... set the CPU power management to optimize for performance ...
sudo cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
# ... make sure we reset the 
trap "sudo cpupower frequency-set -g ondemand" 0
# ... run different iterations of the benchmark ...
for test in double:aligned float:aligned double:unaligned float:unaligned; do
    echo ---
    ./jb/fftw/bm_time_delay_estimator --test-case=${test} --iterations=1000000
done
---
Configuration for test
iterations: 1000000
warmup-iterations: 100
size: 0
verbose: false
test-case: double:aligned
double:aligned summary min=41, p25=41, p50=41, p75=41, p90=42, p99=44, p99.9=47, max=967, N=1000000
---
Configuration for test
iterations: 1000000
warmup-iterations: 100
size: 0
verbose: false
test-case: float:aligned
float:aligned summary min=57, p25=57, p50=57, p75=57, p90=58, p99=60, p99.9=67, max=1083, N=1000000
---
Configuration for test
iterations: 1000000
warmup-iterations: 100
size: 0
verbose: false
test-case: double:unaligned
double:unaligned summary min=52, p25=53, p50=53, p75=54, p90=54, p99=57, p99.9=58, max=1120, N=1000000
---
Configuration for test
iterations: 1000000
warmup-iterations: 100
size: 0
verbose: false
test-case: float:unaligned
float:unaligned summary min=74, p25=75, p50=75, p75=75, p90=75, p99=77, p99.9=82, max=3456, N=1000000
# ... print out the current git revision ...
git rev-parse HEAD
5f5b82cb0ce9b86ac89c606440b4d7a479f8aa7d
# ... print out compilation details ...
cat jb/testing/compile_info.cpp
#include 
char const jb::testing::uname_a[] = R"""(
Linux glamdring.home 4.1.3-100.fc21.x86_64 #1 SMP Wed Jul 29 18:59:46 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
)""";
char const jb::testing::compiler[] = R"""(
g++ (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
)""";
char const jb::testing::compiler_flags[] = R"""(
-O3 -std=c++11 -Wall
)""";
char const jb::testing::linker[] = R"""(
GNU ld version 2.24
Copyright 2013 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.
)""";
# ... print out the available memory ...
free
              total        used        free      shared  buff/cache   available
Mem:       15893640     8337520      852684      195188     6703436     7191180
Swap:       8003580     1516412     6487168
# ... print out how many CPUs (or cores or whatevers) ...
egrep ^processor /proc/cpuinfo 
processor   : 0
processor   : 1
processor   : 2
processor   : 3
# ... print out the details of the first CPU ...
cat /proc/cpuinfo  | awk '{print $0; if ($0 == "") { exit; }}'
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 18
model       : 1
model name  : AMD A8-3870 APU with Radeon(tm) HD Graphics
stepping    : 0
microcode   : 0x3000027
cpu MHz     : 3000.000
cache size  : 1024 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 6
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt arat hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall
bugs        : fxsave_leak sysret_ss_attrs
bogomips    : 6000.50
TLB size    : 1536 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
exit 0
sudo cpupower frequency-set -g ondemand
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3

coryan / jaybeams

Implement a time delay estimator based on FFTW #7