Closed coryan closed 9 years ago
For a cross-correlation with 2048 points I am getting (all percentile numbers represent microseconds):
test case | min | p25 | p50 | p75 | p90 | p99 | p99.9 | max | N |
---|---|---|---|---|---|---|---|---|---|
double:aligned | 39 | 39 | 39 | 39 | 39 | 43 | 50 | 659 | 1000000 |
float:aligned | 27 | 27 | 27 | 27 | 27 | 30 | 31 | 109 | 1000000 |
double:unaligned | 48 | 49 | 49 | 49 | 49 | 53 | 54 | 971 | 1000000 |
float:unaligned | 45 | 45 | 45 | 45 | 45 | 48 | 49 | 182 | 1000000 |
It is clear that using SIMD instructions is key (which depend on alignment), and floating precision matters a lot too. A full dump of the test run follows:
# ... set the CPU power management to optimize for performance ... sudo cpupower frequency-set -g performance Setting cpu: 0 Setting cpu: 1 Setting cpu: 2 Setting cpu: 3 # ... make sure we reset the trap "sudo cpupower frequency-set -g ondemand" 0 # ... run different iterations of the benchmark ... for test in double:aligned float:aligned double:unaligned float:unaligned; do echo --- ./jb/fftw/bm_time_delay_estimator --test-case=${test} --iterations=1000000 done --- Configuration for test iterations: 1000000 size: 0 warmup-iterations: 100 verbose: false test-case: double:aligned double:aligned summary min=39, p25=39, p50=39, p75=39, p90=39, p99=43, p99.9=50, max=659, N=1000000 --- Configuration for test iterations: 1000000 size: 0 warmup-iterations: 100 verbose: false test-case: float:aligned float:aligned summary min=27, p25=27, p50=27, p75=27, p90=27, p99=30, p99.9=31, max=109, N=1000000 --- Configuration for test iterations: 1000000 size: 0 warmup-iterations: 100 verbose: false test-case: double:unaligned double:unaligned summary min=49, p25=49, p50=49, p75=49, p90=49, p99=53, p99.9=54, max=971, N=1000000 --- Configuration for test iterations: 1000000 size: 0 warmup-iterations: 100 verbose: false test-case: float:unaligned float:unaligned summary min=45, p25=45, p50=45, p75=45, p90=45, p99=48, p99.9=49, max=182, N=1000000 # ... print out compilation details ... cat jb/testing/compile_info.cpp #includechar const jb::testing::uname_a[] = R"""( Linux glamdring.home 4.1.3-100.fc21.x86_64 #1 SMP Wed Jul 29 18:59:46 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux )"""; char const jb::testing::compiler[] = R"""( clang version 3.5.0 (tags/RELEASE_350/final) Target: x86_64-redhat-linux-gnu Thread model: posix )"""; char const jb::testing::compiler_flags[] = R"""( -O3 -std=c++11 -Wall -Qunused-arguments )"""; char const jb::testing::linker[] = R"""( GNU ld version 2.24 Copyright 2013 Free Software Foundation, Inc. This program is free software; you may redistribute it under the terms of the GNU General Public License version 3 or (at your option) a later version. This program has absolutely no warranty. )"""; # ... print out the available memory ... free total used free shared buff/cache available Mem: 15893640 8232960 120676 211168 7540004 7287676 Swap: 8003580 1517012 6486568 # ... print out how many CPUs (or cores or whatevers) ... egrep ^processor /proc/cpuinfo processor : 0 processor : 1 processor : 2 processor : 3 # ... print out the details of the first CPU ... cat /proc/cpuinfo | awk '{print $0; if ($0 == "") { exit; }}' processor : 0 vendor_id : AuthenticAMD cpu family : 18 model : 1 model name : AMD A8-3870 APU with Radeon(tm) HD Graphics stepping : 0 microcode : 0x3000027 cpu MHz : 3000.000 cache size : 1024 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 6 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt arat hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall bugs : fxsave_leak sysret_ss_attrs bogomips : 6000.50 TLB size : 1536 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts ttp tm stc 100mhzsteps hwpstate exit 0 sudo cpupower frequency-set -g ondemand Setting cpu: 0 Setting cpu: 1 Setting cpu: 2 Setting cpu: 3
Also obtained results for g++:
test case | min | p25 | p50 | p75 | p90 | p99 | p99.9 | max | N |
---|---|---|---|---|---|---|---|---|---|
double:aligned | 41 | 41 | 41 | 41 | 42 | 44 | 47 | 967 | 1000000 |
float:aligned | 57 | 57 | 57 | 57 | 58 | 60 | 67 | 1083 | 1000000 |
double:unaligned | 52 | 53 | 53 | 54 | 54 | 57 | 58 | 1120 | 1000000 |
float:unaligned | 74 | 75 | 75 | 75 | 75 | 77 | 82 | 3456 | 1000000 |
# ... set the CPU power management to optimize for performance ... sudo cpupower frequency-set -g performance Setting cpu: 0 Setting cpu: 1 Setting cpu: 2 Setting cpu: 3 # ... make sure we reset the trap "sudo cpupower frequency-set -g ondemand" 0 # ... run different iterations of the benchmark ... for test in double:aligned float:aligned double:unaligned float:unaligned; do echo --- ./jb/fftw/bm_time_delay_estimator --test-case=${test} --iterations=1000000 done --- Configuration for test iterations: 1000000 warmup-iterations: 100 size: 0 verbose: false test-case: double:aligned double:aligned summary min=41, p25=41, p50=41, p75=41, p90=42, p99=44, p99.9=47, max=967, N=1000000 --- Configuration for test iterations: 1000000 warmup-iterations: 100 size: 0 verbose: false test-case: float:aligned float:aligned summary min=57, p25=57, p50=57, p75=57, p90=58, p99=60, p99.9=67, max=1083, N=1000000 --- Configuration for test iterations: 1000000 warmup-iterations: 100 size: 0 verbose: false test-case: double:unaligned double:unaligned summary min=52, p25=53, p50=53, p75=54, p90=54, p99=57, p99.9=58, max=1120, N=1000000 --- Configuration for test iterations: 1000000 warmup-iterations: 100 size: 0 verbose: false test-case: float:unaligned float:unaligned summary min=74, p25=75, p50=75, p75=75, p90=75, p99=77, p99.9=82, max=3456, N=1000000 # ... print out the current git revision ... git rev-parse HEAD 5f5b82cb0ce9b86ac89c606440b4d7a479f8aa7d # ... print out compilation details ... cat jb/testing/compile_info.cpp #includechar const jb::testing::uname_a[] = R"""( Linux glamdring.home 4.1.3-100.fc21.x86_64 #1 SMP Wed Jul 29 18:59:46 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux )"""; char const jb::testing::compiler[] = R"""( g++ (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6) Copyright (C) 2014 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. )"""; char const jb::testing::compiler_flags[] = R"""( -O3 -std=c++11 -Wall )"""; char const jb::testing::linker[] = R"""( GNU ld version 2.24 Copyright 2013 Free Software Foundation, Inc. This program is free software; you may redistribute it under the terms of the GNU General Public License version 3 or (at your option) a later version. This program has absolutely no warranty. )"""; # ... print out the available memory ... free total used free shared buff/cache available Mem: 15893640 8337520 852684 195188 6703436 7191180 Swap: 8003580 1516412 6487168 # ... print out how many CPUs (or cores or whatevers) ... egrep ^processor /proc/cpuinfo processor : 0 processor : 1 processor : 2 processor : 3 # ... print out the details of the first CPU ... cat /proc/cpuinfo | awk '{print $0; if ($0 == "") { exit; }}' processor : 0 vendor_id : AuthenticAMD cpu family : 18 model : 1 model name : AMD A8-3870 APU with Radeon(tm) HD Graphics stepping : 0 microcode : 0x3000027 cpu MHz : 3000.000 cache size : 1024 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 6 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt arat hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall bugs : fxsave_leak sysret_ss_attrs bogomips : 6000.50 TLB size : 1536 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts ttp tm stc 100mhzsteps hwpstate exit 0 sudo cpupower frequency-set -g ondemand Setting cpu: 0 Setting cpu: 1 Setting cpu: 2 Setting cpu: 3
At the end of this task we will have an implementation of a time delay estimator that uses cross-correlation to estimate the delay, and uses the FFTW library to compute the FFT and FFT inverse.
If possible, we will try to vectorize the computation of the maximum value.
The objective is to get some empirical numbers about computational costs.