coryan / jaybeams

JayBeams: A Project to have fun Coding, and maybe measure relative delays in market feeds
Apache License 2.0
2 stars 2 forks source link

Implement a time delay estimator based on FFTW #7

Closed coryan closed 9 years ago

coryan commented 9 years ago

At the end of this task we will have an implementation of a time delay estimator that uses cross-correlation to estimate the delay, and uses the FFTW library to compute the FFT and FFT inverse.

If possible, we will try to vectorize the computation of the maximum value.

The objective is to get some empirical numbers about computational costs.

coryan commented 9 years ago

For a cross-correlation with 2048 points I am getting (all percentile numbers represent microseconds):

test case min p25 p50 p75 p90 p99 p99.9 max N
double:aligned 39 39 39 39 39 43 50 659 1000000
float:aligned 27 27 27 27 27 30 31 109 1000000
double:unaligned 48 49 49 49 49 53 54 971 1000000
float:unaligned 45 45 45 45 45 48 49 182 1000000

It is clear that using SIMD instructions is key (which depend on alignment), and floating precision matters a lot too. A full dump of the test run follows:

# ... set the CPU power management to optimize for performance ...
sudo cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
# ... make sure we reset the 
trap "sudo cpupower frequency-set -g ondemand" 0
# ... run different iterations of the benchmark ...
for test in double:aligned float:aligned double:unaligned float:unaligned; do
    echo ---
    ./jb/fftw/bm_time_delay_estimator --test-case=${test} --iterations=1000000
done
---
Configuration for test
iterations: 1000000
size: 0
warmup-iterations: 100
verbose: false
test-case: double:aligned
double:aligned summary min=39, p25=39, p50=39, p75=39, p90=39, p99=43, p99.9=50, max=659, N=1000000
---
Configuration for test
iterations: 1000000
size: 0
warmup-iterations: 100
verbose: false
test-case: float:aligned
float:aligned summary min=27, p25=27, p50=27, p75=27, p90=27, p99=30, p99.9=31, max=109, N=1000000
---
Configuration for test
iterations: 1000000
size: 0
warmup-iterations: 100
verbose: false
test-case: double:unaligned
double:unaligned summary min=49, p25=49, p50=49, p75=49, p90=49, p99=53, p99.9=54, max=971, N=1000000
---
Configuration for test
iterations: 1000000
size: 0
warmup-iterations: 100
verbose: false
test-case: float:unaligned
float:unaligned summary min=45, p25=45, p50=45, p75=45, p90=45, p99=48, p99.9=49, max=182, N=1000000
# ... print out compilation details ...
cat jb/testing/compile_info.cpp
#include 
char const jb::testing::uname_a[] = R"""(
Linux glamdring.home 4.1.3-100.fc21.x86_64 #1 SMP Wed Jul 29 18:59:46 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
)""";
char const jb::testing::compiler[] = R"""(
clang version 3.5.0 (tags/RELEASE_350/final)
Target: x86_64-redhat-linux-gnu
Thread model: posix
)""";
char const jb::testing::compiler_flags[] = R"""(
-O3 -std=c++11 -Wall -Qunused-arguments
)""";
char const jb::testing::linker[] = R"""(
GNU ld version 2.24
Copyright 2013 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.
)""";
# ... print out the available memory ...
free
              total        used        free      shared  buff/cache   available
Mem:       15893640     8232960      120676      211168     7540004     7287676
Swap:       8003580     1517012     6486568
# ... print out how many CPUs (or cores or whatevers) ...
egrep ^processor /proc/cpuinfo 
processor   : 0
processor   : 1
processor   : 2
processor   : 3
# ... print out the details of the first CPU ...
cat /proc/cpuinfo  | awk '{print $0; if ($0 == "") { exit; }}'
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 18
model       : 1
model name  : AMD A8-3870 APU with Radeon(tm) HD Graphics
stepping    : 0
microcode   : 0x3000027
cpu MHz     : 3000.000
cache size  : 1024 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 6
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt arat hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall
bugs        : fxsave_leak sysret_ss_attrs
bogomips    : 6000.50
TLB size    : 1536 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
exit 0
sudo cpupower frequency-set -g ondemand
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
coryan commented 9 years ago

Also obtained results for g++:

test case min p25 p50 p75 p90 p99 p99.9 max N
double:aligned 41 41 41 41 42 44 47 967 1000000
float:aligned 57 57 57 57 58 60 67 1083 1000000
double:unaligned 52 53 53 54 54 57 58 1120 1000000
float:unaligned 74 75 75 75 75 77 82 3456 1000000
# ... set the CPU power management to optimize for performance ...
sudo cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
# ... make sure we reset the 
trap "sudo cpupower frequency-set -g ondemand" 0
# ... run different iterations of the benchmark ...
for test in double:aligned float:aligned double:unaligned float:unaligned; do
    echo ---
    ./jb/fftw/bm_time_delay_estimator --test-case=${test} --iterations=1000000
done
---
Configuration for test
iterations: 1000000
warmup-iterations: 100
size: 0
verbose: false
test-case: double:aligned
double:aligned summary min=41, p25=41, p50=41, p75=41, p90=42, p99=44, p99.9=47, max=967, N=1000000
---
Configuration for test
iterations: 1000000
warmup-iterations: 100
size: 0
verbose: false
test-case: float:aligned
float:aligned summary min=57, p25=57, p50=57, p75=57, p90=58, p99=60, p99.9=67, max=1083, N=1000000
---
Configuration for test
iterations: 1000000
warmup-iterations: 100
size: 0
verbose: false
test-case: double:unaligned
double:unaligned summary min=52, p25=53, p50=53, p75=54, p90=54, p99=57, p99.9=58, max=1120, N=1000000
---
Configuration for test
iterations: 1000000
warmup-iterations: 100
size: 0
verbose: false
test-case: float:unaligned
float:unaligned summary min=74, p25=75, p50=75, p75=75, p90=75, p99=77, p99.9=82, max=3456, N=1000000
# ... print out the current git revision ...
git rev-parse HEAD
5f5b82cb0ce9b86ac89c606440b4d7a479f8aa7d
# ... print out compilation details ...
cat jb/testing/compile_info.cpp
#include 
char const jb::testing::uname_a[] = R"""(
Linux glamdring.home 4.1.3-100.fc21.x86_64 #1 SMP Wed Jul 29 18:59:46 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
)""";
char const jb::testing::compiler[] = R"""(
g++ (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
)""";
char const jb::testing::compiler_flags[] = R"""(
-O3 -std=c++11 -Wall
)""";
char const jb::testing::linker[] = R"""(
GNU ld version 2.24
Copyright 2013 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.
)""";
# ... print out the available memory ...
free
              total        used        free      shared  buff/cache   available
Mem:       15893640     8337520      852684      195188     6703436     7191180
Swap:       8003580     1516412     6487168
# ... print out how many CPUs (or cores or whatevers) ...
egrep ^processor /proc/cpuinfo 
processor   : 0
processor   : 1
processor   : 2
processor   : 3
# ... print out the details of the first CPU ...
cat /proc/cpuinfo  | awk '{print $0; if ($0 == "") { exit; }}'
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 18
model       : 1
model name  : AMD A8-3870 APU with Radeon(tm) HD Graphics
stepping    : 0
microcode   : 0x3000027
cpu MHz     : 3000.000
cache size  : 1024 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 6
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt arat hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall
bugs        : fxsave_leak sysret_ss_attrs
bogomips    : 6000.50
TLB size    : 1536 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
exit 0
sudo cpupower frequency-set -g ondemand
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3