Closed athoik closed 6 years ago
I was intrigued so I spent some time trying to get code from the spiral website and my NEON version to work together.
Some quick performance results:
rate = 1/3, K = 7, frame size = 2048
nrsc5-NEON
Execution time for 10000 2054-bit frames: 4.96 sec
decoder speed: 4129.03 kbits/s
Spiral C
Execution time for 10000 2054-bit frames: 12.29 sec
decoder speed: 1666.4 kbits/s
Spiral SSE 4-way (using SSE2NEON)
Execution time for 10000 2054-bit frames: 9.73 sec
decoder speed: 2104.83 kbits/s
Spiral SSE 8-way (using SSE2NEON)
Execution time for 10000 2054-bit frames: 5.16 sec
decoder speed: 3968.99 kbits/s
rate = 1/4, K = 7, frame size = 2048
nrsc5-NEON
Execution time for 100000 2054-bit frames: 45.71 sec
decoder speed: 4480.42 kbits/s
Spiral SSE 8-way (using SSE2NEON)
Execution time for 100000 2054-bit frames: 55.93 sec
decoder speed: 3661.72 kbits/s
tl;dr If you are currently using the SSE 8-way from Spiral for NEON, then you may get a minor performance improvement. Obviously, parameters and such may affect the benefit.
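For anyone comparing against these numbers, here is a minimal sketch of how the reported figures appear to be derived. It assumes (inferred from the numbers above, not confirmed by the test program) that each 2054-bit frame is 2048 data bits plus K - 1 = 6 tail bits, and that the speed counts only the 2048 data bits. The function names are made up for illustration.

```python
K = 7  # constraint length used in the benchmark


def frame_bits_with_tail(data_bits, k=K):
    """Data bits plus the k - 1 zero tail bits that flush the encoder."""
    return data_bits + (k - 1)


def decoder_speed_kbits(frames, data_bits, seconds):
    """Throughput in kbits/s, counting data bits only (assumption)."""
    return frames * data_bits / seconds / 1000.0


print(frame_bits_with_tail(2048))                         # 2054
print(round(decoder_speed_kbits(10000, 2048, 4.96), 2))   # 4129.03
```

The same formula applied to the rate 1/4 run (100000 frames in 45.71 sec) lands on the reported 4480.42 kbits/s, which is why the data-bits-only assumption looks right.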
My test code is attached: test-nrsc5-viterbi.zip (https://github.com/JvanKatwijk/dab-cmdline/files/1755395/test-nrsc5-viterbi.zip). I also tested on x86-64; there, the Spiral code is significantly faster than what is currently in nrsc5.
Just for information:
In DAB, each DAB frame (approx. 10 per second) contains 4 FICs of 3072 elements each. The data contents, usually one segment per DAB frame, also have to go through convolutional decoding. Different audio segments have different sizes: most are smaller than 3072 elements, some are larger. DAB uses rate 1/4, K = 7.
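To make the rate 1/4, K = 7 setup concrete, below is a small and slow reference sketch of an encoder and a hard-decision Viterbi decoder. It could serve as a correctness check for an optimized decoder, not as a replacement for one. The generator polynomials are an assumption (the values commonly quoted for the DAB mother code) and should be checked against the standard; the function names are illustrative and do not come from dab-cmdline or nrsc5. Puncturing and soft decisions are left out.

```python
# ASSUMPTION: generator polynomials (octal) commonly quoted for the DAB
# rate-1/4 mother code; verify against the standard before relying on them.
POLYS = (0o133, 0o171, 0o145, 0o133)
K = 7                     # constraint length
NSTATES = 1 << (K - 1)    # 64 trellis states


def parity(x):
    return bin(x).count("1") & 1


def encode(bits):
    """Encode bits plus K - 1 zero tail bits; emits 4 symbols per input bit."""
    state, out = 0, []
    for b in list(bits) + [0] * (K - 1):
        reg = (b << (K - 1)) | state          # newest bit in the MSB
        out.extend(parity(reg & g) for g in POLYS)
        state = reg >> 1
    return out


def decode(symbols, nbits):
    """Hard-decision Viterbi decode of nbits data bits (Hamming metric)."""
    inf = float("inf")
    metrics = [0] + [inf] * (NSTATES - 1)     # encoder starts in state 0
    history = []
    n = len(POLYS)
    for t in range(nbits + K - 1):
        rx = symbols[n * t: n * (t + 1)]
        new = [inf] * NSTATES
        prev = [0] * NSTATES
        for state in range(NSTATES):
            if metrics[state] == inf:
                continue
            for b in (0, 1):
                reg = (b << (K - 1)) | state
                cost = sum(parity(reg & g) != r for g, r in zip(POLYS, rx))
                nxt = reg >> 1
                if metrics[state] + cost < new[nxt]:
                    new[nxt] = metrics[state] + cost
                    prev[nxt] = state
        metrics = new
        history.append(prev)
    # The tail bits drive the encoder back to state 0, so trace back from 0.
    state, bits = 0, []
    for prev in reversed(history):
        bits.append(state >> (K - 2))         # input bit is the MSB of the state
        state = prev[state]
    bits.reverse()
    return bits[:nbits]


msg = [1, 0, 1, 1, 0, 0, 1, 0]
noisy = encode(msg)
noisy[3] ^= 1                                  # flip one channel symbol
print(decode(noisy, len(msg)) == msg)
```

A SIMD decoder computes the same add-compare-select step for many states at once; feeding both the same symbols and comparing the decoded bits is a cheap sanity check.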
-- Jan van Katwijk (in reply to Andrew Wesie's comment above, dated 2018-02-24 23:43 GMT+01:00)
Dear @awesie,
Thanks for your sample!
Indeed your version of viterbi seems to produce better results!
Spiral C
Execution time for 10000 2054-bit frames: 9.01 sec
decoder speed: 2272.27 kbits/s
Spiral SSE 8-way (using SSE2NEON)
Execution time for 100000 2054-bit frames: 40.30 sec
decoder speed: 5082.01 kbits/s
nrsc5-NEON
Execution time for 100000 2054-bit frames: 36.25 sec
decoder speed: 5649.34 kbits/s
Getting almost 150% improvement over the Spiral C version with nrsc5-NEON is excellent!
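As a quick check of that figure, the relative improvement can be computed directly from the decoder speeds reported above. This is only a sketch; the helper name is made up for illustration.

```python
def improvement_percent(baseline_kbits, new_kbits):
    """Relative throughput gain of new over baseline, in percent."""
    return (new_kbits / baseline_kbits - 1.0) * 100.0


# Spiral C (2272.27 kbits/s) versus nrsc5-NEON (5649.34 kbits/s)
print(round(improvement_percent(2272.27, 5649.34), 1))   # 148.6
```

The same helper applied to the Spiral SSE 8-way figure (5082.01 kbits/s) gives the much smaller gain of nrsc5-NEON over the SSE2NEON build.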
The Spiral SSE 16-way decoder is really fast on my computer (running under Oracle VirtualBox), but on a bare-metal Xeon CPU it is "astronomical"!
Spiral SSE 16-way (on Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz)
Execution time for 10000 2054-bit frames: 0.28 sec
decoder speed: 73142.9 kbits/s
Spiral SSE 16-way (on Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz)
Execution time for 10000 2054-bit frames: 0.61 sec
decoder speed: 33684.2 kbits/s
I tried to use the SSE 8-way version (https://github.com/athoik/dab-cmdline/commits/test-sse82neon), but it doesn't work with real data (bus error).
I still haven't figured out how to use the nrsc5 version in dab-cmdline.
@awesie, @JvanKatwijk, it's great to have two great projects inspiring each other.
Hi,
#41 already implements a decoder, but I think we are not getting the maximum performance.
I have already located two projects that use a Viterbi decoder implemented with NEON.
The most promising project is https://github.com/theori-io/nrsc5; I would call it the "brother project" of https://github.com/JvanKatwijk/dab-cmdline for our friends in the US.
The Viterbi decoder is available here: https://github.com/theori-io/nrsc5/blob/master/src/conv_neon.h and here: https://github.com/theori-io/nrsc5/blob/master/src/conv_dec.c
SSE (https://github.com/theori-io/nrsc5/blob/master/src/conv_sse.h) and generic (https://github.com/theori-io/nrsc5/blob/master/src/conv_dec.c) versions also exist.
The code seems very nice and most probably we need something like the following:
It would be great if @awesie could help us integrate it, or provide a test program like the one spiral.net provides, so that we can test performance.
The second project is srsLTE (https://github.com/srsLTE/srsLTE). They also have generic, SSE and NEON versions of the Viterbi decoder (https://github.com/srsLTE/srsLTE/blob/master/lib/src/phy/fec/viterbi37_neon.c), as well as a test program (https://github.com/srsLTE/srsLTE/blob/master/lib/src/phy/fec/test/viterbi_test.c).
It would be great to use the Viterbi decoder from one of the above projects if its implementation performs better than the SSE2NEON version already available in #41.
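The kind of comparison being asked for could be sketched as below: run each candidate decoder on the same frame, check that the decoders agree, and report throughput. This is a hypothetical harness, not the spiral.net test program or anything from nrsc5 or srsLTE; `compare_decoders` and its parameters are made-up names, and each decoder is assumed to be a callable that takes one frame of symbols and returns the decoded bits.

```python
import time


def compare_decoders(decoders, frame, nframes, data_bits):
    """Time each decoder on the same frame; return {name: kbits/s}.

    decoders is a dict of name -> callable(frame) returning decoded bits.
    Raises ValueError if two decoders disagree on the decoded output.
    """
    results = {}
    reference = None
    for name, decode in decoders.items():
        output = list(decode(frame))
        if reference is None:
            reference = output
        elif output != reference:
            raise ValueError(f"{name} disagrees with the first decoder")
        start = time.perf_counter()
        for _ in range(nframes):
            decode(frame)
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid divide by zero
        results[name] = nframes * data_bits / elapsed / 1000.0
    return results


# Stand-in "decoders" that just copy the first 8 symbols, to show the call shape.
fake = {"generic": lambda f: f[:8], "simd": lambda f: list(f[:8])}
for name, speed in compare_decoders(fake, [0, 1] * 16, 1000, 8).items():
    print(f"{name}: {speed:.1f} kbits/s")
```

In a real test the callables would wrap the C implementations (for example via a small extension module or a command-line test binary), and the frame would contain encoded, noisy symbols instead of a fixed pattern.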
Thanks.