JvanKatwijk / dab-cmdline

DAB decoding library with example of its use
GNU General Public License v2.0

Optimized Viterbi decoder using NEON instructions #42

Closed athoik closed 6 years ago

athoik commented 6 years ago

Hi,

#41 already implements a decoder, but I don't think we are getting the maximum performance.

I have already located two projects that use a Viterbi decoder implemented with NEON.

The most promising one is https://github.com/theori-io/nrsc5, which I would call the "brother project" of https://github.com/JvanKatwijk/dab-cmdline for our friends in the US.

The viterbi decoder is available here: https://github.com/theori-io/nrsc5/blob/master/src/conv_neon.h and here: https://github.com/theori-io/nrsc5/blob/master/src/conv_dec.c

SSE (https://github.com/theori-io/nrsc5/blob/master/src/conv_sse.h) and generic (https://github.com/theori-io/nrsc5/blob/master/src/conv_dec.c) versions also exist.

The code looks very nice, and we most probably need something like the following:

int conv_decode_p4(const int8_t *in, uint8_t *out)
{
    const struct lte_conv_code code = {
        .n = 4,
        .k = 7,
        .len = 2048,
        .gen = { 109, 73, 83, 109 },
        .term = CONV_TERM_TAIL_BITING,
    };
    int rc;

    struct vdecoder *vdec = alloc_vdec(&code);
    if (!vdec)
        return -EFAULT;

    reset_decoder(vdec, code.term);

    /* Propagate through the trellis with interval normalization */
    _conv_decode(vdec, in, code.len);

    if (code.term == CONV_TERM_TAIL_BITING)
        _conv_decode(vdec, in, code.len);

    rc = traceback(vdec, out, code.term, code.len);

    free_vdec(vdec);
    return rc;
}

It would be great if @awesie can help us integrate or provide a test program like spiral.net provides, in order to test performance.

The second project is srsLTE (https://github.com/srsLTE/srsLTE). They also have generic, SSE and NEON versions of Viterbi (https://github.com/srsLTE/srsLTE/blob/master/lib/src/phy/fec/viterbi37_neon.c), as well as a test program (https://github.com/srsLTE/srsLTE/blob/master/lib/src/phy/fec/test/viterbi_test.c).

It would be great to use the Viterbi decoder from one of the above projects if it performs better than the SSE2NEON approach already available in #41.

Thanks.

awesie commented 6 years ago

I was intrigued so I spent some time trying to get code from the spiral website and my NEON version to work together.

Some quick performance results:

rate = 1/3, K = 7, frame size = 2048

nrsc5-NEON
Execution time for 10000 2054-bit frames: 4.96 sec
decoder speed: 4129.03 kbits/s

Spiral C
Execution time for 10000 2054-bit frames: 12.29 sec
decoder speed: 1666.4 kbits/s

Spiral SSE 4-way (using SSE2NEON)
Execution time for 10000 2054-bit frames: 9.73 sec
decoder speed: 2104.83 kbits/s

Spiral SSE 8-way (using SSE2NEON)
Execution time for 10000 2054-bit frames: 5.16 sec
decoder speed: 3968.99 kbits/s

rate = 1/4, K = 7, frame size = 2048

nrsc5-NEON
Execution time for 100000 2054-bit frames: 45.71 sec
decoder speed: 4480.42 kbits/s

Spiral SSE 8-way (using SSE2NEON)
Execution time for 100000 2054-bit frames: 55.93 sec
decoder speed: 3661.72 kbits/s

tl;dr If you are currently using the SSE 8-way from Spiral for NEON, then you may get a minor performance improvement. Obviously, parameters and such may affect the benefit.

My test code is attached: test-nrsc5-viterbi.zip (https://github.com/JvanKatwijk/dab-cmdline/files/1755395/test-nrsc5-viterbi.zip). I also tested on x86-64; in that case the Spiral code is significantly faster than what is currently in nrsc5.
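
For readers who want to reproduce this kind of measurement without the attachment, here is a minimal timing sketch. It is not the code from test-nrsc5-viterbi.zip: `decode_fn`, `time_decoder` and `kbits_per_sec` are hypothetical names, and the decoder signature is only modelled on the `conv_decode_p4()` proposal quoted earlier in the thread. The reported speeds match frames × 2048 / time, so the 6 extra bits in the "2054-bit frames" label (presumably the K-1 tail) do not appear to be counted.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical decoder signature, modelled on conv_decode_p4() from this thread. */
typedef int (*decode_fn)(const int8_t *in, uint8_t *out);

/* Throughput in kbit/s for `frames` frames of `bits_per_frame` payload bits. */
double kbits_per_sec(long frames, long bits_per_frame, double seconds)
{
    assert(seconds > 0.0);
    return (double)frames * (double)bits_per_frame / seconds / 1000.0;
}

/* Decode the same input `frames` times and return the elapsed CPU time in seconds. */
double time_decoder(decode_fn decode, const int8_t *in, uint8_t *out, long frames)
{
    clock_t start = clock();
    for (long i = 0; i < frames; i++)
        decode(in, out);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Feeding the elapsed time from `time_decoder` into `kbits_per_sec(frames, 2048, elapsed)` gives a figure comparable to the ones listed above. `clock()` measures CPU time, which is a reasonable choice for a single-threaded decoder loop.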

JvanKatwijk commented 6 years ago

Just to inform.

In DAB, each DAB frame (approximately 10 per second) contains 4 FICs of 3072 elements each. The data content, usually one segment per DAB frame, also has to pass through the convolutional decoding. Different audio segments have different sizes; most are smaller than 3072 elements, some are longer. DAB uses a rate 1/4, K = 7 code.
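
To put these numbers next to the benchmark figures, the FIC load can be estimated with a few lines of arithmetic. This is only a rough sketch under assumptions: it takes "elements" to mean soft input values to the Viterbi decoder, ignores tail bits and puncturing, and leaves out the audio/data segments entirely. The function names are made up for illustration.

```c
#include <assert.h>

/* Soft input values per second that the FIC alone feeds to the decoder
   (assumption: one "element" is one soft input value). */
long fic_elements_per_sec(long frames_per_sec, long fics_per_frame, long elements_per_fic)
{
    assert(frames_per_sec > 0 && fics_per_frame > 0 && elements_per_fic > 0);
    return frames_per_sec * fics_per_frame * elements_per_fic;
}

/* Decoded bits per second for a rate 1/n mother code (tail bits ignored). */
long decoded_bits_per_sec(long elements_per_sec, long n)
{
    assert(n > 0);
    return elements_per_sec / n;
}
```

With 10 frames per second, 4 FICs per frame and 3072 elements per FIC, this gives 122880 elements per second, or about 30720 decoded bits per second at rate 1/4.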



athoik commented 6 years ago

Dear @awesie,

Thanks for your sample!

Indeed your version of viterbi seems to produce better results!

Spiral C
Execution time for 10000 2054-bit frames: 9.01 sec
decoder speed: 2272.27 kbits/s

Spiral SSE 8-way (using SSE2NEON)
Execution time for 100000 2054-bit frames: 40.30 sec
decoder speed: 5082.01 kbits/s

nrsc5-NEON
Execution time for 100000 2054-bit frames: 36.25 sec
decoder speed: 5649.34 kbits/s

Getting almost a 150% improvement over the C version with nrsc5 is excellent!

The speed of Spiral SSE 16-way on my computer (running under Oracle VirtualBox) is really high, but on a bare-metal Xeon CPU it is "astronomical"!
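
As a quick check of that percentage, the relative improvement can be computed from the decoder speeds listed above. `improvement_percent` is a hypothetical helper name used only for this illustration.

```c
#include <assert.h>

/* Relative improvement of `fast_kbits` over `base_kbits`, in percent. */
double improvement_percent(double base_kbits, double fast_kbits)
{
    assert(base_kbits > 0.0);
    return (fast_kbits / base_kbits - 1.0) * 100.0;
}
```

For the Spiral C speed of 2272.27 kbits/s and the nrsc5-NEON speed of 5649.34 kbits/s, this comes out at roughly 148.6%; the Spiral SSE 8-way figure of 5082.01 kbits/s gives roughly 123.7%.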

Spiral SSE 16-way (on Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz)
Execution time for 10000 2054-bit frames: 0.28 sec
decoder speed: 73142.9 kbits/s

Spiral SSE 16-way (on Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz)
Execution time for 10000 2054-bit frames: 0.61 sec
decoder speed: 33684.2 kbits/s

I tried to use the SSE8 (https://github.com/athoik/dab-cmdline/commits/test-sse82neon) but it doesn't work with real data (bus error).
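
The thread does not diagnose the bus error. One common cause of bus errors with SIMD code on ARM is unaligned memory access, so, purely as an assumption, it may be worth checking that the buffers handed to the decoder are 16-byte aligned. The sketch below uses C11 `aligned_alloc`, which is not available on every toolchain; `alloc_aligned_metrics` and `is_aligned16` are hypothetical helper names, not part of dab-cmdline or Spiral.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate `count` int16_t values on a 16-byte boundary.
   Assumption: misalignment is the cause of the bus error; this is a guess.
   The caller releases the buffer with free(). */
int16_t *alloc_aligned_metrics(size_t count)
{
    assert(count > 0);
    size_t bytes = count * sizeof(int16_t);
    /* aligned_alloc expects the size to be a multiple of the alignment, so round up. */
    bytes = (bytes + 15u) & ~(size_t)15u;
    return aligned_alloc(16, bytes);
}

/* Returns non-zero when `p` is 16-byte aligned. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Calling `is_aligned16` on the existing input and metric buffers right before the decoder call would show quickly whether alignment is involved at all.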

I still haven't figured out how to use the nrsc5 version in dab-cmdline.

@awesie, @JvanKatwijk, it's great to have two great projects inspiring each other.