drowe67 / LPCNet

Experimental Neural Net speech coding for FreeDV
BSD 3-Clause "New" or "Revised" License

add SSE support #25

Closed jg1uaa closed 4 years ago

jg1uaa commented 4 years ago

see issue #24

drowe67 commented 4 years ago

@jg1uaa thanks, I will take a look.

@hobbes1069 is there a way to force SSE or AVX when cmake is run? I have a machine that supports both and would like to try compiling either/or to compare.

jg1uaa commented 4 years ago

I build LPCNet on OpenBSD, where the automatic CPU detection does not work, so I use DISABLE_CPU_OPTIMIZATION together with SSE, AVX, or AVX2 to specify the SIMD extension, like this:

cmake -DDISABLE_CPU_OPTIMIZATION=ON -DSSE=ON -DCODEC2_BUILD_DIR=/path/to/codec2 /build_directory/ ..

jg1uaa commented 4 years ago

I ran a benchmark on my machine, OpenBSD-6.6/amd64 (clang 8.0.1) with an A10-7860K. The method simply pipes data through lpcnet_enc and lpcnet_dec, without using other modules, like this:

cd /path/to/LPCNet
cd build_directory/src
time cat ../../wav/all.wav | ./lpcnet_enc -s > test.out
time cat test.out | ./lpcnet_dec -s > /dev/null

here is the result:

option      encode  decode
-msse       2.609s  1m17.046s
-msse2      2.552s  1m17.323s
-msse3      2.554s  1m17.117s
-msse4.1    2.550s  42.694s
-msse4.2    2.561s  42.697s
-mavx       2.567s  21.907s
(noSIMD)    2.552s  1m42.005s

all.wav is about 49 seconds long, so I think the encode/decode time should be well under that. I will check your benchmark method and ctest later.

drowe67 commented 4 years ago

@jg1uaa sorry my last comment was incorrect, my mistake!

Thanks - your cmake line worked for me, and the SIMD ctest works :1st_place_medal:

ctest -V -R SIMD_functions
5: sgemv_accum16.....................: pass
5: sparse_sgemv_accum16..............: pass
1/1 Test #5: SIMD_functions ...................   Passed    0.03 sec

Here are some comparative timings on my Lenovo X230 (i5-3320M CPU @ 2.60GHz). all.wav is a 50 second file.

$ cd LPCNet/build_linux/src
$ time sox ../../wav/all.wav -t raw -r 16000 - | ./dump_data --c2pitch --test - - | ./test_lpcnet - /dev/null
SIMD   Time (s)   % real time
None   49.216     100
AVX    15.354      38
SSE    26.608      53

I think the SSE results are pretty good :smile:

drowe67 commented 4 years ago

> I ran a benchmark on my machine, OpenBSD-6.6/amd64 (clang 8.0.1) with an A10-7860K. The method simply pipes data through lpcnet_enc and lpcnet_dec, without using other modules.

OK, your results are quite a bit slower than on my machine, which is curious. I will try a few more machines I have over the next few days.

Idea: we could also make .travis.yml build both AVX and SSE versions and run the all.wav test on each, printing the execution time for comparison.

jg1uaa commented 4 years ago

Hi, here are the results of the benchmark using your method.

SIMD     Time (s)   % real time
None     101.988    205
SSE       76.672    154
SSE2      76.774    154
SSE3      76.549    154
SSE4.1    42.371     85
AVX       21.449     43

SIMD     Time (s)   % real time
None      72.498    146
SSE       50.598    102
SSE2      50.767    102
SSE3      50.873    102
SSE4.1    40.661     82
AVX       21.196     43

gcc looks slightly faster than clang, but there is not a big difference. Both the A8-7670K and A10-7860K have the Steamroller CPU core, a member of the Bulldozer (AMD FX) family. These cores support AVX/SSE, but their performance is lower than Intel's CPUs.

hobbes1069 commented 4 years ago

For posterity (and because David and I had a discussion about using brute force without optimizations), here are the results:

SIMD      Time (s)   % real time   % improvement
None      19.796     39.8%          0.0%
SSE 4.1   17.971     36.1%          9.2%
AVX       10.185     20.5%         48.6%
AVX2       9.459     19.0%         52.2%

Since the binary is single-threaded it probably didn't make a difference, but I'm also running Folding@Home on my RX 580 and have wsjtx running in the background, which hits one core really hard every 15 seconds.

drowe67 commented 4 years ago

@hobbes1069 wow that's a fast machine! I stand corrected - you could indeed run 2020 without acceleration on your machine.

I'm inclined to set the bar at about 50% loading, as I know it works for me. It's also clear there are wide variations across machines.

One of the issues here is support. When I released 700D I had to deal with some really ancient machines (e.g. Windows XP) that appear to be common in the Ham community. I'd like to use automation to avoid "it doesn't work" support issues caused by someone's machine being too slow.

However, since releasing 2020 I've become aware of a class of "modern, but without AVX" machines which might just manage to run it with SSE (or indeed with no acceleration, like yours, Richard).

A couple of tasks for future PRs:

  1. How to build for AVX and SSE at the same time, plus a way to choose which acceleration technology to use at run time.
  2. How to determine whether 2020 can run on a given machine. Perhaps a small automated speed test, as a LPCNet_freedv API function (a rough sketch of the run-time CPU check follows below).
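As a starting point for the run-time side of these tasks, here is a minimal sketch using the GCC/Clang `__builtin_cpu_supports()` builtin; the `lpcnet_pick_simd()` name is hypothetical and not part of any existing LPCNet_freedv API:

```c
/* Hedged sketch only: run-time CPU feature detection so a single binary
 * could select an AVX, SSE, or plain-C code path. */
#include <stdio.h>

static const char *lpcnet_pick_simd(void)
{
    __builtin_cpu_init();                      /* populate the CPU feature flags */
    if (__builtin_cpu_supports("avx"))    return "AVX";
    if (__builtin_cpu_supports("sse4.1")) return "SSE4.1";
    if (__builtin_cpu_supports("sse"))    return "SSE";
    return "none";
}

int main(void)
{
    printf("selected SIMD path: %s\n", lpcnet_pick_simd());
    return 0;
}
```

A small timed loop over a few frames of synthesis could be layered on top of this to answer the "is this machine fast enough for 2020" question.
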
kkofler commented 4 years ago

The thing is, SSE4.1 is still too much to assume for GNU/Linux distribution packaging; the baseline for x86_64 is SSE2. (Most distributions also assume SSE2 for 32-bit x86 these days, if they support it at all.)

John Reiser suggests using multi-threading in addition to the vectorization. That might make SSE2 actually work on machines where it matters. (SSE2 or even unvectorized plain C being fast enough on Ryzen 5 CPUs that also support AVX2 is unfortunately mostly academic.)
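To illustrate the multi-threading idea (a hedged sketch only, not LPCNet's actual kernels, and whether the decoder's data dependencies let this help in practice is untested): the rows of a matrix-vector product can be split across cores with OpenMP while the compiler vectorizes the inner loop at whatever SIMD baseline the package targets, e.g. built with `gcc -O3 -msse2 -fopenmp`:

```c
/* Sketch only: row-parallel matrix-vector product. OpenMP spreads the rows
 * across threads; the inner loop is left to the compiler's vectorizer. */
static void sgemv_rows(float *out, const float *weights, const float *x,
                       int rows, int cols)
{
    #pragma omp parallel for
    for (int i = 0; i < rows; i++) {
        float acc = 0.0f;
        for (int j = 0; j < cols; j++)
            acc += weights[i * cols + j] * x[j];
        out[i] = acc;
    }
}
```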

hobbes1069 commented 4 years ago

Currently FreeDV (the only intended consumer of lpcnet) disables the 2020 mode if AVX is not detected. What we may need to do is pop up a warning and allow the end user to override and enable 2020 anyway.

That at least solves the question of whether we should provide distro packages that don't have optimizations. I think the simple solution in the short term is to build both non-optimized and AVX packages (build twice) and provide separate packages:

lpcnetfreedv lpcnetfreedv-avx

The second of these would both virtually provide and conflict with "lpcnetfreedv".

Not ideal at all, but as this is an "experimental" mode the intended audience should be able to deal with it easily.

jg1uaa commented 4 years ago

@kkofler SSE4.1 is mandatory for performance.

On i386 we can use only 8 SSE registers, but x86_64 provides 16. This difference matters a lot for the performance of applications such as LPCNet, so I think LPCNet-SSE should only be supported on x86_64 processors with SSE4 support. At least, all Core i-series processors have 64-bit mode and the SSE4.1 instruction set, so that is not a problem.

I tested SSE2/3 on an ancient Athlon64 3000+ processor; its benchmark results were very poor.

SIMD        32-bit (s)   64-bit (s)
(no SIMD)   322.437      147.914
-msse       327.245      147.977
-msse2      331.520      149.590
-msse3      N/A          150.772
-msse4.1    N/A          N/A
-mavx       N/A          N/A

I don't know why the SSE3-optimized code runs on a processor that does not support it (perhaps, luckily, no SSE3 instructions were actually emitted?).

drowe67 commented 4 years ago

FYI @hobbes1069 and I are brainstorming this one on https://github.com/drowe67/LPCNet/issues/27

kkofler commented 4 years ago

It looks like the difference between all the -msse* flags there is compiler vectorization only. The code you submitted uses only SSE1 intrinsics, so there is no reason why -msse4.1 would by itself make it any faster. SSE4.1 intrinsics are in smmintrin.h, not xmmintrin.h.

And typically, just using -msse2 is not enough to get vectorized code from the compiler; you need to actually turn on compiler vectorization, though some of it is now automatically enabled by GCC at -O3, which I guess you are using. Try adding -ftree-vectorize. But to get really effective SSE2 code, try actually using SSE2 intrinsics from emmintrin.h instead of the SSE1 ones from xmmintrin.h. And the compiler vectorization might actually work better on the plain C code than on your code using SSE1 intrinsics, so if you do not want to try writing actual SSE2 intrinsics, please at least try that.

This also explains why -msse3 can end up using no SSE3 instructions, if the compiler finds nothing to vectorize with SSE3. If you actually use SSE3 intrinsics from pmmintrin.h, you will definitely get SSE3 code.
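For example (an illustration only, not code from this PR), an actual SSE4.1 intrinsic such as `_mm_dp_ps` lives in smmintrin.h and only builds with -msse4.1 or later:

```c
#include <smmintrin.h>   /* SSE4.1 intrinsics (pulls in the older SSE headers too) */

/* Dot product of two 4-float vectors using the SSE4.1 DPPS instruction.
 * 0xF1 = multiply all four lanes, place the sum in the lowest output lane. */
static float dot4_sse41(const float *a, const float *b)
{
    __m128 d = _mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0xF1);
    return _mm_cvtss_f32(d);
}
```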

kkofler commented 4 years ago

Likewise, there are also separate preprocessor macros such as __SSE2__; __SSE__ only indicates SSE1 availability (but that is all your current code actually requires; only the CMakeLists.txt asks for SSE4.1).

I have no authority over this project, but if I did, I would have rejected this pull request as is, for a simple reason: If you want to target SSE4.1 machines, you should use smmintrin.h intrinsics and check __SSE4_1__. If you submit SSE 1 code, it should be built on any CPU supporting SSE 1 (even if it is very likely to be too slow – the plain C code will be even slower).

And I urge you to try emmintrin.h SSE2 intrinsics, too.
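A hedged sketch of what such guards can look like (the layout is illustrative, not the existing vec*.h structure):

```c
/* Pick an intrinsics header based on the macros the compiler predefines
 * for the instruction set that is actually enabled. */
#if defined(__AVX__)
  #include <immintrin.h>   /* AVX (and everything below)          */
#elif defined(__SSE4_1__)
  #include <smmintrin.h>   /* SSE4.1                              */
#elif defined(__SSE2__)
  #include <emmintrin.h>   /* SSE2                                */
#elif defined(__SSE__)
  #include <xmmintrin.h>   /* SSE1 - all the submitted code needs */
#else
  /* plain C fallback */
#endif
```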

kkofler commented 4 years ago

Looking closer at your vec_sse.h (and the older vec*.h headers), I see that it only really does one type of vector operation: single-precision float multiplications and additions, and that newer SSEx intrinsics probably won't help with those. As far as I can tell, SSE 1 already does these operations with 128-bit vectors of float, and the next higher is 256-bit vectors of float, added by AVX 1. SSE2 adds double and integer versions, which are not needed here, as far as I can tell.

What I don't see either, though, is why SSE4.1 helps. I suspect that GCC may be automatically vectorizing some code elsewhere not currently covered by vec*.h at all.
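For reference, the operation under discussion looks roughly like this (a simplified sketch, not a copy of vec_sse.h): SSE1 already handles 4-wide float multiply-accumulate, and AVX mainly widens the same pattern to 8 floats per step (compile with -mavx for the second function):

```c
#include <immintrin.h>   /* SSE and AVX float intrinsics */

/* 4 floats per iteration with SSE1 intrinsics (assumes len % 4 == 0) */
static void mac_sse(float *out, const float *w, const float *x, int len)
{
    for (int i = 0; i < len; i += 4) {
        __m128 acc = _mm_loadu_ps(&out[i]);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(&w[i]), _mm_loadu_ps(&x[i])));
        _mm_storeu_ps(&out[i], acc);
    }
}

/* 8 floats per iteration with AVX intrinsics (assumes len % 8 == 0) */
static void mac_avx(float *out, const float *w, const float *x, int len)
{
    for (int i = 0; i < len; i += 8) {
        __m256 acc = _mm256_loadu_ps(&out[i]);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(&w[i]), _mm256_loadu_ps(&x[i])));
        _mm256_storeu_ps(&out[i], acc);
    }
}
```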

hobbes1069 commented 4 years ago

Thanks for digging into this, Kevin. So it sounds like we should be targeting AVX? I noted that on my system AVX2 showed very little improvement over AVX.

jg1uaa commented 4 years ago

We know that including <xmmintrin.h> says "I will write SSE1 intrinsics in my code". Using another intrinsics header file does not automatically replace the code with newer SSE instructions; what matters is what the coder actually wrote.

I used <xmmintrin.h> in vec_sse.h because the code uses only SSE1 instructions. If another SSE version is needed in this code, replace it with the suitable header, or simply include a header that covers all x86 intrinsics (MMX/SSE/AVX and future extensions), such as <x86intrin.h>.

But the other portions (outside vec_sse.h) are controlled by the -msseX flags. The object built with the SSE4.1 flag performs very well, so maybe the compiler finds useful SSE4.1 instructions there.

I took benchmarks with some combinations of compiler flags. It is said that -ftree-vectorize is enabled by default as part of -O3, so there is no need to discuss that, but the -ffast-math results are interesting.

I thought 32-bit SSE (all versions) and 64-bit SSE1~3 had poor performance and were not usable, but this option makes them usable. Of course we have to be careful about the side effects of -ffast-math, but it is worth considering.

flags | A10/64bit (s) | A8/64bit (s) | Ath3000/64bit (s) | A8/32bit (s) | Ath3000/32bit (s)
-O3 | 76.188 | 50.025 | 149.077 | 314.338 | 382.752
-O3 -msse | 76.073 | 49.667 | 149.702 | 223.131 | 322.325
-O3 -msse2 | 76.135 | 49.665 | 150.885 | 223.834 | 322.434
-O3 -msse3 | 76.152 | 49.732 | 151.59 | 216.926 | N/A
-O3 -msse4.1 | 42.03 | 39.125 | N/A | 217.141 | N/A
-O3 -mavx | 21.015 | 21.014 | N/A | 21.288 | N/A
-O3 -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A
-O3 -ftree-vectorize | 76.362 | 49.869 | 157.634 | 311.65 | 382.083
-O3 -ftree-vectorize -msse | 76.321 | 49.69 | 151.02 | 223.675 | 336.876
-O3 -ftree-vectorize -msse2 | 76.3 | 49.82 | 153.214 | 222.964 | 326.009
-O3 -ftree-vectorize -msse3 | 76.224 | 49.976 | 151.591 | 216.132 | N/A
-O3 -ftree-vectorize -msse4.1 | 42.006 | 39.314 | N/A | 215.871 | N/A
-O3 -ftree-vectorize -mavx | 20.873 | 21.246 | N/A | 21.192 | N/A
-O3 -ftree-vectorize -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A
-O3 -fno-tree-vectorize | 81.821 | 52.993 | 155.804 | 294.387 | 390.964
-O3 -fno-tree-vectorize -msse | 81.757 | 53.096 | 160.589 | 227.272 | 325.384
-O3 -fno-tree-vectorize -msse2 | 81.726 | 53.115 | 154.341 | 227.661 | 325.888
-O3 -fno-tree-vectorize -msse3 | 81.847 | 53.173 | 157.139 | 221.434 | N/A
-O3 -fno-tree-vectorize -msse4.1 | 44.325 | 42.566 | N/A | 221.352 | N/A
-O3 -fno-tree-vectorize -mavx | 22.751 | 24.297 | N/A | 25.116 | N/A
-O3 -fno-tree-vectorize -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A
-O3 -DFLOAT_APPROX -ffast-math | 71.648 | 39.298 | 133.345 | 98.834 | 184.772
-O3 -DFLOAT_APPROX -ffast-math -msse | 71.752 | 39.466 | 127.423 | 44.555 | 142.49
-O3 -DFLOAT_APPROX -ffast-math -msse2 | 71.801 | 39.219 | 125.526 | 38.455 | 122.256
-O3 -DFLOAT_APPROX -ffast-math -msse3 | 71.569 | 39.528 | N/A | 38.666 | N/A
-O3 -DFLOAT_APPROX -ffast-math -msse4.1 | 31.662 | 39.328 | N/A | 38.518 | N/A
-O3 -DFLOAT_APPROX -ffast-math -mavx | 19.763 | 20.825 | N/A | 20.416 | N/A
-O3 -DFLOAT_APPROX -ffast-math -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize | 76.757 | 42.637 | 130.22 | 99.138 | 188.404
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -msse | 76.588 | 42.729 | 160.975 | 49.015 | 145.562
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -msse2 | 76.703 | 42.851 | 132.309 | 42.176 | 127.573
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -msse3 | 76.776 | 42.573 | 130.951 | 42.18 | 132.809
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -msse4.1 | 34.382 | 43.402 | N/A | 41.826 | N/A
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -mavx | 21.751 | 23.563 | N/A | 23.206 | N/A
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A
-Ofast -DFLOAT_APPROX | 68.393 | 39.127 | 126.724 | 98.815 | 188.336
-Ofast -DFLOAT_APPROX -msse | 68.206 | 39.334 | 132.759 | 44.71 | 138.427
-Ofast -DFLOAT_APPROX -msse2 | 68.135 | 39.099 | 126.228 | 38.914 | 123.181
-Ofast -DFLOAT_APPROX -msse3 | 67.567 | 39.546 | N/A | 39.035 | N/A
-Ofast -DFLOAT_APPROX -msse4.1 | 28.1 | 39.23 | N/A | 38.701 | N/A
-Ofast -DFLOAT_APPROX -mavx | 16.164 | 21.136 | N/A | 20.317 | N/A
-Ofast -DFLOAT_APPROX -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A

I think the candidate compiler flag combinations are:

safe:
  64-bit AVX     (-O3 -mavx)
  64-bit SSE4.1  (-O3 -msse4.1)
  32-bit AVX     (-O3 -mavx)

experimental:
  64-bit SSE2    (-O3 -DFLOAT_APPROX -ffast-math -msse2)
  32-bit SSE     (-O3 -DFLOAT_APPROX -ffast-math -msse)

If there is no problem using -ffast-math, we can simplify them to:

AVX mode:
  64-bit AVX     (-O3 -mavx)
  32-bit AVX     (-O3 -mavx)

SSE mode:
  64-bit SSE2    (-O3 -DFLOAT_APPROX -ffast-math -msse2)
  32-bit SSE     (-O3 -DFLOAT_APPROX -ffast-math -msse)

kkofler commented 4 years ago

> We know that including <xmmintrin.h> says "I will write SSE1 intrinsics in my code". Using another intrinsics header file does not automatically replace the code with newer SSE instructions.

Of course. Including a header file does not do anything by itself. You need to actually use the functions declared in the header file to make use of newer SSE instruction sets.

But as I had already written, to be honest, I do not see what functions one would use to further vectorize the operations on 128-bit vectors of single-precision float with less than AVX 1 (which is what introduced 256-bit vectors of float (and other 256-bit vectors)). As I had already stated, I think that the compiler must be finding places to vectorize with SSE4.1 that are not covered by vec*.h.

drowe67 commented 4 years ago

@jg1uaa that is some very comprehensive testing :+1:

To use SSE I'd like to have a few other features, like a way to test the end user's CPU at run time. I have a task list here. If you are interested in working on any of these tasks, please let me know :smile: