drowe67 / LPCNet

Experimental Neural Net speech coding for FreeDV
BSD 3-Clause "New" or "Revised" License

add SSE support #25

Closed jg1uaa closed 4 years ago

jg1uaa commented 4 years ago

see issue #24

drowe67 commented 4 years ago

@jg1uaa thanks, I will take a look.

@hobbes1069 is there a way to force SSE or AVX when cmake is run? I have a machine that supports both and would like to try compiling either/or to compare.

jg1uaa commented 4 years ago

I build LPCNet on OpenBSD, where the automatic CPU detection does not work, so I use DISABLE_CPU_OPTIMIZATION together with SSE, AVX, or AVX2 to specify the SIMD extension, like this:

cmake -DDISABLE_CPU_OPTIMIZATION=ON -DSSE=ON -DCODEC2_BUILD_DIR=/path/to/codec2 /build_directory/ ..

jg1uaa commented 4 years ago

I ran a benchmark on my machine, OpenBSD-6.6/amd64 (clang 8.0.1) with an A10-7860K. The method simply pipes data through lpcnet_enc and lpcnet_dec, without using other modules, like this:

cd /path/to/LPCNet
cd build_directory/src
time cat ../../wav/all.wav | ./lpcnet_enc -s > test.out
time cat test.out | ./lpcnet_dec -s > /dev/null

here is the result:

option      encode  decode
-msse       2.609s  1m17.046s
-msse2      2.552s  1m17.323s
-msse3      2.554s  1m17.117s
-msse4.1    2.550s  42.694s
-msse4.2    2.561s  42.697s
-mavx       2.567s  21.907s
(noSIMD)    2.552s  1m42.005s

all.wav is about 49 seconds long, so I think the encode/decode time should be well under that. I will check your benchmark method and ctest later.

drowe67 commented 4 years ago

@jg1uaa sorry my last comment was incorrect, my mistake!

Thanks - your cmake line worked for me, and the SIMD ctest works :1st_place_medal:

ctest -V -R SIMD_functions
5: sgemv_accum16.....................: pass
5: sparse_sgemv_accum16..............: pass
1/1 Test #5: SIMD_functions ...................   Passed    0.03 sec

Here are some comparative timings on my Lenovo X230 (i5-3320M CPU @ 2.60GHz). all.wav is a 50 second file.

$ cd LPCNet/build_linux/src
$ time sox ../../wav/all.wav -t raw -r 16000 - | ./dump_data --c2pitch --test - - | ./test_lpcnet - /dev/null
SIMD   Time (s)   % real time
None   49.216     100
AVX    15.354      38
SSE    26.608      53

I think the SSE results are pretty good :smile:

drowe67 commented 4 years ago

> I ran a benchmark on my machine, OpenBSD-6.6/amd64 (clang 8.0.1) with an A10-7860K. The method simply pipes data through lpcnet_enc and lpcnet_dec, without using other modules.

OK, your results are quite a bit slower than on my machine, which is curious. I will try a few more machines I have over the next few days.

Idea: we could also make .travis.yml build both AVX and SSE versions and run the all.wav test on each, printing the execution time for comparison.

jg1uaa commented 4 years ago

Hi, here are the results of the benchmark using your method.

SIMD     Time (s)   % real time
None     101.988    205
SSE       76.672    154
SSE2      76.774    154
SSE3      76.549    154
SSE4.1    42.371     85
AVX       21.449     43

SIMD     Time (s)   % real time
None      72.498    146
SSE       50.598    102
SSE2      50.767    102
SSE3      50.873    102
SSE4.1    40.661     82
AVX       21.196     43

gcc looks slightly faster than clang, but there is not a big difference. Both the A8-7670K and A10-7860K have the Steamroller CPU core, a member of the Bulldozer (AMD FX) family. These cores support AVX/SSE, but their performance is lower than Intel's CPUs.

hobbes1069 commented 4 years ago

For posterity (and because David and I had a discussion about using brute force without optimizations), here are the results:

SIMD      Time (s)   % real time   % improvement
None      19.796     39.8%          0.0%
SSE 4.1   17.971     36.1%          9.2%
AVX       10.185     20.5%         48.6%
AVX2       9.459     19.0%         52.2%

Since the binary is single-threaded it probably didn't make a difference, but I'm also running Folding@Home on my RX 580 and have wsjtx running in the background, which hits one core really hard every 15 seconds.

drowe67 commented 4 years ago

@hobbes1069 wow that's a fast machine! I stand corrected - you could indeed run 2020 without acceleration on your machine.

I'm inclined to set the bar at about 50% loading, as I know it works for me. It's also clear there are wide variations across machines.

One of the issues here is support. When I released 700D I had to deal with some really ancient machines (e.g. Windows XP) that appear to be common in the Ham community. I'd like to use automation to avoid "it doesn't work" support issues caused by someone's machine being too slow.

However, since releasing 2020 I've become aware of a class of "modern, but without AVX" machines which might just manage to run it with SSE (or indeed with no acceleration, like yours, Richard).

A couple of tasks for future PRs:

  1. How to build for AVX and SSE at the same time, plus a way to choose which acceleration technology to use at run time.
  2. How to determine whether 2020 can run on a given machine. Perhaps a small automated speed test, as a LPCNet_freedv API function (a rough sketch of the run-time CPU check follows below).
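As a starting point for the run-time side of these tasks, here is a minimal sketch using the GCC/Clang `__builtin_cpu_supports()` builtin; the `lpcnet_pick_simd()` name is hypothetical and not part of any existing LPCNet_freedv API:

```c
/* Hedged sketch only: run-time CPU feature detection so a single binary
 * could select an AVX, SSE, or plain-C code path. */
#include <stdio.h>

static const char *lpcnet_pick_simd(void)
{
    __builtin_cpu_init();                      /* populate the CPU feature flags */
    if (__builtin_cpu_supports("avx"))    return "AVX";
    if (__builtin_cpu_supports("sse4.1")) return "SSE4.1";
    if (__builtin_cpu_supports("sse"))    return "SSE";
    return "none";
}

int main(void)
{
    printf("selected SIMD path: %s\n", lpcnet_pick_simd());
    return 0;
}
```

A small timed loop over a few frames of synthesis could be layered on top of this to answer the "is this machine fast enough for 2020" question.
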
kkofler commented 4 years ago

The thing is, SSE4.1 is still too much to assume for GNU/Linux distribution packaging; the baseline for x86_64 is SSE2. (Most distributions also assume SSE2 for 32-bit x86 these days, if they support it at all.)

John Reiser suggests using multi-threading in addition to the vectorization. That might make SSE2 actually work on machines where it matters. (SSE2 or even unvectorized plain C being fast enough on Ryzen 5 CPUs that also support AVX2 is unfortunately mostly academic.)
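To illustrate the multi-threading idea (a hedged sketch only, not LPCNet's actual kernels, and whether the decoder's data dependencies let this help in practice is untested): the rows of a matrix-vector product can be split across cores with OpenMP while the compiler vectorizes the inner loop at whatever SIMD baseline the package targets, e.g. built with `gcc -O3 -msse2 -fopenmp`:

```c
/* Sketch only: row-parallel matrix-vector product. OpenMP spreads the rows
 * across threads; the inner loop is left to the compiler's vectorizer. */
static void sgemv_rows(float *out, const float *weights, const float *x,
                       int rows, int cols)
{
    #pragma omp parallel for
    for (int i = 0; i < rows; i++) {
        float acc = 0.0f;
        for (int j = 0; j < cols; j++)
            acc += weights[i * cols + j] * x[j];
        out[i] = acc;
    }
}
```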

hobbes1069 commented 4 years ago

Currently FreeDV (the only intended consumer of lpcnet) disables the 2020 mode if AVX is not detected. What we may need to do is pop up a warning and allow the end user to override and enable 2020 anyway.

That at least solves the question of whether we should provide distro packages that don't have optimizations. I think the simple solution in the short term is to build both non-optimized and AVX packages (build twice) and provide separate packages:

lpcnetfreedv lpcnetfreedv-avx

The second of these would both virtually provide and conflict with "lpcnetfreedv".

Not ideal at all, but as this is an "experimental" mode the intended audience should be able to deal with it easily.

jg1uaa commented 4 years ago

@kkofler SSE4.1 is mandatory for performance.

On i386 we can use only 8 SSE registers, but x86_64 provides 16. This difference matters a lot for the performance of applications such as LPCNet, so I think LPCNet-SSE should only be supported on x86_64 processors with SSE4 support. At least, all Core i-series processors have 64-bit mode and the SSE4.1 instruction set, so that is not a problem.

I tested SSE2/3 on an ancient Athlon64 3000+ processor; its benchmark results were very poor.

SIMD        32-bit (s)   64-bit (s)
(no SIMD)   322.437      147.914
-msse       327.245      147.977
-msse2      331.520      149.590
-msse3      N/A          150.772
-msse4.1    N/A          N/A
-mavx       N/A          N/A

I don't know why the SSE3-optimized code runs on a processor that does not support it (perhaps, luckily, no SSE3 instructions were actually emitted?).

drowe67 commented 4 years ago

FYI @hobbes1069 and I are brainstorming this one on https://github.com/drowe67/LPCNet/issues/27

kkofler commented 4 years ago

It looks like the difference between all the -msse* flags there is compiler vectorization only. The code you submitted uses only SSE1 intrinsics, so there is no reason why -msse4.1 would by itself make it any faster. SSE4.1 intrinsics are in smmintrin.h, not xmmintrin.h.

And typically, just using -msse2 is not enough to get vectorized code from the compiler; you need to actually turn on compiler vectorization, though some of it is now automatically enabled by GCC at -O3, which I guess you are using. Try adding -ftree-vectorize. But to get really effective SSE2 code, try actually using SSE2 intrinsics from emmintrin.h instead of the SSE1 ones from xmmintrin.h. And the compiler vectorization might actually work better on the plain C code than on your code using SSE1 intrinsics, so if you do not want to try writing actual SSE2 intrinsics, please at least try that.

This also explains why -msse3 can end up using no SSE3 instructions, if the compiler finds nothing to vectorize with SSE3. If you actually use SSE3 intrinsics from pmmintrin.h, you will definitely get SSE3 code.
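For example (an illustration only, not code from this PR), an actual SSE4.1 intrinsic such as `_mm_dp_ps` lives in smmintrin.h and only builds with -msse4.1 or later:

```c
#include <smmintrin.h>   /* SSE4.1 intrinsics (pulls in the older SSE headers too) */

/* Dot product of two 4-float vectors using the SSE4.1 DPPS instruction.
 * 0xF1 = multiply all four lanes, place the sum in the lowest output lane. */
static float dot4_sse41(const float *a, const float *b)
{
    __m128 d = _mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0xF1);
    return _mm_cvtss_f32(d);
}
```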

kkofler commented 4 years ago

Likewise, there are also separate preprocessor macros such as __SSE2__; __SSE__ only indicates SSE1 availability (but that is all your current code actually requires; only the CMakeLists.txt asks for SSE4.1).

I have no authority over this project, but if I did, I would have rejected this pull request as is, for a simple reason: If you want to target SSE4.1 machines, you should use smmintrin.h intrinsics and check __SSE4_1__. If you submit SSE 1 code, it should be built on any CPU supporting SSE 1 (even if it is very likely to be too slow – the plain C code will be even slower).

And I urge you to try emmintrin.h SSE2 intrinsics, too.
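A hedged sketch of what such guards can look like (the layout is illustrative, not the existing vec*.h structure):

```c
/* Pick an intrinsics header based on the macros the compiler predefines
 * for the instruction set that is actually enabled. */
#if defined(__AVX__)
  #include <immintrin.h>   /* AVX (and everything below)          */
#elif defined(__SSE4_1__)
  #include <smmintrin.h>   /* SSE4.1                              */
#elif defined(__SSE2__)
  #include <emmintrin.h>   /* SSE2                                */
#elif defined(__SSE__)
  #include <xmmintrin.h>   /* SSE1 - all the submitted code needs */
#else
  /* plain C fallback */
#endif
```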

kkofler commented 4 years ago

Looking closer at your vec_sse.h (and the older vec*.h headers), I see that it only really does one type of vector operation: single-precision float multiplications and additions, and that newer SSEx intrinsics probably won't help with those. As far as I can tell, SSE 1 already does these operations with 128-bit vectors of float, and the next higher is 256-bit vectors of float, added by AVX 1. SSE2 adds double and integer versions, which are not needed here, as far as I can tell.

What I don't see either, though, is why SSE4.1 helps. I suspect that GCC may be automatically vectorizing some code elsewhere not currently covered by vec*.h at all.
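For reference, the operation under discussion looks roughly like this (a simplified sketch, not a copy of vec_sse.h): SSE1 already handles 4-wide float multiply-accumulate, and AVX mainly widens the same pattern to 8 floats per step (compile with -mavx for the second function):

```c
#include <immintrin.h>   /* SSE and AVX float intrinsics */

/* 4 floats per iteration with SSE1 intrinsics (assumes len % 4 == 0) */
static void mac_sse(float *out, const float *w, const float *x, int len)
{
    for (int i = 0; i < len; i += 4) {
        __m128 acc = _mm_loadu_ps(&out[i]);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(&w[i]), _mm_loadu_ps(&x[i])));
        _mm_storeu_ps(&out[i], acc);
    }
}

/* 8 floats per iteration with AVX intrinsics (assumes len % 8 == 0) */
static void mac_avx(float *out, const float *w, const float *x, int len)
{
    for (int i = 0; i < len; i += 8) {
        __m256 acc = _mm256_loadu_ps(&out[i]);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(&w[i]), _mm256_loadu_ps(&x[i])));
        _mm256_storeu_ps(&out[i], acc);
    }
}
```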

hobbes1069 commented 4 years ago

Thanks for digging into this, Kevin. So it sounds like we should be targeting AVX? I noted that on my system AVX2 showed very little improvement over AVX.

jg1uaa commented 4 years ago

We know that including <xmmintrin.h> says "I will write SSE1 intrinsics in my code". Using another intrinsics header file does not automatically replace the code with newer SSE instructions; what matters is what the coder actually wrote.

I used <xmmintrin.h> in vec_sse.h because the code uses only SSE1 instructions. If another SSE version is needed in this code, replace it with the suitable header, or simply include a header that covers all x86 intrinsics (MMX/SSE/AVX and future extensions), such as <x86intrin.h>.

But the other portions (outside vec_sse.h) are controlled by the -msseX flags. The object built with the SSE4.1 flag performs very well, so maybe the compiler finds useful SSE4.1 instructions there.

I took benchmarks with some combinations of compiler flags. It is said that -ftree-vectorize is enabled by default as part of -O3, so there is no need to discuss that, but the -ffast-math results are interesting.

I thought 32-bit SSE (all versions) and 64-bit SSE1~3 had poor performance and were not usable, but this option makes them usable. Of course we have to be careful about the side effects of -ffast-math, but it is worth considering.

flags | A10/64bit (s) | A8/64bit (s) | Ath3000/64bit (s) | A8/32bit (s) | Ath3000/32bit (s)
-O3 | 76.188 | 50.025 | 149.077 | 314.338 | 382.752
-O3 -msse | 76.073 | 49.667 | 149.702 | 223.131 | 322.325
-O3 -msse2 | 76.135 | 49.665 | 150.885 | 223.834 | 322.434
-O3 -msse3 | 76.152 | 49.732 | 151.59 | 216.926 | N/A
-O3 -msse4.1 | 42.03 | 39.125 | N/A | 217.141 | N/A
-O3 -mavx | 21.015 | 21.014 | N/A | 21.288 | N/A
-O3 -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A
-O3 -ftree-vectorize | 76.362 | 49.869 | 157.634 | 311.65 | 382.083
-O3 -ftree-vectorize -msse | 76.321 | 49.69 | 151.02 | 223.675 | 336.876
-O3 -ftree-vectorize -msse2 | 76.3 | 49.82 | 153.214 | 222.964 | 326.009
-O3 -ftree-vectorize -msse3 | 76.224 | 49.976 | 151.591 | 216.132 | N/A
-O3 -ftree-vectorize -msse4.1 | 42.006 | 39.314 | N/A | 215.871 | N/A
-O3 -ftree-vectorize -mavx | 20.873 | 21.246 | N/A | 21.192 | N/A
-O3 -ftree-vectorize -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A
-O3 -fno-tree-vectorize | 81.821 | 52.993 | 155.804 | 294.387 | 390.964
-O3 -fno-tree-vectorize -msse | 81.757 | 53.096 | 160.589 | 227.272 | 325.384
-O3 -fno-tree-vectorize -msse2 | 81.726 | 53.115 | 154.341 | 227.661 | 325.888
-O3 -fno-tree-vectorize -msse3 | 81.847 | 53.173 | 157.139 | 221.434 | N/A
-O3 -fno-tree-vectorize -msse4.1 | 44.325 | 42.566 | N/A | 221.352 | N/A
-O3 -fno-tree-vectorize -mavx | 22.751 | 24.297 | N/A | 25.116 | N/A
-O3 -fno-tree-vectorize -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A
-O3 -DFLOAT_APPROX -ffast-math | 71.648 | 39.298 | 133.345 | 98.834 | 184.772
-O3 -DFLOAT_APPROX -ffast-math -msse | 71.752 | 39.466 | 127.423 | 44.555 | 142.49
-O3 -DFLOAT_APPROX -ffast-math -msse2 | 71.801 | 39.219 | 125.526 | 38.455 | 122.256
-O3 -DFLOAT_APPROX -ffast-math -msse3 | 71.569 | 39.528 | N/A | 38.666 | N/A
-O3 -DFLOAT_APPROX -ffast-math -msse4.1 | 31.662 | 39.328 | N/A | 38.518 | N/A
-O3 -DFLOAT_APPROX -ffast-math -mavx | 19.763 | 20.825 | N/A | 20.416 | N/A
-O3 -DFLOAT_APPROX -ffast-math -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize | 76.757 | 42.637 | 130.22 | 99.138 | 188.404
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -msse | 76.588 | 42.729 | 160.975 | 49.015 | 145.562
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -msse2 | 76.703 | 42.851 | 132.309 | 42.176 | 127.573
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -msse3 | 76.776 | 42.573 | 130.951 | 42.18 | 132.809
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -msse4.1 | 34.382 | 43.402 | N/A | 41.826 | N/A
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -mavx | 21.751 | 23.563 | N/A | 23.206 | N/A
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A
-Ofast -DFLOAT_APPROX | 68.393 | 39.127 | 126.724 | 98.815 | 188.336
-Ofast -DFLOAT_APPROX -msse | 68.206 | 39.334 | 132.759 | 44.71 | 138.427
-Ofast -DFLOAT_APPROX -msse2 | 68.135 | 39.099 | 126.228 | 38.914 | 123.181
-Ofast -DFLOAT_APPROX -msse3 | 67.567 | 39.546 | N/A | 39.035 | N/A
-Ofast -DFLOAT_APPROX -msse4.1 | 28.1 | 39.23 | N/A | 38.701 | N/A
-Ofast -DFLOAT_APPROX -mavx | 16.164 | 21.136 | N/A | 20.317 | N/A
-Ofast -DFLOAT_APPROX -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A

I think the candidate compiler flag combinations are:

safe:
  64-bit AVX     (-O3 -mavx)
  64-bit SSE4.1  (-O3 -msse4.1)
  32-bit AVX     (-O3 -mavx)

experimental:
  64-bit SSE2    (-O3 -DFLOAT_APPROX -ffast-math -msse2)
  32-bit SSE     (-O3 -DFLOAT_APPROX -ffast-math -msse)

If there is no problem using -ffast-math, we can simplify them to:

AVX mode:
  64-bit AVX     (-O3 -mavx)
  32-bit AVX     (-O3 -mavx)

SSE mode:
  64-bit SSE2    (-O3 -DFLOAT_APPROX -ffast-math -msse2)
  32-bit SSE     (-O3 -DFLOAT_APPROX -ffast-math -msse)

kkofler commented 4 years ago

> We know that including <xmmintrin.h> says "I will write SSE1 intrinsics in my code". Using another intrinsics header file does not automatically replace the code with newer SSE instructions.

Of course. Including a header file does not do anything by itself. You need to actually use the functions declared in the header file to make use of newer SSE instruction sets.

But as I had already written, to be honest, I do not see what functions one would use to further vectorize the operations on 128-bit vectors of single-precision float with less than AVX 1 (which is what introduced 256-bit vectors of float (and other 256-bit vectors)). As I had already stated, I think that the compiler must be finding places to vectorize with SSE4.1 that are not covered by vec*.h.

drowe67 commented 4 years ago

@jg1uaa that is some very comprehensive testing :+1:

To use SSE I'd like to have a few other features, like a way to test the end user's CPU at run time. I have a task list here. If you are interested in working on any of these tasks, please let me know :smile: