@jg1uaa thanks, I will take a look.
@hobbes1069 is there a way to force SSE or AVX when cmake is run? I have a machine that supports both and would like to try compiling either/or to compare.
I build LPCNet on OpenBSD, where the automatic CPU detection does not work, so I use DISABLE_CPU_OPTIMIZATION together with SSE, AVX, or AVX2 to specify the SIMD extension, like this:
cmake -DDISABLE_CPU_OPTIMIZATION=ON -DSSE=ON -DCODEC2_BUILD_DIR=/path/to/codec2 /build_directory/ ..
I ran a benchmark on my machine, OpenBSD 6.6/amd64 (clang 8.0.1) with an A10-7860K. The method simply pipes between lpcnet_enc and lpcnet_dec, so no other modules are involved, like this:
cd /path/to/LPCNet
cd build_directory/src
time cat ../../wav/all.wav | ./lpcnet_enc -s > test.out
time cat test.out | ./lpcnet_dec -s > /dev/null
Here are the results:
option | encode | decode
---|---|---
-msse | 2.609s | 1m17.046s
-msse2 | 2.552s | 1m17.323s
-msse3 | 2.554s | 1m17.117s
-msse4.1 | 2.550s | 42.694s
-msse4.2 | 2.561s | 42.697s
-mavx | 2.567s | 21.907s
(no SIMD) | 2.552s | 1m42.005s
all.wav takes 49 seconds, so I think the encode/decode time should be well under that. I will check your benchmark method and ctest later.
@jg1uaa sorry my last comment was incorrect, my mistake!
Thanks - your cmake line worked for me, and the SIMD ctest works :1st_place_medal:
ctest -V -R SIMD_functions
5: sgemv_accum16.....................: pass
5: sparse_sgemv_accum16..............: pass
1/1 Test #5: SIMD_functions ................... Passed 0.03 sec
Here are some comparative timings on my Lenovo X230 (i5-3320M CPU @ 2.60GHz). all.wav
is a 50 second file.
$ cd LPCNet/build_linux/src
$ time sox ../../wav/all.wav -t raw -r 16000 - | ./dump_data --c2pitch --test - - | ./test_lpcnet - /dev/null
SIMD | Time (s) | % real time |
---|---|---|
None | 49.216 | 100 |
AVX | 15.354 | 38 |
SSE | 26.608 | 53 |
I think the SSE results are pretty good :smile:
> I ran a benchmark on my machine, OpenBSD 6.6/amd64 (clang 8.0.1) with an A10-7860K. The method simply pipes between lpcnet_enc and lpcnet_dec, so no other modules are involved.
OK your results are quite a bit slower than my machine, that's curious. I will try a few more machines I have over the next few days.
Idea: we could also make .travis.yml build both AVX and SSE versions and run the all.wav test on both, printing the execution time for comparison.
Hi, here are the results of the benchmark using your method.
SIMD | Time(s) | % real time |
---|---|---|
None | 101.988 | 205 |
SSE | 76.672 | 154 |
SSE2 | 76.774 | 154 |
SSE3 | 76.549 | 154 |
SSE4.1 | 42.371 | 85 |
AVX | 21.449 | 43 |
SIMD | Time(s) | % real time |
---|---|---|
None | 72.498 | 146 |
SSE | 50.598 | 102 |
SSE2 | 50.767 | 102 |
SSE3 | 50.873 | 102 |
SSE4.1 | 40.661 | 82 |
AVX | 21.196 | 43 |
gcc looks slightly faster than clang, but it is not a big difference. Both the A8-7670K and the A10-7860K have Steamroller CPU cores, part of the Bulldozer (AMD FX) family. These cores support AVX/SSE, but their performance is lower than Intel's CPUs.
For posterity (and because David and I had a discussion about using brute force without optimizations), here are the results:
SIMD | Time (s) | % real time | % improvement |
---|---|---|---|
None | 19.796 | 39.8% | 0.0% |
SSE 4.1 | 17.971 | 36.1% | 9.2% |
AVX | 10.185 | 20.5% | 48.6% |
AVX2 | 9.459 | 19.0% | 52.2% |
Since the binary is single threaded it probably didn't make a difference but I'm also running Folding@Home on my RX 580 and have wsjtx running in the background, which hits one core really hard every 15 seconds.
@hobbes1069 wow that's a fast machine! I stand corrected - you could indeed run 2020 without acceleration on your machine.
I'm inclined to set the bar at about 50% loading, as I know it works for me. It's also clear there are wide variations across machines.
One of the issues here is support. When I released 700D I had to deal with some really ancient machines (e.g. Windows XP), that appear common in the Ham community. I'd like to use automation to avoid support issues of "it doesn't work" because someone's machine is too slow.
However since releasing 2020 I've become aware of a class of "modern, but without AVX" machines which might just run with SSE (or indeed no acceleration like yours Richard).
A couple tasks for future PRs:
The thing is, SSE4.1 is still too much to assume for GNU/Linux distribution packaging; the baseline for x86_64 is SSE2. (Most distributions also assume SSE2 for 32-bit x86 these days, if they support it at all.)
John Reiser suggests using multi-threading in addition to the vectorization. That might make SSE2 actually work on machines where it matters. (SSE2 or even unvectorized plain C being fast enough on Ryzen 5 CPUs that also support AVX2 is unfortunately mostly academic.)
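For illustration, here is a minimal sketch of the multi-threading idea, assuming a dense row-major matrix-vector accumulate (the function name and signature are hypothetical, not LPCNet's actual sgemv_accum16/sparse_sgemv_accum16 kernels): rows are independent, so OpenMP can split them across cores while each row still uses 128-bit SSE multiply-adds.

```c
#include <xmmintrin.h>  /* SSE 1 intrinsics */

/* Hypothetical dense accumulate: out[i] += sum_j w[i*cols + j] * x[j].
 * Build with e.g. -O3 -msse -fopenmp. Assumes cols is large enough that the
 * per-thread work amortizes the OpenMP overhead. */
static void sgemv_accum_mt(float *out, const float *w, int rows, int cols,
                           const float *x)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < rows; i++) {
        __m128 acc = _mm_setzero_ps();
        int j;
        for (j = 0; j + 4 <= cols; j += 4) {
            __m128 wv = _mm_loadu_ps(&w[i * cols + j]);
            __m128 xv = _mm_loadu_ps(&x[j]);
            acc = _mm_add_ps(acc, _mm_mul_ps(wv, xv));
        }
        float tmp[4];
        _mm_storeu_ps(tmp, acc);
        float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
        for (; j < cols; j++)   /* scalar tail */
            sum += w[i * cols + j] * x[j];
        out[i] += sum;
    }
}
```

Whether this actually rescues the SSE2-only case would need measuring; the per-frame matrices may be too small for the thread overhead to pay off.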
Currently FreeDV (the only intended consumer of lpcnet) disables the 2020 mode if AVX is not detected. What we may need to do is pop up a warning and allow the end user to override it and enable 2020 anyway.
That at least solves the question of whether we should provide distro packages that don't have optimizations. I think the simple solution in the short term is to build both non-optimized and AVX packages (build twice) and provide separate packages:
- lpcnetfreedv
- lpcnetfreedv-avx
The second of these would both have a virtual provide for, and conflict with, "lpcnetfreedv".
Not ideal at all, but as this is an "experimental" mode the intended audience should be able to deal with it easily.
@kkofler SSE4.1 is mandatory for performance.
On i386 we can only use 8 SSE registers, but x86_64 supports 16. This difference makes a big performance/optimization difference for applications such as LPCNet. I think LPCNet-SSE should only be supported on x86_64 processors with SSE4 support. At least, all Core i processors have 64-bit mode and the SSE4.1 instruction set, so no problem there.
I tested SSE2/3 on an ancient Athlon 64 3000+ processor; its benchmark results were very poor.
SIMD | 32bit (s) | 64bit (s) |
---|---|---|
(no SIMD) | 322.437 | 147.914 |
-msse | 327.245 | 147.977 |
-msse2 | 331.520 | 149.590 |
-msse3 | N/A | 150.772 |
-msse4.1 | N/A | N/A |
-mavx | N/A | N/A |
I don't know why the SSE3-optimized code runs on a processor that doesn't support it (perhaps, luckily, no SSE3 instructions were actually emitted?).
FYI @hobbes1069 and I are brainstorming this one on https://github.com/drowe67/LPCNet/issues/27
It looks like the difference between all the `-msse*` flags there is compiler vectorization only. The code you submitted uses only SSE1 intrinsics, so there is no reason why `-msse4.1` would by itself make it any faster. SSE4.1 intrinsics are in `smmintrin.h`, not `xmmintrin.h`.
And typically, just using `-msse2` is not enough to get vectorized code from the compiler; you need to actually turn on compiler vectorization, though some of it is now automatically enabled by GCC at `-O3`, which I guess you are using. Try using `-ftree-vectorize` in addition. But to get really effective SSE2 code, try actually using SSE2 intrinsics from `emmintrin.h` instead of SSE 1 ones from `xmmintrin.h`. And the compiler vectorization might actually work better on the plain C code than on your code using SSE 1 intrinsics, so if you do not want to try writing actual SSE2 intrinsics, please at least try that.
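To make that concrete, here is a sketch (not the actual vec_sse.h code) of the same multiply-accumulate written once with explicit SSE 1 intrinsics and once as plain C that the compiler is free to auto-vectorize for whatever -msseX/-mavx target is enabled:

```c
#include <xmmintrin.h>  /* SSE 1: 128-bit float vectors */

/* Explicit SSE 1 intrinsics: always compiles to SSE 1 instructions,
 * no matter which -msseX flag is used. Assumes n is a multiple of 4. */
static void mac_sse1(float *y, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 acc = _mm_loadu_ps(&y[i]);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(&a[i]), _mm_loadu_ps(&b[i])));
        _mm_storeu_ps(&y[i], acc);
    }
}

/* Plain C: the auto-vectorizer may turn this into SSE2, SSE4.1, AVX or FMA
 * code depending on -msseX/-mavx/-mfma and -O3/-ftree-vectorize. */
static void mac_plain(float *y, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        y[i] += a[i] * b[i];
}
```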
This also explains why `-msse3` can end up using no SSE3 instructions, if the compiler finds nothing to vectorize with SSE3. If you actually use SSE3 intrinsics from `pmmintrin.h`, you will definitely get SSE3 code.
Likewise, there are also separate preprocessor flags for `__SSE2__` etc.; `__SSE__` only checks for SSE 1 availability (but that is all that your current code actually requires, only the `CMakeLists.txt` wants SSE 4.1).
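For example, guards along these lines (a sketch of the general pattern, not what the current code does) pick the widest path the compiler target actually enables:

```c
/* These macros are predefined by GCC/clang when the matching -m flag is on. */
#if defined(__AVX__)
  #include <immintrin.h>   /* AVX: 256-bit float vectors */
#elif defined(__SSE4_1__)
  #include <smmintrin.h>   /* SSE4.1 intrinsics */
#elif defined(__SSE2__)
  #include <emmintrin.h>   /* SSE2 intrinsics */
#elif defined(__SSE__)
  #include <xmmintrin.h>   /* SSE 1 intrinsics */
#else
  /* plain C fallback */
#endif
```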
I have no authority over this project, but if I did, I would have rejected this pull request as is, for a simple reason: if you want to target SSE4.1 machines, you should use `smmintrin.h` intrinsics and check `__SSE4_1__`. If you submit SSE 1 code, it should be built on any CPU supporting SSE 1 (even if it is very likely to be too slow – the plain C code will be even slower).
And I urge you to try `emmintrin.h` SSE2 intrinsics, too.
Looking closer at your `vec_sse.h` (and the older `vec*.h` headers), I see that it only really does one type of vector operation: single-precision `float` multiplications and additions, and that newer SSEx intrinsics probably won't help with those. As far as I can tell, SSE 1 already does these operations with 128-bit vectors of `float`, and the next step up is 256-bit vectors of `float`, added by AVX 1. SSE2 adds `double` and integer versions, which are not needed here, as far as I can tell.
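To show what that AVX 1 step actually buys for this workload, here is an illustrative sketch (again not the project's code) contrasting the 128-bit SSE 1 multiply-add with the 256-bit AVX form, plus the fused multiply-add that -mavx2 -mfma would allow:

```c
#include <immintrin.h>  /* pulls in SSE and AVX intrinsics */

/* SSE 1: four floats per operation (any -msseX target). */
static inline __m128 madd4(__m128 acc, __m128 a, __m128 b)
{
    return _mm_add_ps(acc, _mm_mul_ps(a, b));
}

/* AVX 1: eight floats per operation (needs -mavx and an AVX CPU). */
static inline __m256 madd8(__m256 acc, __m256 a, __m256 b)
{
    return _mm256_add_ps(acc, _mm256_mul_ps(a, b));
}

/* FMA: fused multiply-add, one instruction instead of two (needs -mfma). */
static inline __m256 fmadd8(__m256 acc, __m256 a, __m256 b)
{
    return _mm256_fmadd_ps(a, b, acc);
}
```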
What I don't see either, though, is why SSE4.1 helps. I suspect that GCC may be automatically vectorizing some code elsewhere, not currently covered by `vec*.h` at all.
Thanks for digging into this Kevin. So it sounds like we should be targeting AVX? I noted on my system AVX2 showed very little improvement.
We know that
I used
But the other portions (outside vec_sse.h) are controlled by the -msseX flags. The SSE4.1-enabled object has very good performance, so maybe there are SSE4.1 instructions the compiler can use there.
I ran benchmarks with some combinations of compiler flags. It is said that -ftree-vectorize is enabled by default with -O3, so no need to discuss that, but there is an interesting result with -ffast-math.
I thought 32-bit/SSE (all) and 64-bit/SSE (1-3) had poor performance and were not usable, but this option makes them good. Of course we have to be careful about the side effects of -ffast-math, but it is worth considering.
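One plausible reason -ffast-math helps so much on the builds without intrinsics: GCC will not vectorize a floating-point reduction such as a dot product unless it is allowed to reassociate the additions, which -ffast-math (or -fassociative-math) permits. A minimal illustration:

```c
/* At -O3 without -ffast-math the additions must stay in source order, so this
 * loop remains scalar; with -ffast-math the compiler may split the sum into
 * SSE/AVX partial sums and vectorize it. */
float dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```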
flags | A10/64bit (s) | A8/64bit (s) | Ath3000/64bit (s) | A8/32bit (s) | Ath3000/32bit (s) |
---|---|---|---|---|---|
-O3 | 76.188 | 50.025 | 149.077 | 314.338 | 382.752 |
-O3 -msse | 76.073 | 49.667 | 149.702 | 223.131 | 322.325 |
-O3 -msse2 | 76.135 | 49.665 | 150.885 | 223.834 | 322.434 |
-O3 -msse3 | 76.152 | 49.732 | 151.59 | 216.926 | N/A |
-O3 -msse4.1 | 42.03 | 39.125 | N/A | 217.141 | N/A |
-O3 -mavx | 21.015 | 21.014 | N/A | 21.288 | N/A |
-O3 -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A |
-O3 -ftree-vectorize | 76.362 | 49.869 | 157.634 | 311.65 | 382.083 |
-O3 -ftree-vectorize -msse | 76.321 | 49.69 | 151.02 | 223.675 | 336.876 |
-O3 -ftree-vectorize -msse2 | 76.3 | 49.82 | 153.214 | 222.964 | 326.009 |
-O3 -ftree-vectorize -msse3 | 76.224 | 49.976 | 151.591 | 216.132 | N/A |
-O3 -ftree-vectorize -msse4.1 | 42.006 | 39.314 | N/A | 215.871 | N/A |
-O3 -ftree-vectorize -mavx | 20.873 | 21.246 | N/A | 21.192 | N/A |
-O3 -ftree-vectorize -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A |
-O3 -fno-tree-vectorize | 81.821 | 52.993 | 155.804 | 294.387 | 390.964 |
-O3 -fno-tree-vectorize -msse | 81.757 | 53.096 | 160.589 | 227.272 | 325.384 |
-O3 -fno-tree-vectorize -msse2 | 81.726 | 53.115 | 154.341 | 227.661 | 325.888 |
-O3 -fno-tree-vectorize -msse3 | 81.847 | 53.173 | 157.139 | 221.434 | N/A |
-O3 -fno-tree-vectorize -msse4.1 | 44.325 | 42.566 | N/A | 221.352 | N/A |
-O3 -fno-tree-vectorize -mavx | 22.751 | 24.297 | N/A | 25.116 | N/A |
-O3 -fno-tree-vectorize -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A |
-O3 -DFLOAT_APPROX -ffast-math | 71.648 | 39.298 | 133.345 | 98.834 | 184.772 |
-O3 -DFLOAT_APPROX -ffast-math -msse | 71.752 | 39.466 | 127.423 | 44.555 | 142.49 |
-O3 -DFLOAT_APPROX -ffast-math -msse2 | 71.801 | 39.219 | 125.526 | 38.455 | 122.256 |
-O3 -DFLOAT_APPROX -ffast-math -msse3 | 71.569 | 39.528 | N/A | 38.666 | N/A |
-O3 -DFLOAT_APPROX -ffast-math -msse4.1 | 31.662 | 39.328 | N/A | 38.518 | N/A |
-O3 -DFLOAT_APPROX -ffast-math -mavx | 19.763 | 20.825 | N/A | 20.416 | N/A |
-O3 -DFLOAT_APPROX -ffast-math -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A |
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize | 76.757 | 42.637 | 130.22 | 99.138 | 188.404 |
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -msse | 76.588 | 42.729 | 160.975 | 49.015 | 145.562 |
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -msse2 | 76.703 | 42.851 | 132.309 | 42.176 | 127.573 |
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -msse3 | 76.776 | 42.573 | 130.951 | 42.18 | 132.809 |
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -msse4.1 | 34.382 | 43.402 | N/A | 41.826 | N/A |
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -mavx | 21.751 | 23.563 | N/A | 23.206 | N/A |
-O3 -DFLOAT_APPROX -ffast-math -fno-tree-vectorize -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A |
-Ofast -DFLOAT_APPROX | 68.393 | 39.127 | 126.724 | 98.815 | 188.336 |
-Ofast -DFLOAT_APPROX -msse | 68.206 | 39.334 | 132.759 | 44.71 | 138.427 |
-Ofast -DFLOAT_APPROX -msse2 | 68.135 | 39.099 | 126.228 | 38.914 | 123.181 |
-Ofast -DFLOAT_APPROX -msse3 | 67.567 | 39.546 | N/A | 39.035 | N/A |
-Ofast -DFLOAT_APPROX -msse4.1 | 28.1 | 39.23 | N/A | 38.701 | N/A |
-Ofast -DFLOAT_APPROX -mavx | 16.164 | 21.136 | N/A | 20.317 | N/A |
-Ofast -DFLOAT_APPROX -mavx2 -mfma | N/A | N/A | N/A | N/A | N/A |
I think the candidate compiler flag combinations are:

- safe: 64-bit AVX (-O3 -mavx), 64-bit SSE4.1 (-O3 -msse4.1), 32-bit AVX (-O3 -mavx)
- experimental: 64-bit SSE2 (-O3 -DFLOAT_APPROX -ffast-math -msse2), 32-bit SSE (-O3 -DFLOAT_APPROX -ffast-math -msse)

If there is no problem with using -ffast-math, we can simplify them:

- AVX mode: 64-bit AVX (-O3 -mavx), 32-bit AVX (-O3 -mavx)
- SSE mode: 64-bit SSE2 (-O3 -DFLOAT_APPROX -ffast-math -msse2), 32-bit SSE (-O3 -DFLOAT_APPROX -ffast-math -msse)
We know that including xmmintrin.h says "I will write SSE1 intrinsics in my code". This does not mean that using another intrinsics header file will automatically substitute newer SSE instructions.
Of course. Including a header file does not do anything by itself. You need to actually use the functions declared in the header file to make use of newer SSE instruction sets.
But as I had already written, to be honest, I do not see what functions one would use to further vectorize the operations on 128-bit vectors of single-precision `float` with less than AVX 1 (which is what introduced 256-bit vectors of `float`, and other 256-bit vectors). As I had already stated, I think that the compiler must be finding places to vectorize with SSE4.1 that are not covered by `vec*.h`.
see issue #24