build fails with SIMD checksum optimizations

eworm-de commented 4 years ago

Building 3.2.0pre2 in current Arch Linux environment I get:

g++ -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -D_FORTIFY_SOURCE=2 -c -o simd-checksum-x86_64.o simd-checksum-x86_64.cpp
{standard input}: Assembler messages:
{standard input}:280: Error: symbol `_ZL21get_checksum1_avx2_64PaiiPjS0_' is already defined
{standard input}:301: Error: symbol `_ZL21get_checksum1_sse2_32PaiiPjS0_' is already defined
make: *** [Makefile:128: simd-checksum-x86_64.o] Error 1

benrubson commented 4 years ago

Some issues on Debian9 with SIMD :

g++ -g -O2  -c -o simd-checksum-x86_64.o simd-checksum-x86_64.cpp
simd-checksum-x86_64.cpp: In function ‘int32_t get_checksum1_sse2_32(signed char*, int32_t, int32_t, uint32_t*, uint32_t*)’:
simd-checksum-x86_64.cpp:161:40: error: ‘__m128i_u’ was not declared in this scope
         __m128i ss1 = _mm_loadu_si128((__m128i_u*)x);
                                        ^~~~~~~~~
simd-checksum-x86_64.cpp:161:50: error: expected primary-expression before ‘)’ token
         __m128i ss1 = _mm_loadu_si128((__m128i_u*)x);
                                                  ^
simd-checksum-x86_64.cpp:163:50: error: expected primary-expression before ‘)’ token
         __m128i ss2 = _mm_loadu_si128((__m128i_u*)x);
                                                  ^
simd-checksum-x86_64.cpp:166:53: error: expected primary-expression before ‘)’ token
         __m128i mul_t1 = _mm_loadu_si128((__m128i_u*)mul_t1_buf);
                                                     ^
simd-checksum-x86_64.cpp:178:52: error: expected primary-expression before ‘)’ token
                 in8_1 = _mm_loadu_si128((__m128i_u*)&buf[i]);
                                                    ^
simd-checksum-x86_64.cpp:179:52: error: expected primary-expression before ‘)’ token
                 in8_2 = _mm_loadu_si128((__m128i_u*)&buf[i + 16]);
                                                    ^
simd-checksum-x86_64.cpp:181:51: error: expected primary-expression before ‘)’ token
                 in8_1 = _mm_load_si128((__m128i_u*)&buf[i]);
                                                   ^
simd-checksum-x86_64.cpp:182:51: error: expected primary-expression before ‘)’ token
                 in8_2 = _mm_load_si128((__m128i_u*)&buf[i + 16]);
                                                   ^
simd-checksum-x86_64.cpp:250:36: error: expected primary-expression before ‘)’ token
         _mm_store_si128((__m128i_u*)x, ss1);
                                    ^
simd-checksum-x86_64.cpp:252:36: error: expected primary-expression before ‘)’ token
         _mm_store_si128((__m128i_u*)x, ss2);
                                    ^
simd-checksum-x86_64.cpp: In function ‘int32_t get_checksum1_avx2_64(signed char*, int32_t, int32_t, uint32_t*, uint32_t*)’:
simd-checksum-x86_64.cpp:281:43: error: ‘__m256i_u’ was not declared in this scope
         __m256i ss1 = _mm256_lddqu_si256((__m256i_u*)x);
                                           ^~~~~~~~~
simd-checksum-x86_64.cpp:281:53: error: expected primary-expression before ‘)’ token
         __m256i ss1 = _mm256_lddqu_si256((__m256i_u*)x);
                                                     ^
simd-checksum-x86_64.cpp:283:53: error: expected primary-expression before ‘)’ token
         __m256i ss2 = _mm256_lddqu_si256((__m256i_u*)x);
                                                     ^
simd-checksum-x86_64.cpp:287:56: error: expected primary-expression before ‘)’ token
         __m256i mul_t1 = _mm256_lddqu_si256((__m256i_u*)mul_t1_buf);
                                                        ^
simd-checksum-x86_64.cpp:293:55: error: expected primary-expression before ‘)’ token
                 in8_1 = _mm256_lddqu_si256((__m256i_u*)&buf[i]);
                                                       ^
simd-checksum-x86_64.cpp:294:55: error: expected primary-expression before ‘)’ token
                 in8_2 = _mm256_lddqu_si256((__m256i_u*)&buf[i + 32]);
                                                       ^
simd-checksum-x86_64.cpp:296:54: error: expected primary-expression before ‘)’ token
                 in8_1 = _mm256_load_si256((__m256i_u*)&buf[i]);
                                                      ^
simd-checksum-x86_64.cpp:297:54: error: expected primary-expression before ‘)’ token
                 in8_2 = _mm256_load_si256((__m256i_u*)&buf[i + 32]);
                                                      ^
simd-checksum-x86_64.cpp:372:39: error: expected primary-expression before ‘)’ token
         _mm256_store_si256((__m256i_u*)x, ss1);
                                       ^
simd-checksum-x86_64.cpp:374:39: error: expected primary-expression before ‘)’ token
         _mm256_store_si256((__m256i_u*)x, ss2);
                                       ^
Makefile:128: recipe for target 'simd-checksum-x86_64.o' failed
make: *** [simd-checksum-x86_64.o] Error 1

# g++ --version
g++ (Debian 6.3.0-18+deb9u1) 6.3.0 20170516

No issue with Debian10 (g++ (Debian 8.3.0-6) 8.3.0). Don't know then whether or not we must pay attention to old g++ versions...

eworm-de commented 4 years ago

Arch Linux has:

# g++ --version
g++ (GCC) 10.1.0

eworm-de commented 4 years ago

Our build environment has CXXFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fno-plt" by default. Build succees if I set this to CXXFLAGS="-mavx2 -mtune=generic -O2 -pipe -fno-plt". No idea what a correct fix should look like.

WayneD commented 4 years ago

One thing I noticed was that a CXXFLAGS tweak in configure was not getting added (due to my tweaking the logic in that section). I made it add these extra flags once again:

    CXXFLAGS="$CXXFLAGS -fno-exceptions -fno-rtti"

Does that help at all? Seems doubtful for the listed compile issue.

Feel free to configure with --disable-simd to continue your rsync testing. Hopefully we can figure out a rule for fixing this up, though.

hhoffstaette commented 4 years ago

One thing I noticed was that a CXXFLAGS tweak in configure was not getting added (due to my tweaking the logic in that section). I made it add these extra flags once again:
    CXXFLAGS="$CXXFLAGS -fno-exceptions -fno-rtti"
Does that help at all? Seems doubtful for the listed compile issue.

It does not (also using gcc-10 with default configure run):

g++ -g -O2 -fno-exceptions -fno-rtti  -c -o simd-checksum-x86_64.o ./simd-checksum-x86_64.cpp
/tmp/cc4A4Qzb.s: Assembler messages:
/tmp/cc4A4Qzb.s:1835: Error: symbol `_ZL21get_checksum1_avx2_64PaiiPjS0_' is already defined
/tmp/cc4A4Qzb.s:1856: Error: symbol `_ZL21get_checksum1_sse2_32PaiiPjS0_' is already defined
make: *** [Makefile:128: simd-checksum-x86_64.o] Error 1

Trying to building with clang (10.0) gives better error messages:

clang++ -g -O2 -c -o simd-checksum-x86_64.o ./simd-checksum-x86_64.cpp
./simd-checksum-x86_64.cpp:123:59: error: function declaration cannot become a multiversioned function after first usage
__attribute__ ((target("default"))) static inline __m128i sse_interleave_odd_epi16(__m128i a, __m128i b) { return a; }
                                                          ^
./simd-checksum-x86_64.cpp:124:59: error: function declaration cannot become a multiversioned function after first usage
__attribute__ ((target("default"))) static inline __m128i sse_interleave_even_epi16(__m128i a, __m128i b) { return a; }
                                                          ^
./simd-checksum-x86_64.cpp:125:59: error: function declaration cannot become a multiversioned function after first usage
__attribute__ ((target("default"))) static inline __m128i sse_mulu_odd_epi8(__m128i a, __m128i b) { return a; }
                                                          ^
./simd-checksum-x86_64.cpp:126:59: error: function declaration cannot become a multiversioned function after first usage
__attribute__ ((target("default"))) static inline __m128i sse_mulu_even_epi8(__m128i a, __m128i b) { return a; }
                                                          ^
./simd-checksum-x86_64.cpp:154:17: error: 'target' attribute takes one argument
__attribute__ ((target("sse2", "ssse3"))) static int32 get_checksum1_sse2_32(schar* buf, int32 len, int32 i, uint32* ps1, uint32* ps2)
                ^
./simd-checksum-x86_64.cpp:385:17: warning: attribute declaration must precede definition [-Wignored-attributes]
__attribute__ ((target("default"))) static int32 get_checksum1_sse2_32(schar* buf, int32 len, int32 i, uint32* ps1, uint32* ps2)
                ^
./simd-checksum-x86_64.cpp:154:56: note: previous definition is here
__attribute__ ((target("sse2", "ssse3"))) static int32 get_checksum1_sse2_32(schar* buf, int32 len, int32 i, uint32* ps1, uint32* ps2)
                                                       ^
./simd-checksum-x86_64.cpp:385:50: error: redefinition of 'get_checksum1_sse2_32'
__attribute__ ((target("default"))) static int32 get_checksum1_sse2_32(schar* buf, int32 len, int32 i, uint32* ps1, uint32* ps2)
                                                 ^
./simd-checksum-x86_64.cpp:154:56: note: previous definition is here
__attribute__ ((target("sse2", "ssse3"))) static int32 get_checksum1_sse2_32(schar* buf, int32 len, int32 i, uint32* ps1, uint32* ps2)
                                                       ^
1 warning and 6 errors generated.

Feel free to configure with --disable-simd to continue your rsync testing.

That worked better.. and then failed on the man page. :}

WayneD commented 4 years ago

If you mean the help-FOO.h extractions from the man pages, the latest master has switched to an awk script and tweaked the Makefile to avoid using $<. Try grabbing it via git or the .zip download and give that a try.

If you mean the md2man script, it requires one of the python3 libs mentioned in the NEWS.md file.

benrubson commented 4 years ago

Does that help at all? Seems doubtful for the listed compile issue.

Does not help for the Debian9 compile issue listed above.

benrubson commented 4 years ago

On FreeBSD there's no g++.

And with clang++ 8.0.1, we face same sort of errors as above. Details : https://pastebin.com/raw/e0DAinJx

~But seems to works flawlessly with clang-cpp :~ ~(...)~

hhoffstaette commented 4 years ago

But seems to works flawlessly with clang-cpp :

That doesn't do what you think, clang-cpp is the C preprocessor and the resulting .o file is invalid (not an object file).

hhoffstaette commented 4 years ago

So there are a couple of things to unpack here.

1) Unlike gcc, clang does not like multiple target arguments (line 154); they must be passed as one string containing comma-separated values. Apparently that would also work for gcc (see GCC Wiki).

2) Depending on passed target arch (-mtune or -march) you'll get two, one or no errors:

$g++ -march=athlon64 -pipe -O2 -fno-exceptions -fno-rtti  -c -o simd-checksum-x86_64.o ./simd-checksum-x86_64.cpp   
{standard input}: Assembler messages:
{standard input}:383: Error: symbol `_ZL21get_checksum1_avx2_64PaiiPjS0_' is already defined
{standard input}:404: Error: symbol `_ZL21get_checksum1_sse2_32PaiiPjS0_' is already defined

$g++ -march=corei7 -pipe -O2 -fno-exceptions -fno-rtti  -c -o simd-checksum-x86_64.o ./simd-checksum-x86_64.cpp
{standard input}: Assembler messages:
{standard input}:170: Error: symbol `_ZL21get_checksum1_avx2_64PaiiPjS0_' is already defined

--> core-i7 (or core-i7-avx) aka SandyBridge has AVX but *not* AVX2, so apparently the backend
doesn't recognize this as a multiversion symbol, ultimately sees two function definitions and
'correctly' complains.

$g++ -march=core-avx2 -pipe -O2 -fno-exceptions -fno-rtti  -c -o simd-checksum-x86_64.o ./simd-checksum-x86_64.cpp 
$ \o/

Offhand I'm not sure what to do here (not to mention clang compatibility). I tried a crafty trick by declaring get_checksum1_avx2_64 as ((target("tune=core-avx2"))) in order to avoid having to explicitly pass a -march, but of course got outsmarted:

./simd-checksum-x86_64.cpp: In function '_ZL21get_checksum1_avx2_64PaiiPjS0_.resolver':
./simd-checksum-x86_64.cpp:273:57: error: ISA 'tune=core-avx2' is not supported in 'target' attribute, use 'arch=' syntax
  273 | __attribute__ ((target("tune=core-avx2"))) static int32 get_checksum1_avx2_64(schar* buf, int32 len, int32 i, uint32* ps1, uint32* ps2)
      |                                                         ^~~~~~~~~~~~~~~~~~~~~

3) Finally the real problem: after explicitly building only simd-checksum.cpp with -march=core-avx2 and linking, I benchmarked the whole thing and .. got almost exactly the same performance (a few hundred KB/s more) as my existing rsync-3.1.3 built with -O2 -march=native -funroll-loops, clocking in at ~477 MB/s for a multi-GB whole-file sync on tmpfs. Unless I'm doing something wrong here this IMHO just does not seem to be worth the inevitable widespread build trouble for distros, and plain old gcc vectorization (via -O3) and some loop unrolling will do just fine for most people. If anybody has smarter ideas or a benchmark that yields radically different results with hand-rolled SIMD enabled, please post! Otherwise.. ¯\_(ツ)_/¯

benrubson commented 4 years ago

You could also --enable-openssl to benchmark using the OpenSSL hash algorithm version, could be interesting, vs SIMD. Otherwise... xxhash is certainly the new way to go...

WayneD commented 4 years ago

You don't need the enable -- it uses openssl if the dev lib is found. You just need --disable-openssl if you don't want to use it.

Which reminds me, if you're using both, then the MD5 routine is coming from openssl & only the checksum1 routine is coming from the SIMD code. You'd need to --disable-openssl to test the SIMD version of MD5.

WayneD commented 4 years ago

As for the speed, one of the big targets of the SIMD stuff the rolling checksum1 computation, which does not get called with a whole-file sync. For a normal file transfer the speed of the full-file transfer checksum is usually hidden in all the I/O. The more checksum comparisons it does to find matching blocks (which usually takes larger files) the more it should speed up. Using xxhash as the checksum choice lets it get used for the checksum2 routine, which is probably the biggest win you can see.

I ran some speed tests a while ago where I changed the generator's call to file_checksum() to be in a 10000-iteration loop on a small file and only saw around a 10% improvement in the MD5 routines (around 3-4 seconds shaved from 35). Compare that to xxhash, which was about 6x faster. The MD4 version in openssl was about 28% faster than the one in rsync, though.

hhoffstaette commented 4 years ago

which does not get called with a whole-file sync.

Yes, but I mostly use whole-file.. ;) Decided to activate xxhash as well, and with --checksum-choice=xxh64 I now repeatably get ~1.72GB/s locally on tmpfs, which is good enough. Network transfers still won't go over ~320 MB/s though (or ~660 with ssh-hpn and none-cipher - actually not that bad for 10Gb), thanks to ssh. :anchor: For what it's worth, xxhash almost doubles in speed from ~10-12 GB/s with -O2 to >20 (!) when built with -O3, so there's still room for improvement.. :smile: :rocket: Anyway, you're right that xxhash will be a huge improvement, so I'm not really sure this C++/SIMD thing is really worth the build trouble. The md5-asm stuff is nice though and seems unproblematic.

WayneD commented 4 years ago

Note that you don't need the --cc=xxh64 in most cases because it's the default hash when present (unless overridden).

The SIMD thing does seem to have some portability issues at the moment, but the current configure only tries to cajole the caller to install g++ on Linux now, and anyone who has g++ and has an issue compiling it can use ./configure --disable-simd to turn it off, so it seems pretty reasonable to me.

WayneD commented 4 years ago

If folks disagree and think it should be off except when someone chooses to use --enable-simd, feel free to let me know.

WayneD commented 4 years ago

I've actually changed my mind. I figure if we go out with a lot of folks needing to manually disable it, then it will be harder to get it enabled when it is more portable. I've switched it back to manually enabled for now, and we can see if we can get it more portable for the next release.

benrubson commented 4 years ago

I figure if we go out with a lot of folks needing to manually disable it, then it will be harder to get it enabled when it is more portable.

I agree.

we can see if we can get it more portable for the next release.

Perhaps @Chainfire (contributor for this patch) has some clues to "quickly" fix this ?

hhoffstaette commented 4 years ago

It should be enough to add -march=core-avx2 as last argument in CXXFLAGS (see my comment above), but only to simd-checksum-x86_64.cpp. The rest should be built as configured and according to the target platform.

eworm-de commented 4 years ago

Well, simd-checksum-x86_64.cpp is the only code built with g++... So you can just set CXXFLAGS. Plain C files are built with CFLAGS.

benrubson commented 4 years ago

It should be enough to add -march=core-avx2 as last argument in CXXFLAGS (see my comment above)

I gave a try with your command on FreeBSD (clang 8.0.1) : clang++ -march=core-avx2 -pipe -O2 -fno-exceptions -fno-rtti -c -o simd-checksum-x86_64.o ./simd-checksum-x86_64.cpp I still get the same errors are above. And on Debian9 (g++ 6.3.0), same errors as above. Does not help unfortunately :|

hhoffstaette commented 4 years ago

I gave a try with your command on FreeBSD (clang 8.0.1)

I didn't mention it explicitly in my previous comments, but I could not even get the SIMD code to build with clang - this is definitely a code problem. While clang generally supports multiversioning (see here), there are apparently "slight differences" to gcc. After fixing the multi-argument string from ((target("sse2", "ssse3"))) to ((target("sse2,ssse3"))) (a single argument string) and moving the dummy target("default") sse functions above their target(sse) versions it will compile (even without explicit -march), but then fail when linking. Couldn't figure out what that was about last night. So: no manual SIMD with clang until the code gets fixed. You will still get regular auto-vectorization with clang at -O2 or higher though, so it's not completely slow.

Chainfire commented 4 years ago

I'll have a look at this. I don't remember why but I thought rsync was gcc-only, hence no clang testing was done at all. Obviously this was wrong and should be corrected. Until it does, I agree the flag should be disabled. I'll have a test round on a number of different distros and gcc/clang versions and see what pops up.

Don't know then whether or not we must pay attention to old g++ versions...

The instructions themselves should work from GCC 4.8 or so onwards. Seems more like a header issue in your case.

Build succees if I set this to CXXFLAGS="-mavx2 -mtune=generic -O2 -pipe -fno-plt"

-m shouldn't be needed at all, that's the whole idea of the target attributes

I made it add these extra flags once again:
    CXXFLAGS="$CXXFLAGS -fno-exceptions -fno-rtti"

That shouldn't have any influence on this issue, but these flags should definitely be included to avoid a dependency on libstdc++ which is a whole different can of worms you want to avoid if at all possible.

Unlike gcc, clang does not like multiple target arguments (line 154); they must be passed as one string containing comma-separated values. Apparently that would also work for gcc (see GCC Wiki).

I guess I'm overlooking it because I don't see that mentioned in the link. But I'll give it a shot.

Depending on passed target arch (-mtune or -march) you'll get two, one or no errors:

You shouldn't need to pass -m at all, the error is likely something else.

... I benchmarked the whole thing and ... Unless I'm doing something wrong here this IMHO just does not seem to be worth the inevitable widespread build trouble for distros, and plain old gcc vectorization (via -O3) and some loop unrolling will do just fine for most people.

What you're doing doesn't even run this code so it's no surprise you see no difference; the relevant code if used is a bottleneck on low-end CPUs with fast disks and network, though (common on consumer level NAS).

I ran some speed tests a while ago where I changed the generator's call to file_checksum() to be in a 10000-iteration loop on a small file and only saw around a 10% improvement in the MD5 routines (around 3-4 seconds shaved from 35). Compare that to xxhash, which was about 6x faster. The MD4 version in openssl was about 28% faster than the one in rsync, though.

For the md5-asm submit you shouldn't expect anything beyond that 10%. It's not SIMD, it's just the ASM version of MD5 also present in older versions of OpenSSL (that specific bit of ASM is public domain rather than OpenSSL licensed), so those who don't build with OpenSSL for whatever reason still get that speed. It's not parallelized and really has nothing to do with SIMD, other than that the CPU target is the same and was hidden behind the same switch for convenience as compatibility should be virtually the same.

I did submit a SIMD optimized MD5 as well, but this is still in patches (and will likely suffer from the exact same compilation issues as discussed here, so those need to be figured out first). That version should show up to 6x speed increase (depending on CPU) compared to md5-asm already in master.

but the current configure only tries to cajole the caller to install g++ on Linux now

Out of curiosity, as I read the configure script (haven't actually run it without g++) it tells the user to run with --disable-simd, why is it not just disabled automatically, as was the intent in my original submit? It makes sense to throw this error if the flag is manual enable, but if its auto-enabled then detection should do its thing.

Anyway, you're right that xxhash will be a huge improvement, so I'm not really sure this C++/SIMD thing is really worth the build trouble.

Using xxhash doesn't circumvent this code, it circumvents the MD5 code.

Chainfire commented 4 years ago

I don't remember why but I thought rsync was gcc-only, hence no clang testing was done at all.

I've gone back to my notes and apparently I already knew it wouldn't work on clang, hence it checking specifically for g++, and in my original submission SIMD being automatically disabled if the c++ compiler wasn't present or not g++.

After some more digging, clang++ has supported the target attribute for a while, but it didn't automatically dispatch the correct function like gcc does, so it wouldn't work anyway. It seems the latest versions of clang++ have brought support up to par with g++, but I haven't actually tested that yet. Even if the code can be made compatible with clang++, version testing will be needed.

That's not something I like doing, but seeing some of the build errors here it may need to be done even for g++.

Either way, there will be cases where configure will find SIMD cannot be enabled, and I stand by my previous statement that in that case it should automatically be disabled, it should only throw an error if --enable-simd was specifically passed to configure (so if its default enabled, then there shouldn't be an error, just a silent disable).

Chainfire commented 4 years ago

Pull request with the fixes. Tested on several distros with a wide range of gcc and clang versions.

It would be helpful if you guys can confirm this solves the build problems for you, @eworm-de @benrubson @hhoffstaette

eworm-de commented 4 years ago

I guess we can close this with 4f539ccf21c173b97f310bb9f80d2cbedfe11e7d in master?

Chainfire commented 4 years ago

That just broke FreeBSD build. If that is corrected and @benrubson confirms his issue is solved ?

EDIT: That breakage should be fixed now. So just waiting for Ben to try to build the current master branch with --enable-simd, I guess.

benrubson commented 4 years ago

Both Debian9 (gcc 6.3.0) and FreeBSD12.1 (clang 8.0.1) which failed above now correctly build :+1: I also gave clang 10 a try on FreeBSD12.1, no issue (clang 10 is now the default since FreeBSD 11.4, should then be on 12.2). Very good news, many thanks @Chainfire 👍

@WayneD, let's then enforce --enable-simd (and ask to disable it if prerequisites are not found), as with the other new options ?

WayneD commented 4 years ago

I'll consider possibly making it default to enabled. Not sure yet. I've done a bunch of compiling on various host types, and the final release of 3.2.0 is getting pretty close now.

RsyncProject / rsync

build fails with SIMD checksum optimizations #13