aklomp / base64

Fast Base64 stream encoder/decoder in C99, with SIMD acceleration
BSD 2-Clause "Simplified" License

Add SSE2 dec_loop.c #46

Closed: aqrit closed this 2 years ago

aqrit commented 6 years ago

Use saturation for range check.

This commit is wholly dead code. In the future, it would be useful anywhere the PSHUFB instruction is slow (https://github.com/aklomp/base64/pull/44) or not supported.
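
For reference, the saturation trick looks roughly like this (a minimal sketch of the idea with an illustrative helper name, not the PR's actual dec_loop code):

```c
#include <emmintrin.h> /* SSE2 */

/* A saturating subtract clamps at zero, so bytes above `hi` or below `lo`
 * leave a nonzero residue. That lets us range-check 16 bytes at once
 * without PSHUFB. */
static inline int
all_in_range (__m128i v, char lo, char hi)
{
	__m128i above = _mm_subs_epu8(v, _mm_set1_epi8(hi)); /* zero iff v <= hi */
	__m128i below = _mm_subs_epu8(_mm_set1_epi8(lo), v); /* zero iff v >= lo */
	__m128i bad   = _mm_or_si128(above, below);

	return _mm_movemask_epi8(_mm_cmpeq_epi8(bad, _mm_setzero_si128())) == 0xFFFF;
}
```

A real decoder combines several such checks, since the Base64 alphabet spans multiple disjoint ranges.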

aqrit commented 6 years ago

I'm unsure about the best way to hook it up.

I would advocate for a minimalist CPU detection library (https://github.com/aklomp/base64/issues/13), maybe the one from the x265 project?

aklomp commented 4 years ago

What's interesting is that SSE2 is a mandatory part of the AMD64 architecture, so we could use this codec as a fallback instead of the plain decoder when we detect at compile time that we're compiling for AMD64.

As for CPU detection, I want to tackle this in the future, but was thinking about using GCC's builtin-cpu family of functions instead of maintaining something myself.
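
Putting those two ideas together, a hedged sketch of what the fallback plus runtime dispatch could look like (the codec entry points here are hypothetical stubs, not this library's API):

```c
#include <stdio.h>

static void decode_ssse3 (void) { puts("SSSE3 codec"); }
static void decode_sse2  (void) { puts("SSE2 codec");  }
static void decode_plain (void) { puts("plain codec"); }

static void
decode_dispatch (void)
{
#if defined(__x86_64__)
	/* SSE2 is mandatory on AMD64, so it is a safe compile-time floor: */
	__builtin_cpu_init();
	if (__builtin_cpu_supports("ssse3"))
		decode_ssse3();
	else
		decode_sse2();
#else
	decode_plain();
#endif
}

int
main (void)
{
	decode_dispatch();
	return 0;
}
```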

htot commented 4 years ago

As for CPU detection, I want to tackle this in the future, but was thinking about using GCC's builtin-cpu family of functions instead of maintaining something myself.

We might use CPU detection to tweak the algorithm for performance as well; see benchmarks.

aqrit commented 4 years ago

I wonder if the new packing code is slower on old (e.g. Atom) hardware as well. It trades logic/shift operations for multiplies.

Does someone want to try it with the old dec_reshuffle()?
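
For context, the multiply-based packing in question is Wojciech Muła's technique, roughly as below (sketched for illustration; check the actual dec_reshuffle before benchmarking):

```c
#include <tmmintrin.h> /* SSSE3 */

/* Each input byte holds one decoded 6-bit value; two multiply-add steps
 * fuse them into 24-bit groups, replacing the older shift/and/or chain. */
static inline __m128i
pack_sextets (__m128i in)
{
	/* Byte pair [a, b] -> (a << 6) | b in each 16-bit lane: */
	__m128i merged = _mm_maddubs_epi16(in, _mm_set1_epi32(0x01400140));

	/* 16-bit pair [x, y] -> (x << 12) | y in each 32-bit lane: */
	return _mm_madd_epi16(merged, _mm_set1_epi32(0x00011000));
}
```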

Also, Atom should probably be using aligned loads.

edit: this is being built 64-bit, right?


I assume people have seen the new competition? https://github.com/powturbo/TurboBase64/blob/master/index.md https://github.com/WojciechMula/base64-avx512

aklomp commented 4 years ago

I did see the new breakthroughs by @WojciechMula and @lemire, it's what piqued my interest in working on this library again. Their AVX512 codec is impressive work. I'd like to add it to this library some day, but the problem is that it currently requires exotic hardware to test and verify the code. So it's something for the longer term.

I did see @powturbo's code. While their plain codec is indeed faster than our plain codec, their other claims don't really hold water. I did the benchmarks myself and SIMD still wins in the end. Notice that our NEON codecs are mysteriously missing from their benchmarks, and the AVX2 codecs only get a mention but no numbers.

aqrit commented 4 years ago

This patch needs tweaking for efficient operation with only 8 XMM registers.

htot commented 4 years ago

@powturbo's benchmarks say our decoding is slower than our encoding, the opposite of our own benchmarks. Somewhere, something must be wrong.

aklomp commented 4 years ago

I pushed a quick-and-dirty branch called turbobase64 which integrates the TurboBase64 codec into our benchmarks. You can build it yourself or see some results in the CI here. I didn't hook their codec up to our test suite yet.

The result: their plain codec is faster than ours (if it also does all necessary validity checks), but is no match for the SIMD codecs by a long shot. Don't believe everything you read on the Internet.

Agreed about the decoding anomaly; our decoder should be faster than the encoder. I'd like to see a reproducible process for generating those benchmarks.

powturbo commented 4 years ago

Well, TurboBase64 is a small experiment that has existed since 2016. It's strange that you've not paid any attention to it since then. The fastbase64 people have also tested my library and not complained about the results. Nobody has included TurboBase64 in benchmarks since then.

I primarily did the benchmarks against fastbase64, which included other Base64 functions from Chromium, Apple QuickTime, Linux, and so on. At the end, I included the SSE benchmark from your library. Recently, I adapted TurboBase64 to ARM and did some benchmarks against the NEON implementation in fastbase64. I've not redone all the other benchmarks (same CPU). As your library seemed unmaintained for a long time, I'm not aware of any improvements made. At the time of benchmarking, the results were 100% correct: scalar was indeed faster than other scalar libraries, and decoding was even faster than SSE with the hardware and compilers of that time. You are invited to verify the benchmark under the same conditions.

The goal is not to propose a library faster than everything, but to show that scalar can also be improved to be competitive with, or possibly faster than, other SIMD implementations. I can simply include the famous SIMD work of Wojciech Muła if I want to add other SSE or AVX2 functions to TurboBase64. Detecting the CPU used and switching between the functions is trivial work.

Edit: Your CI results show TurboBase64 decoding 2x faster than base64's SIMD NEON codec.

aklomp commented 4 years ago

Hi @powturbo, nice to see you've joined the discussion! You've done some impressive work on your library. Your plain codec is a speed monster, and I'm a bit envious of it actually. I just think you're overselling its speed a bit when compared to codecs that use a SIMD approach. It's like saying you have the fastest bicycle in a world where motorcycles exist. It's still a great achievement, and it has the potential to become a very portable, "fast everywhere" library.

I was unaware of your codec until your comments on HN some weeks ago. Then again, as you point out, I haven't been very active in the Base64 field since 2016. Many other hobbies to explore and lives to lead, etc. I've decided to start working on this library again because I left a bunch of loose ends a few years ago and I want to clean them up as best I can, so that I can hopefully "complete" this library and move on in good conscience.

I did notice that your decoder does not check the validity of the input. Not every string is a valid Base64 string. That's perhaps fine for the library's intended use cases, but it's also a shortcut that makes it easier to create a fast table-based algorithm. I was unable to hook up your decoder to our test suite because it does not output such an "input is bad" flag.
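
To illustrate the kind of check I mean (a scalar sketch under my own naming, not this library's actual tables or API; padding handling omitted):

```c
#include <stddef.h>
#include <stdint.h>

/* Invalid bytes map to a 0xFF sentinel, so the decoder can report bad
 * input instead of silently producing garbage. */
static uint8_t
decode_char (uint8_t c)
{
	if (c >= 'A' && c <= 'Z') return (uint8_t) (c - 'A');
	if (c >= 'a' && c <= 'z') return (uint8_t) (c - 'a' + 26);
	if (c >= '0' && c <= '9') return (uint8_t) (c - '0' + 52);
	if (c == '+') return 62;
	if (c == '/') return 63;
	return 0xFF; /* sentinel: not in the Base64 alphabet */
}

/* Returns 1 if every byte is a valid Base64 character, 0 otherwise. */
static int
is_valid_base64 (const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (decode_char(src[i]) == 0xFF)
			return 0;
	return 1;
}
```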

As for the issue of benchmarking, it's interesting that your benchmark concludes that our decoding is slower than our encoding. Normally it's the other way around, and when looking at the code it should be obvious why. I'd like to see why this is, and rerun the benchmarks, but I can't seem to find your benchmarking code.

You're right about the situation with Neon64. I noticed that myself when I added cross-platform CI builds a few days ago. (I don't own a Neon64-capable device that I can test on.) I'd like to know what hardware Travis CI uses for their Aarch64 machines. It's possible that it's not even real ARM hardware, but emulation, in which case the results will naturally be poor. Note that the Neon32 benchmarks are for sure run through Qemu (by our CI script), so they are not indicative of actual performance.

If true, the Neon64 numbers are worrying, because those codecs apparently have very good performance on the iPhone. Maybe the performance hit is due to poor code generation by the compiler, or something else like register pressure. I'm considering rewriting parts of the codec in inline assembler to see if it makes a difference. I'm also considering buying a cheap Aarch64 board so that I have something to test on. Until then, your library looks to be much faster on that platform.

htot commented 4 years ago

I'd like to know what hardware Travis CI uses for their Aarch64 machines.

lxd-arm64: according to the docs, the build ran within an LXD container on Arm64-based infrastructure (currently delivered by Packet).

mayeut commented 4 years ago

@aklomp, the poor performance in NEON is probably due to the architectural differences between an iPhone (very good at lookups and vldX/vstX in NEON) and whatever travis-ci uses (I don't know if they use ThunderX or Ampere chips, but I know that ThunderX is bad at those operations). libjpeg-turbo uses different implementations depending on the actual architecture.

powturbo commented 4 years ago

Hi @aklomp,

thank you.

I'm only interpreting the benchmark numbers, and indeed the other SSE Base64 implementations weren't so fast at benchmark time. Scalar was faster, but you may get different results depending on the hardware used, because the difference was not that big (see benchmarks). I've also benchmarked base64simd and the results were similar.

It's not unusual to see scalar code beating (unoptimized) SSE and even AVX2 (see e.g. https://github.com/powturbo/TurboHist).

As stated in the readme, the benchmarks are done with turbobench:

1. Clone turbobench.
2. Download base64, turbobase64 and fastbase64 into the turbobench directory.
3. Compile the base64 library with the provided base64 makefile (output: libbase64.o).
4. Type "make BASE64=1" in the turbobench directory.
5. Type "./turbobench -eturbob64/b64_ssse3/b64_sse41 file -I15 -J15".

When you use the base64 version that was benchmarked (early 2017, I think), you will see that decoding is a lot slower than encoding for base64 SSE. You can also verify this with your own benchmark tools.

Travis is using Packet ARM servers (Ampere eMAG); see the review.

powturbo commented 4 years ago

Hi, I've now benchmarked base64 using the latest version, including AVX2. In my benchmarks, with the CPU used, SSE encoding is still faster than decoding.

htot commented 4 years ago

No doubt there is a difference in the measurement method. We already found that the results depend on buffer size, threads used, CPU type, and memory speed. I imagine that with your benchmark the effect of the cache is eliminated (large buffer, random data). This might be comparable to Edison, which has no cache at all. And indeed, on Edison, SSSE3 encoding is faster than decoding. See (my slightly outdated) benchmarks. If so, turning on multi-threading will speed up plain, but not change SSSE3 and AVX2.

mayeut commented 4 years ago

No doubt there is a difference in the measurement method.

Sure: the base64 benchmark uses the input size for both functions (decoded/raw byte count for encoding, encoded byte count for decoding), while the turbobase64 bench seems to always use the decoded/raw byte count.
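
To put rough numbers on that (assuming the standard 4:3 expansion): decoding 4 MB of Base64 yields 3 MB of raw data, so dividing the same elapsed time by the raw byte count reports only 3/4 of the throughput that dividing by the encoded byte count would. A difference of that size in the normalization alone is enough to flip an encode-versus-decode comparison.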

aklomp commented 4 years ago

@powturbo Thanks for doing the benchmarks. The difference between the encoder and decoder is more in line now with what I would expect. If you ran the tests on a fresh checkout, it may be because of some codec improvements.

I'm still impressed by the speed of your plain codec. I've been playing with my own 12-bit lookup table code over the past few days and found it impossible to beat yours, even when I pull out all the tricks like loop unrolling. I'll push that code soon because halving the number of table lookups makes a massive performance difference.
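
One way to halve the lookups, sketched under my own assumptions (an illustration, not the code I'm referring to above): resolve two input characters per table access.

```c
#include <stdint.h>

/* A 65536-entry table indexed by a 16-bit character pair yields both
 * decoded sextets at once; 0xFFFF flags a pair containing an invalid
 * character. */
static uint16_t pair_table[65536];

static void
build_pair_table (const uint8_t dec[256]) /* dec[c] = sextet, or 0xFF if invalid */
{
	for (int a = 0; a < 256; a++)
		for (int b = 0; b < 256; b++)
			pair_table[(a << 8) | b] =
				(dec[a] == 0xFF || dec[b] == 0xFF)
					? 0xFFFF
					: (uint16_t) ((dec[a] << 6) | dec[b]);
}

/* Decode one 24-bit group (four chars -> three bytes) with two lookups
 * instead of four. Returns 0 on invalid input. */
static int
decode_group (const uint8_t src[4], uint8_t out[3])
{
	uint16_t hi = pair_table[(src[0] << 8) | src[1]];
	uint16_t lo = pair_table[(src[2] << 8) | src[3]];

	if (hi == 0xFFFF || lo == 0xFFFF)
		return 0;

	uint32_t v = ((uint32_t) hi << 12) | lo;
	out[0] = (uint8_t) (v >> 16);
	out[1] = (uint8_t) (v >> 8);
	out[2] = (uint8_t)  v;
	return 1;
}
```

A trade-off like that (the table is 128 KB, far larger than L1 cache) is exactly the kind of thing that behaves differently across CPUs.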

The benchmarking in this library was always a bit of an afterthought. The benchmarking program was quickly thrown together, without much regard for eliminating forms of bias or much thought about what to measure to get meaningful and repeatable results. It should probably be on the long list of things that need work. The variance in the results can be frustrating: sometimes it's hard to tell whether a change is really an improvement or a regression.

Codec performance varies enormously across processors, that's for sure. Code that is fast on one architecture is slow on another, as demonstrated most clearly by the NEON64 code. I have plans to add a new optional codec_choose backend which uses GCC's __builtin_cpu functions to detect the x86 architecture at runtime. That should make it possible to have fine-grained control over which codec runs where. (And this closes the circle with the topic of this PR :)

(In other news, I just released v0.4.0. It's a release of the status quo, so that existing users can anchor to this version while we have the freedom to make potentially breaking changes.)