AVX2: enc: add inline asm codepath

As was done for NEON32 in #91 and for NEON64 in #92, we can get a pretty large speedup for the AVX2 encoder by implementing the inner loop in inline assembly for compilers that support it. Testing on my machine (i5-4590S) with a proof-of-concept branch shows around a 33% speed improvement (!).

This is achieved in the same way that we handle the NEON encoders: split the encoder into assembly "recipes" for translation and shuffling, interleave them with loads and stores, and keep three sets of data in flight in parallel inside large unrolled loops. It's basically the code that we'd hope the compiler would generate for us if it were clever enough.

The drawback I see to adopting this approach is an increase in complexity and a loss of transparency in this library, because generating inline assembly code with C macros is a bit gnarly. But on the other hand, those speed gains don't lie. And this would be purely additive: the codepath would only be taken on compilers that support it, and the normal implementation would remain available.

The advantage is the large speedup, of course, and also the fact that the implementation is not too crazy when it's laid side by side with the intrinsics version. It's basically the same algorithm in a different expression.

AVX2: enc: add inline asm codepath #104