aklomp / base64

Fast Base64 stream encoder/decoder in C99, with SIMD acceleration
BSD 2-Clause "Simplified" License

NEON32: enc: add inline asm codepath #91

aklomp closed this issue 2 years ago

aklomp commented 2 years ago

Some testing on my Raspberry Pi 2B 1.1 shows that GCC and Clang both generate pretty terrible code from NEON intrinsics.

For the NEON32 encoder, which is simpler than the x86 encoders, speed can be improved substantially by hand-coding the relatively simple inner loop in inline assembly. A quick proof of concept gets around 382 MB/s with inline assembly on GCC, against 209 MB/s for the status quo (roughly 1.8x). Clang is both worse and better than GCC: its inline assembly build reaches only 304 MB/s, which is slower than GCC's, but its status quo is already at 294 MB/s, which is faster than GCC's. Both compilers see an improvement, so I think this should be added.
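For readers unfamiliar with what that inner loop has to do, here is a minimal scalar sketch of the core step: splitting three input bytes into four 6-bit indices and looking each one up in the Base64 alphabet. This is not the library's code and not the inline assembly discussed in this issue. The names `enc_reshuffle_scalar` and `enc_groups_scalar` are hypothetical, and the sketch handles one 3-byte group at a time, whereas a NEON implementation would apply the same shifts and masks to many lanes at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard Base64 alphabet (RFC 4648). */
static const char b64_table[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Hypothetical helper: split three input bytes into four 6-bit indices.
 * A SIMD encoder performs the same shifts and masks on every lane. */
static void enc_reshuffle_scalar(const uint8_t in[3], uint8_t out[4])
{
    out[0] = in[0] >> 2;                            /* top 6 bits of byte 0 */
    out[1] = ((in[0] << 4) | (in[1] >> 4)) & 0x3F;  /* low 2 of b0, top 4 of b1 */
    out[2] = ((in[1] << 2) | (in[2] >> 6)) & 0x3F;  /* low 4 of b1, top 2 of b2 */
    out[3] = in[2] & 0x3F;                          /* low 6 bits of byte 2 */
}

/* Hypothetical helper: encode only whole 3-byte groups (no padding).
 * Returns the number of characters written to dst, which must have room
 * for 4 characters per input group. Leftover bytes are ignored. */
static size_t enc_groups_scalar(const uint8_t *src, size_t srclen, char *dst)
{
    size_t written = 0;

    while (srclen >= 3) {
        uint8_t idx[4];

        enc_reshuffle_scalar(src, idx);
        for (int i = 0; i < 4; i++)
            dst[written++] = b64_table[idx[i]];

        src += 3;
        srclen -= 3;
    }
    return written;
}
```

Calling `enc_groups_scalar` on the three bytes of "Man" writes the four characters "TWFu". The shift-and-mask pattern is short enough that hand-scheduling it in assembly is realistic, which is presumably why the encoder is a good candidate for this treatment.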

aklomp commented 2 years ago

Tests of the merged code on my Pi show that GCC and Clang now both reach 382 MB/s.