Some testing on my Raspberry Pi 2B 1.1 shows that GCC and Clang both generate pretty terrible code from NEON intrinsics.
The NEON32 encoder is simpler than the x86 encoders, and its speed can be improved substantially by hand-coding the relatively simple inner loop in inline assembly. A quick proof of concept gets around 382 MB/s with inline assembly on GCC, against 209 MB/s for the status quo. Clang does both worse and better than GCC: 304 MB/s for the inline assembly (slower than GCC's 382 MB/s) and 294 MB/s for the status quo (faster than GCC's 209 MB/s). Both compilers see an improvement, so I think this should be added.