Inefficient code generated for NEON function computing GNU symbol hash

Quuxplusone commented 5 years ago


Bugzilla Link	PR43810
Status	NEW
Importance	P enhancement
Reported by	Ryan Prichard (rprichard@google.com)
Reported on	2019-10-25 16:29:31 -0700
Last modified on	2019-10-29 14:01:51 -0700
Version	trunk
Hardware	PC Linux
CC	llvm-bugs@lists.llvm.org, ndesaulniers@google.com, pirama@google.com, smithp352@googlemail.com, srhines@google.com, Ties.Stuij@arm.com
Fixed by commit(s)
Attachments	`gnuhash-v1.tar.gz` (10563 bytes, application/gzip) `bench_intrinsics_ndk21_r365631c.s` (6457 bytes, text/plain) `bench_intrinsics_gcc8.s` (5164 bytes, text/plain)
Blocks
Blocked by
See also

Created attachment 22728
Archive of GNU hash function implementations and build/run scripts

I wrote a NEON-optimized version of a function that computes the GNU hash value
for a symbol name, and Clang's version of the function is slower than what GCC
generates (or what I can do with hand-written assembly).

I'm not quite sure what LLVM is doing that's making it slower. I did notice
that my hand-written assembly doesn't create a stack frame, whereas both GCC
and Clang need one.

Details:

I'm working on making the Bionic dynamic linker's GNU hash calculation faster,
because it takes a significant portion of the total linker run-time. (At one
point, I measured it taking 20% of the total run-time doing the initial linking
of cameraserver.)

The linker currently uses a simple function to calculate the hash.

uint32_t SymbolName::gnu_hash() {
  if (!has_gnu_hash_) {
    uint32_t h = 5381;
    const uint8_t* name = reinterpret_cast<const uint8_t*>(name_);
    while (*name != 0) {
      h += (h << 5) + *name++; // h*33 + c = h + h*32 + c = h + (h << 5) + c
    }

    gnu_hash_ = h;
    has_gnu_hash_ = true;
  }

  return gnu_hash_;
}
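
For anyone who wants to try the hash outside the linker, here is a minimal standalone sketch of the same loop with the SymbolName caching removed. The name gnu_hash_simple is hypothetical and is not a Bionic function.

```cpp
#include <cstdint>

// Standalone version of the loop above, without the has_gnu_hash_ cache.
// gnu_hash_simple is a hypothetical name used only for this sketch.
uint32_t gnu_hash_simple(const char* name) {
  const uint8_t* p = reinterpret_cast<const uint8_t*>(name);
  uint32_t h = 5381;
  while (*p != 0) {
    // h * 33 + c, identical to h += (h << 5) + c in 32-bit unsigned arithmetic.
    h = h * 33u + *p++;
  }
  return h;
}
```

An empty name returns the initial value 5381, and every additional byte multiplies the running value by 33 before adding the byte.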

Using hand-written arm32 Neon assembly, I wrote a function that takes 30-50%
less time than the simple C++ version. Using C++ code with Neon intrinsics
instead, I can write something that's still faster than the simple C++ version,
but it gets only about half the improvement when I compile it with Clang. GCC,
on the other hand, gets much closer to my hand-written assembly.
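
The actual Neon code is in the attached archive and is not reproduced here. As a rough, portable illustration of why this hash can be computed several bytes per step at all, the sketch below folds four bytes into the hash at once using powers of 33. It is an assumption about the general approach, not a copy of the attached implementation, and gnu_hash_blocked is a hypothetical name.

```cpp
#include <cstdint>

// Block-wise variant (hypothetical sketch): four applications of
// h = h*33 + c expand to
//   h*33^4 + c0*33^3 + c1*33^2 + c2*33 + c3   (mod 2^32),
// so the four byte terms no longer depend on each other.
uint32_t gnu_hash_blocked(const char* name) {
  const uint8_t* p = reinterpret_cast<const uint8_t*>(name);
  uint32_t h = 5381;
  constexpr uint32_t k33_2 = 33u * 33u;
  constexpr uint32_t k33_3 = k33_2 * 33u;
  constexpr uint32_t k33_4 = k33_3 * 33u;
  // Short-circuit evaluation stops at the terminator, so nothing past the
  // end of the string is read.
  while (p[0] != 0 && p[1] != 0 && p[2] != 0 && p[3] != 0) {
    h = h * k33_4 + p[0] * k33_3 + p[1] * k33_2 + p[2] * 33u + p[3];
    p += 4;
  }
  // Tail: fewer than four bytes remain before the terminator.
  while (*p != 0) {
    h = h * 33u + *p++;
  }
  return h;
}
```

The result matches the byte-at-a-time loop because unsigned 32-bit arithmetic wraps the same way in both forms. A SIMD version would compute the per-byte products in vector lanes rather than in scalar code as done here.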

Here are some numbers on an arm32-only Go phone. I used the "performance"
scaling governor. I used the https://tratt.net/laurie/src/multitime utility to
run benchmarks repeatedly and calculate confidence intervals.

Clang, simple C function: 0.441+/-0.0001 (in seconds of wall clock time)
GCC, simple C function: 0.376+/-0.0001
Clang, using Neon intrinsics: 0.373+/-0.0001 (Clang ignored pragma unroll)
GCC, using Neon intrinsics: 0.330+/-0.0001 (w/ no pragma GCC unroll)
GCC, using Neon intrinsics: 0.312+/-0.0003 (w/ pragma GCC unroll 8)
Handwritten assembly: 0.311+/-0.0001

I also looked at a walleye Pixel 2 device (core 4, one of the fast ones). For
arm32:

Clang, simple C function: 0.347+/-0.0023
GCC, simple C function: 0.323+/-0.0021
Clang, using Neon intrinsics: 0.225+/-0.0013
GCC, using Neon intrinsics: 0.208+/-0.0013 (w/ no pragma GCC unroll)
GCC, using Neon intrinsics: 0.186+/-0.0007 (w/ pragma GCC unroll 8)
Handwritten assembly: 0.176+/-0.0013
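
For reference, the relative savings implied by the arm32 timings above can be worked out with a small helper like the one below. The helper name time_saved is hypothetical. The only inputs are the wall clock figures already listed.

```cpp
// Fraction of the baseline time saved by a faster variant, e.g. 0.25 means
// the variant takes 25% less time than the baseline.
double time_saved(double baseline, double faster) {
  return (baseline - faster) / baseline;
}
```

Taking "Clang, simple C function" as the baseline, the Go phone numbers give roughly 29% saved for the handwritten assembly (0.441 vs 0.311) and roughly 15% for Clang with Neon intrinsics (0.441 vs 0.373). On the Pixel 2 the same comparison gives roughly 49% (0.347 vs 0.176) and roughly 35% (0.347 vs 0.225).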

I don't have handwritten assembly for arm64, but I benchmarked the C++ code.

Clang, simple C function: 0.308+/-0.0017
GCC, simple C function: 0.285+/-0.0018
Clang, using Neon intrinsics: 0.205+/-0.0016 (Clang ignored pragma unroll)
GCC, using Neon intrinsics: 0.189+/-0.0010 (w/ no pragma GCC unroll)
GCC, using Neon intrinsics: 0.217+/-0.0015 (w/ pragma GCC unroll 4)
GCC, using Neon intrinsics: 0.214+/-0.0004 (w/ pragma GCC unroll 8)

I attached a tarball with the source code, Makefile, and a couple of scripts
for running the benchmarks via adb.

I also uploaded three assembly files:
 - my hand-crafted arm32 assembly
 - the output from NDK r21 beta 1's compiler (Clang as of r365631)
 - the output from arm-linux-gnueabi-gcc-8 8.3.0 from my gLinux machine

Quuxplusone commented 5 years ago

Attached gnuhash-v1.tar.gz (10563 bytes, application/gzip): Archive of GNU hash function implementations and build/run scripts

Quuxplusone commented 5 years ago

Attached bench_intrinsics_ndk21_r365631c.s (6457 bytes, text/plain): bench_intrinsics_ndk21_r365631c.s

Quuxplusone commented 5 years ago

Attached bench_intrinsics_gcc8.s (5164 bytes, text/plain): bench_intrinsics_gcc8.s

Quuxplusone commented 5 years ago

This bug was originally filed as http://b/139510013 inside Google.

Inefficient code generated for NEON function computing GNU symbol hash #42780