Quuxplusone / LLVMBugzillaTest


ARMv7a: inefficient code generated from memcpy + bswap builtin #50588

Open Quuxplusone opened 3 years ago

Quuxplusone commented 3 years ago
Bugzilla Link PR51621
Status NEW
Importance P enhancement
Reported by Sami Liedes (sami.liedes@iki.fi)
Reported on 2021-08-25 06:26:57 -0700
Last modified on 2021-08-25 06:43:02 -0700
Version trunk
Hardware PC Linux
CC david.green@arm.com, llvm-bugs@lists.llvm.org, smithp352@googlemail.com, Ties.Stuij@arm.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
Also applies to: armv7-a clang 11.0.1

Godbolt: https://godbolt.org/z/9f1GKf5qe

Consider this code:

----------
#include <stdint.h>

uint32_t read_unaligned_memcpy_bswap_32(const uint8_t *buf, int offset) {
    uint32_t val;
    __builtin_memcpy(&val, buf+offset, 4);
    return __builtin_bswap32(val);
}

uint32_t read_unaligned_shift_add_32(const uint8_t *buf, int offset) {
    return (((uint32_t)buf[offset]) << 24) +
           (((uint32_t)buf[offset+1]) << 16) +
           (((uint32_t)buf[offset+2]) << 8) +
           (((uint32_t)buf[offset+3]) << 0);
}
----------

On many architectures, e.g. ARMv8, these produce identical and efficient code.
On ARMv7a, the __builtin_bswap32 version produces what looks like *worse* code
than the shift+add version (I admit I don't know the architecture well enough
to be sure, but the result has 14 instructions as opposed to 8):

read_unaligned_memcpy_bswap_32(unsigned char const*, int):
  ldrb r1, [r0, r1]!
  ldrb r2, [r0, #1]
  ldrb r3, [r0, #2]
  ldrb r0, [r0, #3]
  orr r1, r1, r2, lsl #8
  orr r0, r3, r0, lsl #8
  mov r2, #16711680
  orr r0, r1, r0, lsl #16
  mov r1, #65280
  and r1, r1, r0, lsr #8
  and r2, r2, r0, lsl #8
  orr r1, r1, r0, lsr #24
  orr r0, r2, r0, lsl #24
  orr r0, r0, r1
  bx lr

read_unaligned_shift_add_32(unsigned char const*, int):
  ldrb r1, [r0, r1]!
  ldrb r2, [r0, #1]
  ldrb r3, [r0, #2]
  ldrb r0, [r0, #3]
  lsl r2, r2, #16
  orr r1, r2, r1, lsl #24
  orr r1, r1, r3, lsl #8
  orr r0, r1, r0
  bx lr

The same applies to the 16-bit version (see the Godbolt link for the code), but
the difference is much less dramatic (also, for the 16-bit bswap version, trunk
generates one instruction more than 11.0.1 does; I don't know how significant
that is).
Quuxplusone commented 3 years ago
The default architecture for Arm is very old now, and you really have to
specify something newer. This is for -march=armv7-a, which is in line with what
I would expect:
https://godbolt.org/z/qnWrzGKMs

read_unaligned_memcpy_bswap_32(unsigned char const*, int):
  ldr r0, [r0, r1]
  rev r0, r0
  bx lr
read_unaligned_shift_add_32(unsigned char const*, int):
  ldr r0, [r0, r1]
  rev r0, r0
  bx lr
Quuxplusone commented 3 years ago

Ah, I see. That makes sense. Thanks!