Closed Quuxplusone closed 5 years ago
Bugzilla Link | PR41035 |
Status | RESOLVED FIXED |
Importance | P normal |
Reported by | Adhemerval Zanella (adhemerval.zanella@linaro.org) |
Reported on | 2019-03-11 12:50:47 -0700 |
Last modified on | 2019-04-08 11:29:22 -0700 |
Version | trunk |
Hardware | PC Linux |
CC | arnd@linaro.org, ckennelly@google.com, clement.courbet@gmail.com, efriedma@quicinc.com, gchatelet@google.com, hfinkel@anl.gov, jyknight@google.com, llozano@chromium.org, llvm-bugs@lists.llvm.org, manojgupta@google.com, ndesaulniers@google.com, srhines@google.com, szabolcs.nagy@arm.com |
Fixed by commit(s) | |
Attachments | |
Blocks | PR4068 |
Blocked by | |
See also |
Hi Adhemerval,
Thanks for reporting.
> This can be fixed on the Linux kernel side, but it will require adding -fno-builtin-bcmp and rolling this out on all supported releases. However, I really think we should disable such transformations for -nostdinc.
I believe the right way to address this would be to actually use -ffreestanding, which if I understand correctly is really meant for this. Why is linux not using it btw ?
It seems weird to use -nostdinc for anything other than dealing with headers, but I'm not very familiar with these options. Eli, any opinion on this ?
> Also, as jyknight has stated on https://reviews.llvm.org/D56593, this optimization is currently also not a performance gain on Linux (in glibc bcmp is a weak alias to memcmp, while in musl it is a tail call to memcmp).
Guillaume (cced) is working on getting bcmp into the glibc.
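For illustration, the musl-style approach mentioned above amounts to a one-line forwarder (a sketch with a hypothetical name, not the actual musl source):

```c
#include <string.h>

/* Sketch: bcmp implemented as a plain tail call to memcmp, as musl
 * does. Any nonzero memcmp result is a valid bcmp result, because
 * bcmp only promises a zero/nonzero answer, not an ordering.
 * my_bcmp is a hypothetical name used here for illustration. */
int my_bcmp(const void *s1, const void *s2, size_t n)
{
    return memcmp(s1, s2, n);
}
```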
Side note:
We're spending 60% of our linux memcmp() cycles in the two functions below,
which could use bcmp(), so Linux might want to actually provide it :) It's
still a very small portion of total kernel cycles (<0.1%) though, and will not
solve the problem you're mentioning for older kernels...
43% of memcmp cycles here:
https://github.com/torvalds/linux/blob/ea295481b6e313b4ea3ca2720ffcafd6005b5643/net/ipv6/ip6_offload.c#L264
and 17% here:
https://github.com/torvalds/linux/blob/a978a5b8d83f795e107a2ff759b28643739be70e/kernel/bpf/hashtab.c#L450
(In reply to Clement Courbet from comment #1)
> Hi Adhemerval,
>
> Thanks for reporting.
>
> > This can be fixed on Linux kernel side, but it will require to add
> > -fno-builtin-bcmp and rollout this on all support releases. However I
> > really think we should disable such transformation for -nostdinc.
>
> I believe the right way to address this would be to actually use
> -ffreestanding, which if I understand correctly is really meant for this.
> Why is linux not using it btw ?
My understanding from Linux commit 6edfba1b33c701108717f4e036320fc39abe1912 is
that they want the compiler optimizations that builtin transformations make
possible, while avoiding possibly wrong transformations.
I also noted that some architectures re-enable it (m68k, for instance, with
28713169d879b67be2ef2f84dcf54905de238294, and mips with
72fbfb260197a52c2bc2583f3e8f15d261d0f924).
I am ccing Arnd Bergmann, who has dealt with some issues regarding this on
Linux, to check the kernel's position regarding -ffreestanding.
>
> It seems weird to use -nostdinc for anything other than dealing with
> headers, but I'm not very familiar with these options. Eli, any opinion on
> this ?
>
> > Also, as jyknight has stated on https://reviews.llvm.org/D56593, this
> > optimization currently is also not a performance gain on Linux (on glibc it
> > is a weak alias to memcmp, while on musl is a tail call to memcmp).
>
> Guillaume (cced) is working on getting bcmp into the glibc.
I am skeptical about the introduction of either another memory-compare symbol
or an optimized bcmp in the C runtime. memcmp is one of the most important
symbols provided by libc, and vendors have therefore spent considerable time
optimizing it with custom implementations.
x86_64, for instance, provides 4 different memcmp implementations, so I would
expect that for bcmp to really be a gain we would need to also provide 4
different implementations using similar strategies (a default SSE2, an SSSE3,
an SSE4_1, and an AVX one). And this is just x86_64; other architectures also
provide different implementations for each ISA and/or extension, and AArch64
now provides custom-tuned memcmp implementations for different chip vendors.
If the compiler now starts to emit builtin calls to another memory-compare
symbol, vendors will need to put extra effort into tuning yet another symbol.
Do we have real-world numbers showing this is indeed a net gain? Do the two
kernel snippets where memcmp cannot be inlined really justify such an
optimization?
> Do the two kernel snippets where memcmp cannot be inlined really justify such an optimization? Do we have real-world numbers showing this is indeed a net gain?
As mentioned above, in our case the Linux kernel spends less than 0.1% of cycles in memcmp, so this is not a very big deal for us. What I was trying to get at with this example is that the pattern memcmp() == 0
appears in code quite regularly.
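To make the pattern concrete, here is a hypothetical before/after of what the transformation does at a call site (the struct and function names are illustrative; real codegen differs):

```c
#include <string.h>   /* memcmp */
#include <strings.h>  /* bcmp (POSIX/BSD header) */
#include <stdbool.h>

struct key { unsigned char bytes[16]; };

/* What the source says: only the zero/nonzero result of memcmp is used. */
bool keys_equal_source(const struct key *a, const struct key *b)
{
    return memcmp(a->bytes, b->bytes, sizeof a->bytes) == 0;
}

/* What the optimizer may emit instead: bcmp has the same zero-iff-equal
 * contract but need not compute an ordering, so it can be cheaper. */
bool keys_equal_lowered(const struct key *a, const struct key *b)
{
    return bcmp(a->bytes, b->bytes, sizeof a->bytes) == 0;
}
```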
Where we do spend a lot of cycles in memcmp is userland code. That's why we care about bcmp.
I'll let Guillaume chime in with some numbers here.
The generic memcmp() implementation in the kernel is not optimized at all and just uses a loop over single-byte loads, so any architectures using that would likely not benefit from using bcmp() instead.
The architectures that provide their own memcmp() today have all implemented it in assembler: arc, arm64, csky, powerpc, s390 and sparc. I suppose those could benefit, but an optimized bcmp() would have to be written for each one by someone with knowledge of those architectures first.
I would suggest skipping that optimization entirely in the kernel, either by adding an alias to memcmp() in lib/string.c, or by passing -fno-builtin-bcmp.
For the code linked to in comment #2, I notice that gcc will happily replace the memcmp() with an inlined version on x86 (not on arm), while clang decides to call the extern version: https://godbolt.org/z/pnkGII
As an example of how the semantics of bcmp simplify implementation: The generic implementation of the kernel memcmp() is:
__visible int memcmp(const void *cs, const void *ct, size_t count)
{
    const unsigned char *su1, *su2;
    int res = 0;

    for (su1 = cs, su2 = ct; 0 < count; ++su1, ++su2, count--)
        if ((res = *su1 - *su2) != 0)
            break;
    return res;
}
EXPORT_SYMBOL(memcmp);
Unrolling this in a target-agnostic way is not trivial because of endianness issues, but the semantics of bcmp allow it to be trivially unrolled in a generic manner, sizeof(long) bytes at a time, with something like:
__visible int bcmp(const void *cs, const void *ct, size_t count)
{
    const unsigned char *c1, *c2;
    const unsigned long *l1, *l2;

    for (l1 = cs, l2 = ct; sizeof(*l1) <= count; ++l1, ++l2, count -= sizeof(*l1))
        if (*l1 != *l2)
            return 1;
    for (c1 = (const unsigned char *)l1, c2 = (const unsigned char *)l2;
         0 < count; ++c1, ++c2, count--)
        if (*c1 != *c2)
            return 1;
    return 0;
}
EXPORT_SYMBOL(bcmp);
I would expect (but I have not measured it) the latter to be faster than the former on most sizes, as it typically does 8 times fewer memory accesses and comparisons at the cost of one additional branch.
I would suggest skipping that optimization entirely in the kernel, either by adding an alias to memcmp() in lib/string.c, or by passing -fno-builtin-bcmp.
SGTM.
memcmp has to scan the buffers linearly to find the first mismatch because it
defines an ordering. bcmp does not have to and it allows for an interesting
optimization.
Let's take the example of sorted strings. Comparing adjacent strings will
compare equal for most of the string (since they are sorted), but bcmp can read
the first and last words of both buffers and report a mismatch if they differ,
i.e. memcmp: O(n), bcmp: O(1) for such a data set.
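That first/last-word early exit can be sketched as follows (hypothetical code; alignment is handled via memcpy, and real implementations are more careful about overlap and sizes):

```c
#include <string.h>
#include <stddef.h>

/* bcmp_sketch is a hypothetical equality-only compare illustrating
 * the early exit: check the first and last words before scanning.
 * For data like adjacent sorted strings, the last word usually
 * differs, so the function returns without reading the middle. */
int bcmp_sketch(const void *s1, const void *s2, size_t n)
{
    const char *p1 = s1, *p2 = s2;
    unsigned long a, b;

    if (n < sizeof a) {                /* byte loop for short inputs */
        while (n--)
            if (*p1++ != *p2++)
                return 1;
        return 0;
    }
    memcpy(&a, p1, sizeof a);          /* first words */
    memcpy(&b, p2, sizeof b);
    if (a != b)
        return 1;
    memcpy(&a, p1 + n - sizeof a, sizeof a);   /* last words */
    memcpy(&b, p2 + n - sizeof b, sizeof b);
    if (a != b)
        return 1;
    return memcmp(p1, p2, n) != 0;     /* full scan only if both matched */
}
```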
Now, bcmp also does less work than memcmp (no need to compute a byte ordering),
so the implementation is shorter: this is great for instruction-cache pressure.
Our implementation using SSE is 693B, whereas __memcmp_sse4_1 is 5694B.
Some numbers on the current (not final) implementation, as the ratio of bcmp
over memcmp for sizes from 0 to 2KiB on Sandybridge, Haswell and Skylake:
- when buffers compare equal, up to +20%.
- when buffers differ at the tail, the gain is linear in the buffer's length (+2000% for 2KiB buffers)
- when buffers differ in the middle, the gain is up to +120%
This is still WIP but we plan to contribute this back to glibc in the following
months.
(In reply to Clement Courbet from comment #6)
As an example of how the semantics of bcmp simplify implementation: The generic implementation of the kernel memcmp() is:
__visible int memcmp(const void *cs, const void *ct, size_t count)
{
    const unsigned char *su1, *su2;
    int res = 0;

    for (su1 = cs, su2 = ct; 0 < count; ++su1, ++su2, count--)
        if ((res = *su1 - *su2) != 0)
            break;
    return res;
}
EXPORT_SYMBOL(memcmp);
unrolling this in a target-agnostic way is not trivial because of endianness issues, but the semantics of bcmp allow it to be trivially unrolled in a generic manner sizeof(long) times with something like:
__visible int bcmp(const void *cs, const void *ct, size_t count)
{
    const unsigned char *c1, *c2;
    const unsigned long *l1, *l2;

    for (l1 = cs, l2 = ct; 0 < count; ++l1, ++l2, count -= sizeof(l1))
        if (*l1 != *l2)
            return 1;
    for (c1 = l1, c2 = l2; 0 < count; ++c1, ++c2, count--)
        if (*c1 != *c2)
            return 1;
    return 0;
}
EXPORT_SYMBOL(bcmp);
I am not sure it is that hard to unroll the loop in a platform-agnostic way:
#if __BYTE_ORDER == __BIG_ENDIAN
# define cmp_lt_or_gt(a, b) ((a) > (b) ? 1 : -1)
#else
# define cmp_lt_or_gt(a, b) ((__builtin_bswap64 (a) > __builtin_bswap64 (b) ? 1 : -1))
#endif
__visible int bcmp(const void *cs, const void *ct, size_t count)
{
    const unsigned char *c1, *c2;
    const unsigned long *l1, *l2;

    for (l1 = cs, l2 = ct; sizeof(*l1) <= count; ++l1, ++l2, count -= sizeof(*l1))
        if (*l1 != *l2)
            return cmp_lt_or_gt (*l1, *l2);
    for (c1 = (const unsigned char *)l1, c2 = (const unsigned char *)l2;
         0 < count; ++c1, ++c2, count--)
        if (*c1 != *c2)
            return *c1 > *c2 ? 1 : -1;
    return 0;
}
This assumes the architecture has a fast byte-swap instruction, and I believe it can be tuned better, but I don't think this is an intrinsically hard problem to tackle.
I would expect (but I have not measured it) the latter to be faster than the former on most sizes, as it typically does 8 times fewer memory accesses and comparisons at the cost of one additional branch.
I would suggest skipping that optimization entirely in the kernel, either by adding an alias to memcmp() in lib/string.c, or by passing -fno-builtin-bcmp.
SGTM.
(In reply to Guillaume Chatelet from comment #7)
> memcmp has to scan the buffers linearly to find the first mismatch because
> it defines an ordering. bcmp does not have to and it allows for an
> interesting optimization.
For a generic implementation to be C-agnostic I would tend to agree, but it is
quite usual for arch-specific implementations (or even the generic
implementation in some cases) to rely on useful properties that are not defined
by the standard (for instance, reading past the final byte using word or vector
operations).
>
> Let's take the example of sorted strings. Comparing adjacent strings will
> compare equal for most of the string (since they are sorted) but bcmp can
> read the first and last words of both buffers, and return false if they
> differ. i.e. memcmp: O(n), bcmp: O(1) for this such a data set.
Right, this strategy indeed would require more work with memcmp.
>
> Now bcmp is also doing less work than memcmp (no need to compare bytes) so
> implementation is shorter: this is great for Instruction Cache pressure.
> Our implementation using sse is 693B where __memcmp_sse4_1 is 5694B.
>
> Some numbers on the current (not final) implementation, ratio of bcmp over
> memcmp
> for size from 0 to 2KiB for Sandybridge, Haswell and Skylake.
> - when buffers compare equal, up to +20%.
> - when buffers differ at the tail, the gain is linear with the buffer's
> length (+2000% for 2KiB buffers)
> - when buffers differ at the middle, the gain is up to 120%
Which memcmp exactly are you comparing against for this evaluation?
>
> This is still WIP but we plan to contribute this back to glibc in the
> following months.
I do see some gain using this strategy, but again, this will be another
x86_64-specific optimization that will take time and effort for other
architectures to be on par with.
(In reply to Adhemerval Zanella from comment #9)
> (In reply to Guillaume Chatelet from comment #7)
> > memcmp has to scan the buffers linearly to find the first mismatch because
> > it defines an ordering. bcmp does not have to and it allows for an
> > interesting optimization.
>
> For a generic implementation to be C agnostic I would tend to agree, but it
> is quite usual for arch-specific implementations (or even generic
> implementation in some cases) to allow for some useful specifications that
> would not defined by standard (for instance read past the find byte using
> word or vector operations).
I concur with you. Our bcmp implementation uses the same galloping techniques.
My point is, by definition memcmp has to read at least all the bytes before the
first mismatch, bcmp doesn't have to.
> >
> > Now bcmp is also doing less work than memcmp (no need to compare bytes) so
> > implementation is shorter: this is great for Instruction Cache pressure.
> > Our implementation using sse is 693B where __memcmp_sse4_1 is 5694B.
> >
> > Some numbers on the current (not final) implementation, ratio of bcmp over
> > memcmp
> > for size from 0 to 2KiB for Sandybridge, Haswell and Skylake.
> > - when buffers compare equal, up to +20%.
> > - when buffers differ at the tail, the gain is linear with the buffer's
> > length (+2000% for 2KiB buffers)
> > - when buffers differ at the middle, the gain is up to 120%
>
> Which memcmp exactly are you comparing against for this evaluation?
__memcmp_sse4_1 from glibc 2.19.
>
> >
> > This is still WIP but we plan to contribute this back to glibc in the
> > following months.
>
> I do see some gain using this strategy, but again this will be another
> x86_64 specific optimization that will take time and effort for other
> architectures to be on par with.
Actually the code is quite generic and can be easily adapted to other
architectures but you're right, for now, it has been optimized for x86_64.
Now I don't think we should discard a good optimization because it does not
apply to all architectures at once.
(In reply to Adhemerval Zanella from comment #8)
(In reply to Clement Courbet from comment #6)
I am not sure if it is that hard to unroll the loop in platform agnostic way:
#if __BYTE_ORDER == __BIG_ENDIAN # define cmp_lt_or_gt(a, b) ((a) > (b) ? 1 : -1) #else # define cmp_lt_or_gt(a, b) ((__builtin_bswap64 (a) > __builtin_bswap64 (b) ? 1 : -1)) #endif
The kernel defines a cpu_to_le32p()/cpu_to_le64p() helper that would be used here, but it would still have to do this separately for 32-bit and 64-bit, and it may have to worry about alignment.
__visible int bcmp(const void *cs, const void *ct, size_t count)
{
    const unsigned char *c1, *c2;
    const unsigned long *l1, *l2;

    for (l1 = cs, l2 = ct; sizeof(l1) < count; ++l1, ++l2, count -= sizeof(l1))
        if (*l1 != *l2)
            return cmp_lt_or_gt (*l1, *l2);
The byteswap is only needed for memcmp(), but not bcmp() if I understand right what it does.
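The reason the byte swap matters only for ordering can be shown directly: equality of native-endian word values is independent of byte order, while ordering is not. A tiny sketch (words_equal is a hypothetical helper name):

```c
#include <stdint.h>
#include <string.h>

/* Two 8-byte sequences are equal iff their native-endian word values
 * are equal, on any byte order; so an equality-only compare (bcmp)
 * can load words directly. An ordering compare (memcmp) must view
 * the bytes big-endian first, hence the bswap discussed above. */
int words_equal(const unsigned char *x, const unsigned char *y)
{
    uint64_t a, b;

    memcpy(&a, x, sizeof a);
    memcpy(&b, y, sizeof b);
    return a == b;   /* no byte swap needed for a zero/nonzero answer */
}
```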
If we want to provide an optimized bcmp() for the kernel, it should not be visible for architectures without unaligned access, and for everyone else we might have a generic implementation, something like:
int bcmp(const void *s1, const void *s2, size_t n)
{
    unsigned long v = 0;
    const char *e1 = s1 + n, *e2 = s2 + n;

    /* short data */
    if (n <= sizeof(long)) {
        switch (n) {
        case 4 ... 8:
            return (*(unsigned int *)s1 ^ *(unsigned int *)s2) |
                   (*(unsigned int *)(e1 - 4) ^ *(unsigned int *)(e2 - 4));
        case 3:
            v |= *(unsigned char *)(e1 - 3) ^ *(unsigned char *)(e2 - 3);
        case 2:
            v |= *(unsigned char *)(e1 - 2) ^ *(unsigned char *)(e2 - 2);
        case 1:
            v |= *(unsigned char *)(e1 - 1) ^ *(unsigned char *)(e2 - 1);
        }
        return v;
    }

    /* peek at the last word first */
    v = *((unsigned long *)e1 - 1) ^ *((unsigned long *)e2 - 1);
    if (v)
        return 1;

    /* loop over the rest from the start */
    while (n > sizeof(unsigned long)) {
        if (*(unsigned long *)s1 != *(unsigned long *)s2)
            return 1;
        s1 += sizeof(unsigned long);
        s2 += sizeof(unsigned long);
        n -= sizeof(unsigned long);
    }
    return 0;
}
int bcmp(const void *s1, const void *s2, size_t n) __attribute__((alias("memcmp")));
(In reply to Arnd Bergmann from comment #11)
(In reply to Adhemerval Zanella from comment #8)
(In reply to Clement Courbet from comment #6)
I am not sure if it is that hard to unroll the loop in platform agnostic way:
#if __BYTE_ORDER == __BIG_ENDIAN # define cmp_lt_or_gt(a, b) ((a) > (b) ? 1 : -1) #else # define cmp_lt_or_gt(a, b) ((__builtin_bswap64 (a) > __builtin_bswap64 (b) ? 1 : -1)) #endif
The kernel defines a cpu_to_le32p()/cpu_to_le64p() helper that would be used here, but it would still have to do this separately for 32-bit and 64-bit, and it may have to worry about alignment.
This is just a rough example; my point is that the memcmp expansion can be worked out (sure, alignment constraints, tail handling, etc. should be considered). In any case, I understand the approach from the kernel side to handle this issue is to fix it by either adding an alias to memcmp or suppressing the builtin, correct? Should we close this issue as WONTFIX?
This is working as intended.
When a compilation doesn't specify -ffreestanding, the compiler assumes that
the functions provided by the environment's normal libc will be available at
link time. (And in this case, glibc provides bcmp, so the compiler may assume
bcmp's availability.)
I do understand *why* the kernel is not using -ffreestanding -- it wants to get
the typical compiler optimizations based on the semantics of the standard
string/mem functions.
But, as happened here, building in that way may periodically require
implementing some additional function, or providing additional -fno-builtin
flags.
We could discuss adding support in the compiler for doing the reverse --
starting with -ffreestanding, and then explicitly enabling all of the builtins
that the kernel DOES provide and want optimizations for. That seems a
potentially reasonable compiler enhancement, if kernel folks want to go that
way, but I'm not really sure it's worth it, given how infrequently this sort of
situation pops up.
(In reply to James Y Knight from comment #13)
> When a compile doesn't specify -ffreestanding, the compiler assumes that the
> functions provided by the environment's normal libc will be available to the
> link. (And in this case, glibc provides bcmp, so the compiler may assume
> bcmp's availability.)
bcmp is not specified by any standard and violates
the c and posix extern linkage namespace rules so
the compiler should not generate it, in particular
with static linking it may break applications that
use this symbol for something else (which is valid
use and works with glibc currently).
this is a bug that should be fixed.
(In reply to Szabolcs Nagy from comment #14)
> (In reply to James Y Knight from comment #13)
> > When a compile doesn't specify -ffreestanding, the compiler assumes that the
> > functions provided by the environment's normal libc will be available to the
> > link. (And in this case, glibc provides bcmp, so the compiler may assume
> > bcmp's availability.)
>
> bcmp is not specified by any standard and violates
> the c and posix extern linkage namespace rules so
> the compiler should not generate it, in particular
> with static linking it may break applications that
> use this symbol for something else (which is valid
> use and works with glibc currently).
>
> this is a bug that should be fixed.
What target triple is used for the compilation? Does it end in -gnu? Maybe we
want a different target triple for kernel compilation?
The objection in comment #14 is unrelated to the kernel.
The bcmp function is indeed specified by multiple standards, for example SUSv3, aka POSIX-2001. (However, it is not specified by POSIX-2008.)
More to the point, however, the compiler does have certain environments which it supports. When it's targeting a known environment (e.g. "linux-gnu"), it enables support for library functions which are specified by glibc, not those specified by a published standard.
This is far from the only library function which clang/llvm knows about which is not specified by posix.
Saying that the compiler should only know about functions which are specified by a standards body (and, then, which one? ISO C or POSIX?) would be an extreme departure from current practice.
Re #15: While using some target triple which doesn't imply that gnu libc is in use is quite possibly a good idea, I'm not sure if it may have some other effects which may be undesirable at least in the short-term.
The kernel is typically built now using a GCC toolchain configured for the GNU environment, and if there are significant differences between, say, arm-linux-gnueabihf and arm-linux-eabihf, it may well be expecting the GNU-specific behavior.
It certainly does seem worth a shot, as a long-term cleanliness issue though!
(In reply to James Y Knight from comment #16)
> The objection in comment #14 is unrelated to the kernel.
>
> The bcmp function is indeed specified by multiple standards, for example
> SuSv3 aka POSIX-2001. (However, it is not specified by POSIX-2008.)
>
> More to the point, however, the compiler does have certain environments
> which it supports. When it's targeting a known environment (e.g.
> "linux-gnu"), it enables support for library functions which are specified
> by glibc, not those specified by a published standard.
>
> This is far from the only library function which clang/llvm knows about
> which is not specified by posix.
>
> Saying that the compiler should only know about functions which are
> specified by a standards body (and, then, which one? ISO C or POSIX?) would
> be an extreme departure from current practice.
As Szabolcs has pointed out, this optimization will introduce linkage-namespace
issues. bcmp is expected to be a defined symbol only if the user specifies that
the standard they are building against provides it.
For instance, glibc explicitly checks for linkage-namespace pollution for
internal symbols [1], and bcmp is defined for:
#if !defined ISO && !defined ISO99 && !defined ISO11 && !defined POSIX && !defined XPG4
# if !defined XOPEN2K8 && !defined POSIX2008
function int bcmp (const void*, const void*, size_t)
[...]
# endif
It means that for -std=c99 and -std=c11, glibc will *not* declare 'bcmp' (as it
is not defined by the standard) and also does not expect the 'bcmp' symbol name
when static linking. This also means programs are free to have an internal
symbol with the same name.
The same applies when defining the expected POSIX interface: if users specify
-D_POSIX_C_SOURCE=200112L (meaning POSIX-2001), bcmp will be declared; however,
if users explicitly specify POSIX 2008 with -D_POSIX_C_SOURCE=200809L, it will
not be.
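As a concrete (hypothetical) illustration of the namespace point: a C99 program may legitimately define its own external bcmp with different behavior, since C99 does not reserve the name. This sketch counts mismatching bytes, which happens to share the zero-iff-equal property but is otherwise a different contract than the libc function the compiler assumes when emitting bcmp calls:

```c
#include <stddef.h>

/* Hypothetical C99 translation unit defining its own external bcmp.
 * It returns the number of mismatching bytes, not just zero/nonzero,
 * which is valid for the program's own use under -std=c99 but not
 * what a compiler-emitted bcmp call site would be entitled to. */
int bcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *a = s1, *b = s2;
    int mismatches = 0;

    while (n--)
        mismatches += (*a++ != *b++);
    return mismatches;
}
```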
This might not be an issue with generic glibc usage because of either
_GNU_SOURCE or the default (_DEFAULT_SOURCE), but as aforementioned, it will
start to cause namespace issues for c99 and c11 code that defines its own
'bcmp', or that targets only POSIX 2008 (which a libc implementation is free
to do).
So I think llvm needs to implement this optimization using a name from the
reserved namespace, or by first adding a new API to the standard or libc.
[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=conform/data/strings.h-
data;h=13827ebed9a1c42e9a5e8fd3826de877adfa2d52;hb=HEAD
Again, this is unrelated to the reported kernel issue with bcmp first reported here. And also again, this is how LLVM treats all known library functions, not a new or unique issue with bcmp.
LLVM has a wide variety of non-standard builtin functions which it recognizes in certain environments, and will optimize calls from and to these functions regardless of the standards mode of the source code. Users may certainly define such a function in their own code, but it had better have the expected semantics, or bad things are going to happen.
See, for example, "stpcpy" which gets recognized and optimized (regardless of language mode), or "memset_pattern16" or "fputc_unlocked" which can be emitted as the result of optimizations of other functions (again regardless of language mode).
I do see the argument that LLVM shouldn't treat all libfuncs in this way when calling clang with a strict conformance mode -- e.g. that -std=c90
should ensure that only functions defined by C90 (or any implementation-namespace functions) may be recognized by the backend, or emitted from a transformation of other code, in case the program uses an un-reserved name for other purposes.
But that's not the behavior today, across the board. If you'd like to change that, it is independent of this issue, and needs to be moved to a separate wider discussion.
Now, if it were the case that real programs were especially likely to be redefining "bcmp" in particular with semantics incompatible with the usual definition, perhaps we should be considering doing something specifically for it. But I do not think anyone has indicated that to be the case, and I'd rather expect the opposite, considering that it is a longstanding function.
(In reply to James Y Knight from comment #19)
> Again, this is unrelated to the reported kernel issue with bcmp first
> reported here. And also again, this is how LLVM treats all known library
> functions, not a new or unique issue with bcmp.
It is an issue because LLVM is emitting a nonstandard symbol even when the user
is not expecting such behavior. There is no issue when LLVM emits memset or
memcpy, for instance, because such symbols are defined in all supported
standards.
>
> LLVM has a wide variety of non-standard builtin functions which it
> recognizes in certain environments, and will optimizes calls from and to
> these functions regardless of the standards mode of the source code. Users
> may certainly define such a function in their own code, but it had better
> have the expected semantics, or bad things are going to happen.
>
> See, for example, "stpcpy" which gets recognized and optimized (regardless
> of language mode), or "memset_pattern16" or "fputc_unlocked" which can be
> emitted as the result of optimizations of other functions (again regardless
> of language mode).
Those examples are all non-standard symbols that the compiler simplifies by
issuing standard symbols. 'bcmp' does the inverse.
>
> I do see the argument that LLVM _shouldn't_ treat all libfuncs in this way
> when calling clang with a strict conformance mode -- e.g. that `-std=c90`
> should ensure that only functions defined by C90 (or any
> implementation-namespace functions) may be recognized by the backend, or
> emitted from a transformation of other code, in case the program uses the an
> un-reserved name for other purposes.
No, LLVM should do the inverse: only enable the 'bcmp' optimization when the
user explicitly enables it. That allows code built for c99 and c11 (where
programs may implement their own bcmp) not to break.
>
> But that's not the behavior today, across the board. If you'd like to change
> that, it is independent of this issue, and needs to be moved to a separate
> wider discussion.
>
> Now, if it were the case that real programs were especially likely to be
> redefining "bcmp" in particular with semantics incompatible with the usual
> definition, perhaps we should be considering doing something specifically
> for it. But I do not think anyone has indicated that to be the case, and I'd
> rather expect the opposite, considering that it is a longstanding function.
This is a dangerous precedent: it means LLVM can now freely emit deprecated
symbols users are not expecting, and it is up to the users to detect this and
adjust their code accordingly.
I am not stating this change should be reverted; my point is that it should not
be enabled by default as-is. LLVM should either use a configurable non-standard
symbol or work towards making this new memcmp variant a de facto standard.
(In reply to Adhemerval Zanella from comment #20)
> > See, for example, "stpcpy" which gets recognized and optimized (regardless
> > of language mode), or "memset_pattern16" or "fputc_unlocked" which can be
> > emitted as the result of optimizations of other functions (again regardless
> > of language mode).
>
> All examples are for non-standard symbols which are simplified by compiler
> issuing standard symbols. The 'bcmp' is doing the inverse.
Incorrect. The latter two examples are emitted by the backend as transforms of
other code (fputc can transform into fputc_unlocked, and certain store-loops
into memset_pattern16).
But _both_ directions are invalid, if we wish to allow users to incompatibly
define their own implementations of known libc functions. I could define a
function "char *stpcpy(char *src, char *dest)" which copies from the first arg
to the second arg. But then, llvm will simplify it based on the known-semantics
being that the _first_ arg is "dest".
> This is a dangerous precedent
This is not precedent, the precedent is quite widely established already.
(In reply to Adhemerval Zanella from comment #20)
> (In reply to James Y Knight from comment #19)
> > Again, this is unrelated to the reported kernel issue with bcmp first
> > reported here. And also again, this is how LLVM treats all known library
> > functions, not a new or unique issue with bcmp.
>
> It is an issue because LLVM is emitting a nonstandard symbol even when the
> user is not expecting such behavior. There is no issue when LLVM emits
> memset or memcpy, for instance, because such symbols are defined in all
> supported standards.
>
The language standard affects the input to the compiler, programming-
environment standards (e.g., POSIX) do likewise - they do not restrict the
output from the compiler. The output from the compiler is specific to the
target and execution environment in many ways. Again, the issue here is
providing LLVM with an appropriate execution-environment profile for this case
(or whatever other cases are relevant). Right now, we use the target triple to
identify the execution environment, with some enhancements for other aspects
such as available vector math libraries, and this is the part of LLVM that
needs to be adapted to solve these kinds of problems.
The request here is that LLVM, by default, assume a limited set of available
system symbols - specifically, one that corresponds to some standard such as
POSIX. The problem, as best I can tell, is that LLVM already has this
conservative-by-default behavior. The problem is that the compiler is being
configured with an execution-environment profile (e.g., a target triple of
*-linux-gnu) that provides for the availability of more than that. If we want
minor variants of that, then we need a new profile (e.g., a new triple).
(In reply to James Y Knight from comment #21)
> (In reply to Adhemerval Zanella from comment #20)
> > > See, for example, "stpcpy" which gets recognized and optimized (regardless
> > > of language mode), or "memset_pattern16" or "fputc_unlocked" which can be
> > > emitted as the result of optimizations of other functions (again
> > > regardless of language mode).
> >
> > All examples are for non-standard symbols which are simplified by compiler
> > issuing standard symbols. The 'bcmp' is doing the inverse.
>
> Incorrect. The latter two examples are emitted by the backend as transforms
> of other code (fputc can transform into fputc_unlocked, and certain
> store-loops into memset_pattern16).
So it is a non-standard symbol being transformed into non-standard symbols. At
least for glibc, GNU extensions are guaranteed to continue to be provided. At
least the GCC code seems to make this kind of builtin definition enabled in all
cases now (but this is another topic).
>
> But _both_ directions are invalid, if we wish to allow users to incompatibly
> define their own implementations of known libc functions. I could define a
> function "char *stpcpy(char *src, char *dest)" which copies from the first
> arg to the second arg. But then, llvm will simplify it based on the
> known-semantics being that the _first_ arg is "dest".
>
> > This is a dangerous precedent
>
> This is not precedent, the precedent is quite widely established already.
Sigh... alright. I would still prefer if we could make bcmp, with the optimized
semantics, a standard symbol before pushing this kind of optimization into the
compiler, exactly to avoid breakage like the kernel build.
(In reply to Adhemerval Zanella from comment #23)
> (In reply to James Y Knight from comment #21)
> > Incorrect. The latter two examples are emitted by the backend as transforms
> > of other code (fputc can transform into fputc_unlocked, and certain
> > store-loops into memset_pattern16).
>
> So it is non-standard symbol being transformed to non-standard symbols. At
> least for glibc, GNU extensions are guaranteed to continue to be provided.
> At least GCC code seems to make this kind of builtin definition to now be
> enabled in all cases (but this is another topic).
I looked at a couple of other libc implementations:
- musl and glibc define bcmp, as you said in comment #1.
- uClibc can be configured to define it, but the default is to leave it out
- bionic (Android) does not define it
- newlib defines it
All implementations that I looked at that provide bcmp() make it an alias or a
tail call to memcmp().
(In reply to Arnd Bergmann from comment #24)
> (In reply to Adhemerval Zanella from comment #23)
> > (In reply to James Y Knight from comment #21)
> > > Incorrect. The latter two examples are emitted by the backend as
> > > transforms of other code (fputc can transform into fputc_unlocked, and
> > > certain store-loops into memset_pattern16).
> >
> > So it is non-standard symbol being transformed to non-standard symbols. At
> > least for glibc, GNU extensions are guaranteed to continue to be provided.
> > At least GCC code seems to make this kind of builtin definition to now be
> > enabled in all cases (but this is another topic).
>
> I looked at a couple of other libc implementations:
>
> - musl and glibc define bcmp, as you said in comment #1.
> - uClibc can be configured to define it, but the default is to leave it out
> - bionic (Android) does not define it
> - newlib defines it
Yes, this was previously discussed on llvm-dev. Please note that LLVM only
emits a call for target environments which provide it (as specified by the
target triple).
> All implementations that I looked at that provide bcmp() make it an alias or
> a tail call to memcmp().
Guillaume is working on contributing an optimized version at least to glibc.
(Also, as previously mentioned, NetBSD and FreeBSD already have separate
implementations, which do omit computing the ordering for the mismatching
words, but don't use the trick of doing an early comparison of the end.)
Changing the target triple for the kernel is overkill. We can just provide an implementation in the kernel: https://lkml.org/lkml/2019/3/13/562
(In reply to Arnd Bergmann from comment #24)
> - bionic (Android) does not define it
Thanks for this report, I've reported it internally to the bionic maintainers.
The fix for this on the kernel side is now in mainline (I expect LTS backports
this week).
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5f074f3e192f10c9fade898b9b3b8812e3d83342
Next time a "compiler optimization" that relies on library support is written,
please ensure that the relevant libraries for environments that use LLVM
actually have the proper support. Otherwise such "compiler optimizations" will
regress builds for environments that lack that library support.
Specifically in this case, I think it would have been better to ship the
optimized version in glibc first, thus proving why such a compiler optimization
would be worthwhile; then verify that all relevant targets support bcmp; and
only then ship the optimization in LLVM.
Closing this issue now as fixed (it was originally filed because of the kernel
regression). Please reopen if there's more to do here.