Crash in UPCR's TLD initialization when compiled with clang

PHHargrove commented 8 years ago

First, I want to clarify that this is a bug report against the clang C compiler built from the current repo, not the UPC compiler. However, the problem appears to be bad code generated for the TLD initialization performed by the Berkeley UPC runtime library. Specifically it looks like movaps instructions are being generated for addresses which are not 16-byte aligned.

There are currently a handful of tests that are failing in our automated testing on an x86-64/Linux platform, when using the Berkeley UPC compiler (translator+runtime) configured to use Clang (build from the clang-upc repo) as the back-end compiler:

benchmarks/stream-relaxed
benchmarks/stream-strict
bugzilla/bug645
intrepid/test24
laplace/stencil
psearch/psearch

The problem appears only when back-end optimizations are enabled (-O1, -O2 and -O3 all fail). The core dumps for all cases, show that the error is in the initialization of thread-local-data (and thus only occurs when -pthreads is passed to upcc), and specifically in code that results from converting a memcpy() call with constant length into a short sequence of MMX instructions:

Core was generated by `./a.out'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004043d0 in upcri_linkergenerated_tld_init ()
(gdb) disass
Dump of assembler code for function upcri_linkergenerated_tld_init:
   0x00000000004043a0 <+0>:     push   %rbx
   0x00000000004043a1 <+1>:     mov    $0x2c0,%edi
   0x00000000004043a6 <+6>:     callq  0x402ad0 <malloc@plt>
   0x00000000004043ab <+11>:    mov    %rax,%rbx
   0x00000000004043ae <+14>:    test   %rbx,%rbx
   0x00000000004043b1 <+17>:    jne    0x4043c4 <upcri_linkergenerated_tld_init+36>
   0x00000000004043b3 <+19>:    mov    $0x4835e9,%edi
   0x00000000004043b8 <+24>:    mov    $0x2c0,%esi
   0x00000000004043bd <+29>:    xor    %eax,%eax
   0x00000000004043bf <+31>:    callq  0x406990 <upcri_errno>
   0x00000000004043c4 <+36>:    mov    0x29a182(%rip),%eax        # 0x69e54c <upcr_forall_control>
   0x00000000004043ca <+42>:    mov    %eax,0x2bc(%rbx)
=> 0x00000000004043d0 <+48>:    movaps 0x2a59e1(%rip),%xmm0        # 0x6a9db8 <alloc_list+16>
   0x00000000004043d7 <+55>:    movups %xmm0,0x170(%rbx)
   0x00000000004043de <+62>:    movaps 0x2a59c3(%rip),%xmm0        # 0x6a9da8 <alloc_list>
   0x00000000004043e5 <+69>:    movups %xmm0,0x160(%rbx)
   0x00000000004043ec <+76>:    mov    %rbx,%rdi
   0x00000000004043ef <+79>:    add    $0x180,%rdi
   0x00000000004043f6 <+86>:    mov    $0x483330,%esi
   0x00000000004043fb <+91>:    mov    $0x138,%edx
   0x0000000000404400 <+96>:    callq  0x402e20 <memcpy@plt>
   0x0000000000404405 <+101>:   mov    0x2a598d(%rip),%eax        # 0x6a9d98 <upcrt_forall_control>
   0x000000000040440b <+107>:   mov    %eax,0x2b8(%rbx)
   0x0000000000404411 <+113>:   mov    %rbx,%rax
   0x0000000000404414 <+116>:   pop    %rbx
   0x0000000000404415 <+117>:   retq
End of assembler dump.

I can work around the problem by passing either -O0 or -fno-builtin-memcpy to the back-end clang compiler.

I am pretty sure from viewing such dumps from all six failing codes that the SEGV is always from a movaps instruction (perhaps even the first one?) even after multiple movups instructions (though not seen in the dump above from test24). The difference between those instructions is that movaps is only valid when the address is 16-byte aligned. Based on the address 0x6a9db8 in comment for the disassembly of the faulting instruction, that appears not to be the case.

Given the machine-generated nature of the TLD initialization code, I do not currently have a simple reproducer. Note, however, that the faulting instruction is loading from initialized static data (hence the IP-relative addressing). So, I do not think this is at all related to __thread.

This does not appear to be Clang bug 26779 because passing -mstackrealign did not resolve the problem.

I have not yet had the opportunity to test against upstream clang.

swatanabe commented 8 years ago

AMDG

There were quite a lot of merge conflicts relating to how alignment is tracked. It's quite possible that I messed something up and clang thinks that it's 16-byte aligned even though it isn't.

In Christ, Steven Watanabe

PHHargrove commented 8 years ago

Steven,

I have just now reproduced the problem with the official llvm and clang 3.8.0 source tarballs. So, the issue does not arise from the merge.

I am trying to build from the llvm-trunk next, but need to build cmake3 to get there from here.

-Paul

PHHargrove commented 8 years ago

I built llvm and clang from the official git mirror of the svn. The problem is still present there.

-Paul

bonachea commented 7 years ago

I also see this problem with the released clang 3.8.1 x86_64-unknown-windows-cygnus, this definitely has nothing to do with the clang-upc branch. I see it only for the psearch test, but the disassembly looks very similar to Paul's example.

I've reproduced this problem in a small example that mimics what UPCR is doing (renamed with a .txt extension because github refuses to attach .c files): clangcp.c.txt

The setup requires two object files linked together, although I've placed the code in the same source file for simplicity and used the preprocessor to get the same effect. Here is the actual code:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* header */
typedef struct {
  char b1[4];
  char b2[24];
} foo_t;
extern foo_t *init(void);

#if ONE  /* first source file */

  extern char a1[4];
  extern char a2[24];

  foo_t *init(void) {
    foo_t *foo = malloc(sizeof(foo_t));
    memcpy(&(foo->b1), &a1, sizeof(a1));
    memcpy(&(foo->b2), &a2, sizeof(a2));
    return foo;
  }

#else /* second source file */

  #if STRICT
    char a1[4] = {};
    char a2[24] = {};
  #else
    int a1 = 0;
    struct {
      char x1[20];
      int x2;
    } a2 = { };
  #endif

  int main() {
    foo_t * f = init();
    if (f) printf("ok\n%s",f->b2);
    return 0;
  }
#endif

Works with gcc -O3:

$ set -x ; CC="gcc -O3 -Wall" ; $CC -DONE -c clangcp.c -o one.o && $CC clangcp.c one.o && ./a.exe                  
+ set -x
+ CC='gcc -O3 -Wall'
+ gcc -O3 -Wall -DONE -c clangcp.c -o one.o
+ gcc -O3 -Wall clangcp.c one.o
+ ./a.exe
ok

Fails with clang -O0 or -O1 or -O2 or -O3:

$ set -x ; CC="clang -O0 -Wall" ; $CC -DONE -c clangcp.c -o one.o && $CC clangcp.c one.o && ./a.exe         
+ set -x
+ CC='clang -O0 -Wall'
+ clang -O0 -Wall -DONE -c clangcp.c -o one.o
+ clang -O0 -Wall clangcp.c one.o
+ ./a.exe
Segmentation fault (core dumped)

Dump of assembler code for function init:
   0x0000000100401130 <+0>:     sub    $0x28,%rsp
   0x0000000100401134 <+4>:     mov    $0x1c,%eax
   0x0000000100401139 <+9>:     mov    %eax,%ecx
   0x000000010040113b <+11>:    callq  0x1004011a0 <malloc>
   0x0000000100401140 <+16>:    mov    %rax,0x20(%rsp)
   0x0000000100401145 <+21>:    movabs $0x100407000,%rcx
   0x000000010040114f <+31>:    mov    (%rcx),%edx
   0x0000000100401151 <+33>:    mov    %edx,(%rax)
   0x0000000100401153 <+35>:    mov    0x20(%rsp),%rax
   0x0000000100401158 <+40>:    movabs $0x100407004,%rcx
   0x0000000100401162 <+50>:    mov    0x10(%rcx),%r8
   0x0000000100401166 <+54>:    mov    %r8,0x14(%rax)
=> 0x000000010040116a <+58>:    movaps (%rcx),%xmm0
   0x000000010040116d <+61>:    movups %xmm0,0x4(%rax)
   0x0000000100401171 <+65>:    mov    0x20(%rsp),%rax
   0x0000000100401176 <+70>:    add    $0x28,%rsp
   0x000000010040117a <+74>:    retq   
   0x000000010040117b <+75>:    nopl   0x0(%rax,%rax,1)

Works for clang (all opt levels) if you add -fno-builtin-memcpy:

$ set -x ; CC="clang -O3 -fno-builtin-memcpy -Wall" ; $CC -DONE -c clangcp.c -o one.o && $CC clangcp.c one.o && ./a.exe 
+ set -x
+ CC='clang -O3 -fno-builtin-memcpy -Wall'
+ clang -O3 -fno-builtin-memcpy -Wall -DONE -c clangcp.c -o one.o
+ clang -O3 -fno-builtin-memcpy -Wall clangcp.c one.o
+ ./a.exe
ok

Works for clang (all opt levels) if STRICT is defined, which causes the declarations of a1 and a2 to be type-equivalent in both objects, rather than just size-equivalent.

$ set -x ; CC="clang -O0 -Wall" ; $CC -DONE -c clangcp.c -o one.o && $CC clangcp.c -DSTRICT one.o && ./a.exe
+ set -x
+ CC='clang -O0 -Wall'
+ clang -O0 -Wall -DONE -c clangcp.c -o one.o
+ clang -O0 -Wall clangcp.c -DSTRICT one.o
+ ./a.exe
ok

Analysis:

Unlike UPCR, I see the problem even with -O0. I suspect the fact that clang -O0 makes the problem disappear for UPCR is probably just due to the complexity of the runtime code obfuscating analysis, and not something fundamental. (Most likely the inlining of upcri_checkmalloc, which provides alignment information for the memcpy target)
-fno-builtin-memcpy works around the problem by avoiding the inlined version of memcpy that attempts instructions with insufficient alignment.
UPCR is definitely doing something "sketchy" here, in relying upon the fact that symbols of equal size and internal alignment but different declared types in two object files can be linked together. C99 doesn't say much about external linkage type compatibility requirements, although the ABI might say more.
Additional debugging output confirms the a2 symbol used as the source of the problematic movaps instruction is 16-byte aligned with -DSTRICT where it's defined as a char array (0x100407010) but unaligned without it where it's defined as a struct, as shown above (0x100407004).
The offending instructions appear within the function contained in one.o, which is compiled before the second object where the type punning (conditionally) happens. It appears the compiler ensures stronger alignment for the char array form of the declaration, and the builtin memcpy in one.o relies upon this, but the struct definition in the second object file defines the symbol with a looser alignment, causing the breakage.

TL;DR: Clang may be justified in performing this optimization, and UPCR is probably in the wrong by type-punning our symbols across objects. However, UPCR's design for struct pthread TLD relies heavily on this admitted hack working as expected.

The good news is I believe the problematic type-punning is constrained to startup init (where performance is non-critical), because after that point everything is structure fields within a single object (where we perform other type-punning, but should be respecting the ABI struct field alignment conventions). Furthermore with my new understanding of the problem, I believe I can deploy a surgical fix to defeat the problematic rewriting of this memcpy.

bonachea commented 7 years ago

UPCR workaround committed as b21bb55.

This is hopefully now fixed as far as UPCR is concerned.

clangupc / clang-upc

Crash in UPCR's TLD initialization when compiled with clang #96