Open PHHargrove opened 8 years ago
AMDG
There were quite a lot of merge conflicts relating to how alignment is tracked. It's quite possible that I messed something up and clang thinks that it's 16-byte aligned even though it isn't.
In Christ, Steven Watanabe
Steven,
I have just now reproduced the problem with the official llvm and clang 3.8.0 source tarballs. So, the issue does not arise from the merge.
I am trying to build from the llvm-trunk next, but need to build cmake3 to get there from here.
-Paul
I built llvm and clang from the official git mirror of the svn. The problem is still present there.
-Paul
I also see this problem with the released clang 3.8.1 x86_64-unknown-windows-cygnus, this definitely has nothing to do with the clang-upc branch. I see it only for the psearch test, but the disassembly looks very similar to Paul's example.
I've reproduced this problem in a small example that mimics what UPCR is doing (renamed with a .txt extension because github refuses to attach .c files): clangcp.c.txt
The setup requires two object files linked together, although I've placed the code in the same source file for simplicity and used the preprocessor to get the same effect. Here is the actual code:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
/* header */
typedef struct {
char b1[4];
char b2[24];
} foo_t;
extern foo_t *init(void);
#if ONE /* first source file */
extern char a1[4];
extern char a2[24];
foo_t *init(void) {
foo_t *foo = malloc(sizeof(foo_t));
memcpy(&(foo->b1), &a1, sizeof(a1));
memcpy(&(foo->b2), &a2, sizeof(a2));
return foo;
}
#else /* second source file */
#if STRICT
char a1[4] = {};
char a2[24] = {};
#else
int a1 = 0;
struct {
char x1[20];
int x2;
} a2 = { };
#endif
int main() {
foo_t * f = init();
if (f) printf("ok\n%s",f->b2);
return 0;
}
#endif
Works with gcc -O3:
$ set -x ; CC="gcc -O3 -Wall" ; $CC -DONE -c clangcp.c -o one.o && $CC clangcp.c one.o && ./a.exe
+ set -x
+ CC='gcc -O3 -Wall'
+ gcc -O3 -Wall -DONE -c clangcp.c -o one.o
+ gcc -O3 -Wall clangcp.c one.o
+ ./a.exe
ok
Fails with clang -O0 or -O1 or -O2 or -O3:
$ set -x ; CC="clang -O0 -Wall" ; $CC -DONE -c clangcp.c -o one.o && $CC clangcp.c one.o && ./a.exe
+ set -x
+ CC='clang -O0 -Wall'
+ clang -O0 -Wall -DONE -c clangcp.c -o one.o
+ clang -O0 -Wall clangcp.c one.o
+ ./a.exe
Segmentation fault (core dumped)
Dump of assembler code for function init:
0x0000000100401130 <+0>: sub $0x28,%rsp
0x0000000100401134 <+4>: mov $0x1c,%eax
0x0000000100401139 <+9>: mov %eax,%ecx
0x000000010040113b <+11>: callq 0x1004011a0 <malloc>
0x0000000100401140 <+16>: mov %rax,0x20(%rsp)
0x0000000100401145 <+21>: movabs $0x100407000,%rcx
0x000000010040114f <+31>: mov (%rcx),%edx
0x0000000100401151 <+33>: mov %edx,(%rax)
0x0000000100401153 <+35>: mov 0x20(%rsp),%rax
0x0000000100401158 <+40>: movabs $0x100407004,%rcx
0x0000000100401162 <+50>: mov 0x10(%rcx),%r8
0x0000000100401166 <+54>: mov %r8,0x14(%rax)
=> 0x000000010040116a <+58>: movaps (%rcx),%xmm0
0x000000010040116d <+61>: movups %xmm0,0x4(%rax)
0x0000000100401171 <+65>: mov 0x20(%rsp),%rax
0x0000000100401176 <+70>: add $0x28,%rsp
0x000000010040117a <+74>: retq
0x000000010040117b <+75>: nopl 0x0(%rax,%rax,1)
Works for clang (all opt levels) if you add -fno-builtin-memcpy:
$ set -x ; CC="clang -O3 -fno-builtin-memcpy -Wall" ; $CC -DONE -c clangcp.c -o one.o && $CC clangcp.c one.o && ./a.exe
+ set -x
+ CC='clang -O3 -fno-builtin-memcpy -Wall'
+ clang -O3 -fno-builtin-memcpy -Wall -DONE -c clangcp.c -o one.o
+ clang -O3 -fno-builtin-memcpy -Wall clangcp.c one.o
+ ./a.exe
ok
Works for clang (all opt levels) if STRICT is defined, which causes the declarations of a1 and a2 to be type-equivalent in both objects, rather than just size-equivalent.
$ set -x ; CC="clang -O0 -Wall" ; $CC -DONE -c clangcp.c -o one.o && $CC clangcp.c -DSTRICT one.o && ./a.exe
+ set -x
+ CC='clang -O0 -Wall'
+ clang -O0 -Wall -DONE -c clangcp.c -o one.o
+ clang -O0 -Wall clangcp.c -DSTRICT one.o
+ ./a.exe
ok
Analysis:
TL;DR: Clang may be justified in performing this optimization, and UPCR is probably in the wrong by type-punning our symbols across objects. However, UPCR's design for struct pthread TLD relies heavily on this admitted hack working as expected.
The good news is I believe the problematic type-punning is constrained to startup init (where performance is non-critical), because after that point everything is structure fields within a single object (where we perform other type-punning, but should be respecting the ABI struct field alignment conventions). Furthermore with my new understanding of the problem, I believe I can deploy a surgical fix to defeat the problematic rewriting of this memcpy.
First, I want to clarify that this is a bug report against the clang C compiler built from the current repo, not the UPC compiler. However, the problem appears to be bad code generated for the TLD initialization performed by the Berkeley UPC runtime library. Specifically it looks like
movaps
instructions are being generated for addresses which are not 16-byte aligned.There are currently a handful of tests that are failing in our automated testing on an x86-64/Linux platform, when using the Berkeley UPC compiler (translator+runtime) configured to use Clang (build from the clang-upc repo) as the back-end compiler:
The problem appears only when back-end optimizations are enabled (
-O1
,-O2
and-O3
all fail). The core dumps for all cases, show that the error is in the initialization of thread-local-data (and thus only occurs when -pthreads is passed to upcc), and specifically in code that results from converting a memcpy() call with constant length into a short sequence of MMX instructions:I can work around the problem by passing either
-O0
or-fno-builtin-memcpy
to the back-end clang compiler.I am pretty sure from viewing such dumps from all six failing codes that the SEGV is always from a
movaps
instruction (perhaps even the first one?) even after multiplemovups
instructions (though not seen in the dump above from test24). The difference between those instructions is thatmovaps
is only valid when the address is 16-byte aligned. Based on the address 0x6a9db8 in comment for the disassembly of the faulting instruction, that appears not to be the case.Given the machine-generated nature of the TLD initialization code, I do not currently have a simple reproducer. Note, however, that the faulting instruction is loading from initialized static data (hence the IP-relative addressing). So, I do not think this is at all related to
__thread
.This does not appear to be Clang bug 26779 because passing
-mstackrealign
did not resolve the problem.I have not yet had the opportunity to test against upstream clang.