Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

excessive stack usage compiling linux amdgpu kernel driver #41522

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR42551
Status NEW
Importance P enhancement
Reported by Arnd Bergmann (arnd@linaro.org)
Reported on 2019-07-09 01:27:31 -0700
Last modified on 2019-07-09 09:24:45 -0700
Version trunk
Hardware PC Linux
CC htmldeveloper@gmail.com, kristof.beyls@arm.com, llvm-bugs@lists.llvm.org, ndesaulniers@google.com, neeilans@live.com, oliver.stannard@arm.com, richard-llvm@metafoo.co.uk, smithp352@googlemail.com
Fixed by commit(s)
Attachments dce-calcs-clang.c (50187 bytes, text/x-csrc)
dce-calcs-clang.i.gz (260283 bytes, application/gzip)
bw-fixed.patch (7621 bytes, text/plain)
Blocks PR4068
Blocked by
See also
Created attachment 22208
preprocessed and partially reduced file

One file in the linux kernel appears to trigger a failed optimization that
leads to large stack usage. Compiling a 32-bit ARM defconfig with the amdgpu
driver enabled, I get

drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:2987:6: error: stack
frame size of 1344 bytes in function 'bw_calcs' [-Werror,-Wframe-larger-than=]
bool bw_calcs(struct dc_context *ctx,
     ^
drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:76:13: error: stack
frame size of 5328 bytes in function 'calculate_bandwidth' [-Werror,-Wframe-
larger-than=]

I managed to partially reduce the preprocessed source file to illustrate the
problem better:

https://godbolt.org/z/ZlDp0Z

The stack usage is highly target architecture dependent:

gcc arm-linux-gnueabi:      208 bytes
clang-8 arm-linux-gnueabi: 4144 bytes
clang-9 arm-linux-gnueabi: 4992 bytes
clang-9 aarch64-linux-gnu:  272 bytes
clang-9 powerpc64:         4272 bytes
clang-9 powerpc32:         4112 bytes
clang-9 s390-32:           4168 bytes
clang-9 s390-64:           4168 bytes
clang-9 sparc32:          10272 bytes
clang-9 sparc64:            432 bytes

$ clang-9 -Wframe-larger-than=10 --target=arm-linux-gnu -O2 -fno-strict-
overflow -S dce-calcs-clang-noinline.i   -m32 -Wno-unused-value -Wno-logical-op-
parentheses -Wno-return-type -Wno-implicit-int
dce-calcs-clang.i:275:4: warning: stack frame size of 4944 bytes in function
'calculate_bandwidth' [-Wframe-larger-than=]
   calculate_bandwidth(struct bw_calcs_dceip *dceip, struct bw_calcs_vbios *vbios,                     struct bw_calcs_data *data) {
   ^

Note that we build 32-bit kernels with -Wframe-larger-than=1024 because of the
highly limited available stack space, and using 5KB of stack is likely to
result in a kernel crash.
Quuxplusone commented 5 years ago

Attached dce-calcs-clang.c (50187 bytes, text/x-csrc): preprocessed and partially reduced file

Quuxplusone commented 5 years ago

Attached dce-calcs-clang.i.gz (260283 bytes, application/gzip): original source file, preprocessed and compressed

Quuxplusone commented 5 years ago

Attached bw-fixed.patch (7621 bytes, text/plain): kernel patch to avoid passing structures by value

Quuxplusone commented 5 years ago
Got a better reduced test case, see https://godbolt.org/z/z5bVKS

struct bw_fixed { long long value; };

struct bw_fixed bw_min2(struct bw_fixed, struct bw_fixed);
struct bw_fixed bw_max2(struct bw_fixed, struct bw_fixed);
struct bw_fixed bw_mul(struct bw_fixed, struct bw_fixed);

static inline struct bw_fixed bw_max3(struct bw_fixed v1, struct bw_fixed v2,
struct bw_fixed v3) {
    return bw_max2(bw_max2(v1, v2), v3);
}

struct bw_fixed bw_int_to_fixed(long long value);

int f(struct bw_fixed _a, struct bw_fixed _b, struct bw_fixed _c)
{
    struct bw_fixed a=_a, b=_b, c=_c;
    struct bw_fixed a1=_a, b1=_b, c1=_c;
    struct bw_fixed a2=_a, b2=_b, c2=_c;

    a1 = bw_mul(bw_int_to_fixed(a.value), bw_int_to_fixed(3));
    b1 = bw_max3(a1, bw_mul(a2, a1), bw_mul(b2, c1));
    c1 = bw_max3(a2, bw_int_to_fixed(3), bw_int_to_fixed(2));

    return bw_max3(a1, b1, c1).value;
}

This just combines a couple of random operations from the original
kernel file. What can be seen here is that clang on ARM allocates
a stack slot for each temporary value resulting from calling
bw_int_to_fixed() or bw_max(), while it doesn't do that on x86-64,
and gcc never does it.