Generated LLVM code causes register spills

aravindh-krishnamoorthy commented 10 months ago

This issue stems from PR #52438 and concerns the following function on my PC with Windows 11/WSL on Intel hardware (see below). But I suspect that the problem may also apply to other hardware types.

julia> versioninfo()
Julia Version 1.11.0-DEV.1038
Commit 290cfcee73* (2023-12-16 08:49 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores

julia> function f(x::Tuple, shift::Integer)
           @inline
           j = mod1(shift, length(x))
           ntuple(k -> Base.__safe_getindex(x, k-j+ifelse(k>j,0,length(x))), Val(length(x)))
       end
f (generic function with 1 method)

Consider the LLVM code generated for the following case where the parameter shift is a variable, i.e., shift without constant propagation:

julia> x = Tuple(collect(1:50))
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50)

julia> @code_llvm f(x, 1)

The generated LLVM code has the following structure:

; ifelse + ...
; index 1
%8 = icmp sgt i64 %7, 0
%9 = select i1 %8, i64 10, i64 0
%10 = sub i64 %9, %7
%memcpy_refined_src = getelementptr inbounds [10 x i64], [10 x i64]* %"x::Tuple", i64 0, i64 %10
;
; indices 2 through 49
;
; index 50
%.inv = icmp slt i64 %7, 50
%203 = select i1 %.inv, i64 0, i64 50
%204 = sub i64 49, %7
%205 = add i64 %204, %203
%memcpy_refined_src399 = getelementptr inbounds [50 x i64], [50 x i64]* %"x::Tuple", i64 0, i64 %205

; load
; index 1
%206 = load i64, i64* %memcpy_refined_src, align 8
;
; indices 2 through 49
;
; index 50
%255 = load i64, i64* %memcpy_refined_src399, align 8

; new tuple element / store
; index 1
%"new::Tuple.sroa.0.0..sroa_idx" = getelementptr inbounds [50 x i64], [50 x i64]* %sret_return, i64 0, i64 0
store i64 %206, i64* %"new::Tuple.sroa.0.0..sroa_idx", align 8
;
; indices 2 through 49
;
; index 50
%"new::Tuple.sroa.50.0..sroa_idx449" = getelementptr inbounds [50 x i64], [50 x i64]* %sret_return, i64 0, i64 49
store i64 %255, i64* %"new::Tuple.sroa.50.0..sroa_idx449", align 8

Note that in the ifelse, load, and new/store portions, the instructions for all indices are bunched together. Next, consider the generated native code:

julia> @code_native f(x, 1)

The same structure is also carried forward into assembly (not shown here) despite there being no memory aliases! This significantly increases the register pressure and leads to a large number of register spills and reloads (8 bytes each). The spills and reloads degrade the performance of the function. Furthermore, the effects get worse as the size of the tuple increases.

One can readily see that the register spills and reloads can be completely avoided by having the ifelse, load, and store portions index by index.

This topic was brought up on Julia Discourse where it was suggested that looking into new aliasing annotations for LLVM might be useful, see link for details.

Note 1: The case **with** constant propagation does not have this issue. The resulting code is excellent.

The LLVM and native code for this case can be obtained as follows: ```julia julia> g(x) = f(x, 1) g (generic function with 1 method) julia> @code_llvm g(x) [...] julia> @code_native g(x) [...] ```

Note 2 ¹: Special solutions are sometimes found for some cases (integer tuples) but not for others.

The LLVM and native code for these case can be obtained as follows: ```julia julia> x = Tuple(collect(1:50)) (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50) julia> g(x,i) = f(x,i) g (generic function with 1 method) # LLVM code is Ok but deterministic. julia> @code_llvm g(x,1) [...] julia> x = Tuple(repeat(['a', -1, 1.0], 20)) ('a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0) # LLVM code again has the issue mentioned above. julia> @code_llvm g(x,1) [...] ```

Note: unicode superscripts denote edits.

gbaraldi commented 10 months ago

I'm taking a look at this, but this has a C reproducer https://godbolt.org/z/bGPT4WxPe and GCC does a lot better

aravindh-krishnamoorthy commented 10 months ago

I'm taking a look at this, but this has a C reproducer https://godbolt.org/z/bGPT4WxPe and GCC does a lot better

Thank you @gbaraldi for taking this up.

For the C code, I'd just make the single change below. With the change, gcc is still good, but not as good as with the original UB. I feel it only packs the results in XMM registers instead of the register spills and reloads... and might also not be scalable?

struct wtf {
-    int a[20];
+    int a[30];
};

Perhaps there's some hidden dependency that I'm not seeing? :(

gbaraldi commented 10 months ago

Yeah, that was a typo, could you try incfreasing it to 50 or 100? Though i'm not surprised if at that size it gets so bad. I opened https://github.com/llvm/llvm-project/issues/78506 upstream

aravindh-krishnamoorthy commented 10 months ago

Yeah, that was a typo, could you try incfreasing it to 50 or 100? Though i'm not surprised if at that size it gets so bad. I opened llvm/llvm-project#78506 upstream

Thank for for raising the upstream issue, @gbaraldi. Actually, seems like it takes a lot to break GCC! This is the general process to check the assembly for any N:

Copy the awk script below to rs.awk
Generate the C file for 50, $ awk -v N=50 -f rs.awk rs.awk > rs50.c. For other sizes, please just change -v N=50
Generate the assembly, $ gcc -S -masm=intel -O3 -fverbose-asm rs50.c

rs.awk

```awk $0 ~ /^##.*/ {gsub("^## ",""); print} $0 ~ /^#~.*/ {gsub("^#~ ",""); gsub("N",N); if($0 ~ /?/) {for (i=0; i ## struct wtf { #~ int a[N]; ## }; ## struct wtf __attribute__ ((noinline)) foo(struct wtf *b, int i) ## { ## struct wtf new; #~ int idx? = (? + i) % N ; #~ int val? = b->a[idx?] ; #~ new.a[?] = val? ; ## return new; ## } ## ```

gbaraldi commented 10 months ago

I get a

awk: rs.awk: line 2: regular expression compile failed (missing operand)
?
awk: rs.awk: line 2: regular expression compile failed (missing operand)
?

error

aravindh-krishnamoorthy commented 10 months ago

I get a

awk: rs.awk: line 2: regular expression compile failed (missing operand)
?
awk: rs.awk: line 2: regular expression compile failed (missing operand)
?

error

Sorry, I think you have a POSIX compatible awk but this one uses extensions from GNU awk. My version is as follows:

$ awk --version
GNU Awk 5.2.1, API 3.2, PMA Avon 8-g1, (GNU MPFR 4.2.0, GNU MP 6.2.1)
Copyright (C) 1989, 1991-2022 Free Software Foundation.

And the POSIX version fails here too:

$ awk -P -v N=10 -f rs.awk rs.awk
awk: rs.awk:2: error: Invalid preceding regular expression: /?/

gbaraldi commented 10 months ago

my awk doesn't even have a -- version option :laughing: . But gawk worked

aravindh-krishnamoorthy commented 10 months ago

my awk doesn't even have a -- version option 😆 . But gawk worked

Sorry, seems like I got the for loop wrong. The correct version is for (i=0; i<N; i++), which starts from $i=0.$

$0 ~ /^##.*/ {gsub("^## ",""); print}
- $0 ~ /^#~.*/ {gsub("^#~ ",""); gsub("N",N); if($0 ~ /?/) {for (i=1; i<N; i++) {a = $0; gsub("?",i,a); print a}} else {print}}
+ $0 ~ /^#~.*/ {gsub("^#~ ",""); gsub("N",N); if($0 ~ /?/) {for (i=0; i<N; i++) {a = $0; gsub("?",i,a); print a}} else {print}}

## #include <stddef.h>
## struct wtf {
#~     int a[N];
## };
## struct wtf __attribute__ ((noinline))  foo(struct wtf *b, int i)
## {
##     struct wtf new;
#~     int idx? = (? + i) % N ;
#~     int val? = b->a[idx?] ;
#~     new.a[?] = val? ;
##     return new;
## }
##

I've also corrected it above.

JuliaLang / julia

Generated LLVM code causes register spills #52933