[Sprint 3] [C-VII] Existing Pass: SimplifyCFG

heatz123 commented 1 year ago

SimplifyCFG

Issue link: #48, #49

Note that this pass should be merged after #46.

llvm 공식 docs의 simplifyCFG 설명에 따르면, 본 패스는 해당 기능을 수행한다.

Removes basic blocks with no predecessors.
Merges a basic block into its predecessor if there is only one and the predecessor only has one successor.
Eliminates PHI nodes for basic blocks with a single predecessor.
Eliminates a basic block that only contains an unconditional branch.

1. 착안점

SimplifyCFG는 merge 가능한 basic block을 합칠 수 있는 기능이 있기에, add-to-sum, aload, oracle 등 basic block 내에서 잘 작동하는 pass와 시너지를 낼 수 있다.
49 (Existing Pass의 벤치마킹 결과 정리)에 따라, 실증적으로 SimplifyCFG를 가져오는 것이 최선이라고 판단하였다.

2. 최적화 개요

SimplifyCFGPass를 호출한다. 이때, 추가적인 옵션을 설정하여 최적화를 개선한다.

3. 구현 방식

SimplifyCFGPass에서 추가적으로 설정한 옵션은 다음과 같다.

  Opts.needCanonicalLoops(false);
  Opts.hoistCommonInsts(true);
  Opts.sinkCommonInsts(true);

CanonicalLoop를 유지하면 cost가 추가적으로 발생할 수 있으므로 false로 설정하였다.
insts hoisting과 sinking을 통해 추가적인 optimization을 유도하였다.

또한 SimplifyCFG는 redundant한 basic block의 merge를 수행하므로, 해당 basic block이 만들어질 수 있는

프로그램 초기
LCSSAPass (+ Unrolling Pass)
Function Inlining Pass 이후에 반복하여 SimplifyCFG를 적용하였다.

4. 유닛 테스트

fold_branch.ll: Check if SimplifyCFGPass folds branch to a common destination. hoist_common_insts.ll: Check if SimplifyCFGPass hoists common instructions. merge.ll: ; Check if SimplifyCFGPass merges basic blocks with unconditional branches. phi_elimiate.ll: Check if SimplifyCFGPass merges empty basic blocks and replaces the phi node to select instruction. switch.ll: Check if simplifyCFGPass converts if-else tree to switch.

예시: merge.ll

define void @test(i1 %c) {
; CHECK-LABEL: @test(
; CHECK-NEXT:  T:
; CHECK-NEXT:    [[X:%.*]] = add i32 0, 0
; CHECK-NEXT:    call void @use(i1 [[C:%.*]])
; CHECK-NEXT:    ret void
;
  br label %T
T:
  %x = add i32 0, 0
  br label %F
F:
  call void @use( i1 %c )
  ret void
}

해당 basic block은 SimplifyCFG 이후 합쳐진다.

5. 성능 비교 (이전/이후)

성능 변화가 크진 않으나, 많은 프로그램에 적용이 가능한 것을 확인하였다.

heatz123 commented 1 year ago

참고로 hoisting을 적용하였을 때 / 적용하지 않았을 때 성능 차이는 다음과 같았습니다. cost_gain (전: 적용 / 후 : 미적용)

heatz123 commented 1 year ago

수고하셨습니다. 그런데 혹시 gcd에서 성능 감소가 나는 이유가 있을까요? gcd같은 경우 모든 input에 대회 최종 cost가 100~200정도밖에 안되기에 10% 성능 감소라는 의미는 한두개의 instruction이 추가되었기에 생긴 문제라 판단되는데, 보통 eliminate만 하는 SimplifyCFG에서 뭔가를 추가할만한 계기가 있는지 궁금합니다. 물론 gcd는 이미 음수의 최적화가 진행되는 중이기에 gcd 최적화를 0%로 올리더라도 점수는 동일할 것 같기는 하지만, 그래도 이후 추가로 gcd최적화가 진행된다면 약간 문제가 생길 수 있을 것 같습니다.

사실 이 부분에서 왜 성능 하락이 발생하는지 알아내는 데 시간이 조금 걸렸는데요,

일단 llvm 변환을 보시면 이전:

; Function Attrs: nounwind uwtable
define dso_local i64 @gcd(i64 noundef %x, i64 noundef %y) #0 {
entry:
  %cmp = icmp eq i64 %x, 0
  br i1 %cmp, label %if.then, label %if.end

if.then:                                          ; preds = %entry
  br label %return

if.end:                                           ; preds = %entry
  %cmp1 = icmp eq i64 %y, 0
  br i1 %cmp1, label %if.then2, label %if.end3

if.then2:                                         ; preds = %if.end
  br label %return

if.end3:                                          ; preds = %if.end
  %cmp4 = icmp ugt i64 %x, %y
  br i1 %cmp4, label %if.then5, label %if.else

if.then5:                                         ; preds = %if.end3
  %rem = urem i64 %x, %y
  br label %if.end7

if.else:                                          ; preds = %if.end3
  %rem6 = urem i64 %y, %x
  br label %if.end7

if.end7:                                          ; preds = %if.else, %if.then5
  %x.addr.0 = phi i64 [ %rem, %if.then5 ], [ %x, %if.else ]
  %y.addr.0 = phi i64 [ %y, %if.then5 ], [ %rem6, %if.else ]
  %call = call i64 @gcd(i64 noundef %x.addr.0, i64 noundef %y.addr.0)
  br label %return

return:                                           ; preds = %if.end7, %if.then2, %if.then
  %retval.0 = phi i64 [ %y, %if.then ], [ %x, %if.then2 ], [ %call, %if.end7 ]
  ret i64 %retval.0
}

이후:

; Function Attrs: nounwind uwtable
define dso_local i64 @gcd(i64 noundef %x, i64 noundef %y) #0 {
entry:
  %cmp = icmp eq i64 %x, 0
  br i1 %cmp, label %return, label %if.end

if.end:                                           ; preds = %entry
  %cmp1 = icmp eq i64 %y, 0
  br i1 %cmp1, label %return, label %if.end3

if.end3:                                          ; preds = %if.end
  %cmp4 = icmp ugt i64 %x, %y
  br i1 %cmp4, label %if.then5, label %if.else

if.then5:                                         ; preds = %if.end3
  %rem = urem i64 %x, %y
  br label %if.end7

if.else:                                          ; preds = %if.end3
  %rem6 = urem i64 %y, %x
  br label %if.end7

if.end7:                                          ; preds = %if.else, %if.then5
  %x.addr.0 = phi i64 [ %rem, %if.then5 ], [ %x, %if.else ]
  %y.addr.0 = phi i64 [ %y, %if.then5 ], [ %rem6, %if.else ]
  %call = call i64 @gcd(i64 noundef %x.addr.0, i64 noundef %y.addr.0)
  br label %return

return:                                           ; preds = %if.end, %entry, %if.end7
  %retval.0 = phi i64 [ %call, %if.end7 ], [ %y, %entry ], [ %x, %if.end ]
  ret i64 %retval.0
}

보시는 것처럼 if.then, if.then2 basicblock이 없어지고 control flow가 simplify되긴 했습니다.

그런데 어셈블리 변환될 때 미묘하게 instruction이 하나 추가됩니다. 이전:

start gcd 2:
.entry:
r1 = icmp eq arg1 0 64
br r1 .if.then .if.end
.if.then:
r1 = mul arg2 1 64
br .return
.if.end:
r1 = icmp eq arg2 0 64
br r1 .if.then2 .if.end3
.if.then2:
r1 = mul arg1 1 64
br .return
.if.end3:
r1 = icmp ule arg1 arg2 64
br r1 .if.else .if.then5
.if.then5:
r1 = urem arg1 arg2 64
r2 = mul arg2 1 64
br .if.end7
.if.else:
r2 = urem arg2 arg1 64
r1 = mul arg1 1 64
br .if.end7
.if.end7:
r1 = call gcd r1 r2
br .return
.return:
ret r1
end gcd

이후:

start gcd 2:
.entry:
r1 = icmp eq arg1 0 64
r2 = mul arg2 1 64 ; added
br r1 .return .if.end
.if.end:
r1 = icmp eq arg2 0 64
r2 = mul arg1 1 64
br r1 .return .if.end3
.if.end3:
r1 = icmp ule arg1 arg2 64
br r1 .if.else .if.then5
.if.then5:
r1 = urem arg1 arg2 64
r2 = mul arg2 1 64
br .if.end7
.if.else:
r2 = urem arg2 arg1 64
r1 = mul arg1 1 64
br .if.end7
.if.end7:
r2 = call gcd r1 r2
br .return
.return:
ret r2
end gcd

entry에 r2 = mul arg2 1 64이 추가된 게 보이시나요? control flow가 간단해지면서 return block은 entry를 predecessor로 갖게 되고, return block에서 phi를 처리하는 과정에서 entry에 %y를 복사해서 붙인 것입니다.

CFG에서 오히려 성능이 떨어지는 케이스는, 이런 식으로 백엔드에서 phi 처리하는 부분과 관련있는 문제로 보입니다.

pzmaus commented 1 year ago

gcd저격패스를 따로 만들까 고민하고 있었는데 방법이 떠오르지 않네요.. 해당 pass를 여러 곳에 삽입해서 시간을 줄인 것처럼, 다른 pass들도 가능한지 한번 살펴보겠습니다. 수고하셨습니다!

SyphonArch / swpp202301-compiler-team1