facebookarchive / BOLT

Binary Optimization and Layout Tool - A linux command-line utility used for optimizing performance of binaries
2.51k stars 177 forks source link

Why skip-function when it jump to itself? #298

Closed CcWeapon closed 2 years ago

CcWeapon commented 2 years ago

in BinaryFunction.cpp::BinaryFunction::disassemble()

        if (IsCall && containsAddress(TargetAddress)) {
          if (TargetAddress == getAddress()) {
            // Recursive call.
            TargetSymbol = getSymbol();
          } else {
            if (BC.isX86()) {
              // Dangerous old-style x86 PIC code. We may need to freeze this
              // function, so preserve the function as is for now.
              PreserveNops = true;
            } else {
              errs() << "BOLT-WARNING: internal call detected at 0x"
                     << Twine::utohexstr(AbsoluteInstrAddr) << " in function "
                     << *this << ". Skipping.\n";
              IsSimple = false;
            }
          }
        }

my device is aarch64, and bolt is set to false for the IsSimple of this func. What's the reason for bolt doing this?

yota9 commented 2 years ago

Hello @CcWeapon ! This is good question, I think it is there for historical reasons. I was thinking about removing this, but the case is quite rare and I didn't found it in my code base yet. To remove this I think we will need to write a few tests just to be sure everything works fine. Right now I don't see reasons why we won't handle such cases properly.. Maybe @maksfb or @rafaelauler might tell something more about it..

rafaelauler commented 2 years ago

If you reaching this line, it means your input binary has a weird function with a call to an address that is inside the same function of the call. It is not a recursive call because the call target is not the function address itself, but rather a block inside it. It is doing something like this:

func: ... instructions basicblock-a: .... instructions basic-block-b: call basicblock-a

We expect control transfer inside the same function to happen through jump/branch instructions. When we detect a call to a block inside the same function, we conservatively assume we do not understand the CFG for this function and mark it as non simple, which means we will not optimize this function or change anything there.

In X86, there are some reasons you might have a call like that. For example, to get current PC address via return address:

func: call label label: pop %rax ; RAX now has label address

rafaelauler commented 2 years ago

Example of code exercising this case:

// This is reduced test case for BOLT containing an internal call based on // GetCoreDump (from google core dumper).

.text .globl getCallback .type getCallback, %function getCallback: .cfi_startproc pushq %rbp movq %rsp, %rbp pushq %r12 pushq %rbx subq $288, %rsp callq .Lnext_instr .Lnext_instr: popq %rax addq $17, %rax addq $288, %rsp popq %rbx popq %r12 popq %rbp retq .Lweird_callback: mov $0xDEADBEEF, %rax retq .cfi_endproc

For this code, we probably need to mark the function as ignored, and not as non-simple. Ignore means we will not move it.

yota9 commented 2 years ago

I agree that the case is weird, but just wondering, but are there any real blocking things that prevents us from optimizing such things? Yes, there would be fall-through BB and we will create extra entry point, but actually it seems that everything should work fine. Or it us just an assumption that something weird asm-written might going on here and it is better not to touch such things?

CcWeapon commented 2 years ago

You can find art_quick_osr_stub() in AOSP, it calls .Losr_entry in itself. If I don't set this function to simple, what error might happen?

ENTRY art_quick_osr_stub
    SPILL_ALL_CALLEE_SAVE_GPRS             @ Spill regs (9)
    vpush  {s16-s31}                       @ Spill fp-regs (16)
    .cfi_adjust_cfa_offset 64
    SAVE_SIZE=(9*4+16*4)
    mov    r11, sp                         @ Save the stack pointer
    .cfi_def_cfa r11, SAVE_SIZE            @ CFA = r11 + SAVE_SIZE
    .cfi_remember_state
    mov    r10, r1                         @ Save size of stack
    ldr    r9, [r11, #(SAVE_SIZE+4)]       @ Move managed thread pointer into r9
    REFRESH_MARKING_REGISTER
    mov    r6, r2                          @ Save the pc to call
    sub    r7, sp, #12                     @ Reserve space for stack pointer,
                                           @    JValue* result, and ArtMethod* slot.
    and    r7, #0xFFFFFFF0                 @ Align stack pointer
    mov    sp, r7                          @ Update stack pointer
    str    r11, [sp, #4]                   @ Save old stack pointer
    str    r3, [sp, #8]                    @ Save JValue* result
    mov    ip, #0
    str    ip, [sp]                        @ Store null for ArtMethod* at bottom of frame
    // r11 isn't properly spilled in the osr method, so we need use DWARF expression.
    // NB: the CFI must be before the call since this is the address gdb will lookup.
    // NB: gdb expects that cfa_expression returns the CFA value (not address to it).
    .cfi_escape                            /* CFA = [sp + 4] + SAVE_SIZE */ \
      0x0f, 6,                             /* DW_CFA_def_cfa_expression(len) */ \
      0x92, 13, 4,                         /* DW_OP_bregx(reg,offset) */ \
      0x06,                                /* DW_OP_deref */ \
      0x23, SAVE_SIZE                      /* DW_OP_plus_uconst(val) */
    bl     .Losr_entry                     @ Call the method
    ldr    r10, [sp, #8]                   @ Restore JValue* result
    ldr    sp, [sp, #4]                    @ Restore saved stack pointer
    .cfi_def_cfa sp, SAVE_SIZE             @ CFA = sp + SAVE_SIZE
    ldr    r4, [sp, #SAVE_SIZE]            @ load shorty
    ldrb   r4, [r4, #0]                    @ load return type
    cmp    r4, #68                         @ Test if result type char == 'D'.
    beq    .Losr_fp_result
    cmp    r4, #70                         @ Test if result type char == 'F'.
    beq    .Losr_fp_result
    strd r0, [r10]                         @ Store r0/r1 into result pointer
    b    .Losr_exit
.Losr_fp_result:
    vstr d0, [r10]                         @ Store s0-s1/d0 into result pointer
.Losr_exit:
    vpop   {s16-s31}
    .cfi_adjust_cfa_offset -64
    pop    {r4, r5, r6, r7, r8, r9, r10, r11, pc}
.Losr_entry:
    .cfi_restore_state
    .cfi_def_cfa r11, SAVE_SIZE            @ CFA = r11 + SAVE_SIZE
    sub sp, sp, r10                        @ Reserve space for callee stack
    sub r10, r10, #4
    str lr, [sp, r10]                      @ Store link register per the compiler ABI
    mov r2, r10
    mov r1, r0
    mov r0, sp
    bl  memcpy                             @ memcpy (dest r0, src r1, bytes r2)
    bx r6
END art_quick_osr_stub
rafaelauler commented 2 years ago

My worry with the function I pasted is that we have an unidentified code reference that we will not update / will break if we move these functions:

callq .Lnext_instr
.Lnext_instr:
popq %rax
addq $17, %rax   ;  <--  17 bytes of distance after .Lnext_instr, pointing to .Lweird_callback
addq $288, %rsp
popq %rbx
popq %r12
popq %rbp
retq
.Lweird_callback:   ;  <-- we can't move this block
mov $0xDEADBEEF, %rax
retq
.cfi_endproc

In the example above, the immediate $17 will not be updated by BOLT, and moving this code will break it. That's why we conservatively put this function into a block list.

CcWeapon commented 2 years ago

thx~