facebookarchive / BOLT

Binary Optimization and Layout Tool - A linux command-line utility used for optimizing performance of binaries
2.51k stars 176 forks source link

Fix LongJmp pass #222

Closed yota9 closed 2 years ago

yota9 commented 2 years ago

This patch handles 2 problems with LongJmp pass:

  1. The pass should be executed before FinalizeFunctions, since the pass may add new entry points for the function, and the BinaryFunction::addEntryPoint has an assert "CurrentState == State::CFG"
  2. Replaced shortJmp implementation with position-independent code. Currently we could handle PIC binaries with max +-4Gb offsets, the longJmp uses absolute addreses and could be used only in non-PIE binaries.

Vladislav Khmelevsky, Advanced Software Technology Lab, Huawei

yota9 commented 2 years ago

I'm a little bit in doubts about moving the longjmp pass upper. Maybe I should just add CFG_Finalized condition to assert in addEntryPoint? What do you think @rafaelauler ?

rafaelauler commented 2 years ago

I think it makes sense to move the pass before the finalize pass because we are doing modifications on the functions (inserting basic blocks), and if the CFG is marked as finalized, we shouldn't be doing modifications on it. Frame optimizations are a weird exception, but it doesn't run for AArch64, so we don't need to worry about it.

yota9 commented 2 years ago

Yes, I agree with you. But should not we also move PatchEntries before LongJmp pass? It seems to be logical. UPD originally it was before, I will move it upper now

rafaelauler commented 2 years ago

Good point, you're right. But PatchEntries was created for lite relocation mode (-relocs=1 -lite=1), when we don't want to process all functions, but just process hot code. In this mode, code that jumps to the old location of a hot function may still exist, and we need to patch these old locations. I never personally tested AArch64 with that. Is AArch64 using that mode? If that mode is not mature for AArch64, we should probably enforce lite=0 for AArch64.

Assuming we want to support -lite=1 for AArch64, I'm not sure LongJmp will handle correctly these injected functions that PatchEntries is creating. Perhaps we could try to patch with a longer code sequence that does not need to be fixed up by LongJmp? Either that, or we need to test that LongJmp is doing the right thing for these injected functions that PatchEntries is creating.

yota9 commented 2 years ago

Good point! I will add this to our backlog task for testing some time later :) But anyway it is a bit out of scope of this patch, let's return to this question a little bit later

rafaelauler commented 2 years ago

Some (internal) tests are crashing after PatchEntries got moved before finalize functions, but I didn't investigate exactly why yet. Meanwhile would you prefer to commit the previous version that does not move PatchEntries? We are working on upstreaming some internal tests, so hopefully that crash will be visible externally.

My advice would be to do not rely on LongJmp to fix PatchEntries, and to emit the longest branch sequence right off the bat.

yota9 commented 2 years ago

Some (internal) tests are crashing after PatchEntries got moved before finalize functions

It will be interesting to see why, I could not see a reason now..

My advice would be to do not rely on LongJmp to fix PatchEntries, and to emit the longest branch sequence right off the bat.

I've reverted the PatchEntries move for now and changed the tailcall instruction to shortjmp creation, I think it might be the solution, what do you think?

rafaelauler commented 2 years ago

From Maksim: This will likely break x86-64 functionality, as using short jump will result in wrong size estimation for the patch. Would it make sense to add another version of tail call for AArch64 and use it in PatchEntries?

yota9 commented 2 years ago

Oh I see, thank you!

Would it make sense to add another version of tail call

Smth like createLongTailCall? Maybe just under if ARCH condition create tailcall for x86 and shortJmp for aarch64? + Comment why it is needed. Or in case of creating createLongTailCall for both architectures I think we could replace the JMP_4 to JMP_1 in original createDirectCall function?

maksfb commented 2 years ago

createLongTailCall() appears more preferable to me. For x86 implementation, you can simply invoke createTailCall(). While lowering tail calls, we convert JMP_4 to JMP_1 and rely on MC relaxation to convert back to JMP_4 when needed. There's no need to modify createDirectCall().

yota9 commented 2 years ago

@maksfb Thank you for your answer! By lowering tail calls you mean the shortenInstruction call? I was meaning that for tail calls we are currently creating JMP_4 instruction in createDirectCall Inst.setOpcode(IsTailCall ? X86::JMP_4 : X86::CALL64pcrel32); So maybe creating relaxed jmp instruction here was made specially for PatchEntries pass. If I will create createLongTailCall() than I can replace it for both x86 and aarch64 in the PatchEntries pass and I thought maybe I can lower JMP_4 to JMP_1 in createDirectCall, if there are no more special cases like the PatchEntries pass?

maksfb commented 2 years ago

Sorry, I meant that InstructionLowering pass will invoke lowerTailCall() for JMP_4 -> JMP_1 conversion. It will run after PatchEntries.

Having a single/common call to createLongTailCall() in PatchEntries sounds good to me.

yota9 commented 2 years ago

@maksfb Now I see, thank you! Ok I won't change the default one if it does not really matter :)

yota9 commented 2 years ago

@maksfb I've updated the patch. How do you think is there a reason to mark branch in short jmp with setTailCall? It seems to be no actual optimizations will require this on the PatchEntries moment, but it might be good for universality, what do you think?

maksfb commented 2 years ago

Great! Yes, it will be good to mark the instruction as a tail call for consistency.

yota9 commented 2 years ago

@maksfb @rafaelauler I think I've found one more issue here occasionally, in relax() the DotAddress was not incremented for the instruction that needs relaxation.

rafaelauler commented 2 years ago

@maksfb @rafaelauler I think I've found one more issue here occasionally, in relax() the DotAddress was not incremented for the instruction that needs relaxation.

Good catch!