Move disassemble optimizations to optimization passes

yota9 commented 2 years ago

The patch moves the shortenInstructions and nop remove to separate binary passes. As a result when llvm-bolt optimizations stage will begin the instructions of the binary functions will be absolutely the same as it was in the binary. This is needed for the golang support by llvm-bolt. Some of the tests must be changed, since bb alignment nops might create unreachable BBs in original functions.

Vladislav Khmelevsky, Advanced Software Technology Lab, Huawei

maksfb commented 2 years ago

Thank you for this pull request. It's useful beyond just golang support.

yota9 commented 2 years ago

@maksfb Thank you for your comments, done!

Did you verify that LLVM disassembler correctly detects all of them and generated code matches the input

I have a problem with it indeed. For multi-byte nop instructions I've got a hack during disassembly used for golang support:

    if (BC.isX86() && BC.MIB->isNoop(Instruction) && Size > 1) {
      MIB->addAnnotation(Instruction, "Size", static_cast<uint32_t>(Size));
    }

I didn't investigate this problem further currently..

getianao commented 2 years ago

That's a great patch and is useful for me. Can't wait to see the patch being merged.

yota9 commented 2 years ago

Hello @maksfb ! It seems to be the test is incomplete. The makefile is wrong, e.g. to build the object file it needs to build bolt file first. I've fixed that but I'm unable to build it, there are some compiler errors + there is no main function. I'm tried to bolt the executable that was in the archive and it is finished fine, but test seems to be want to read smth from stdio, so I'm not sure how to deal with it :) Also rebased it to the latest commits.

maksfb commented 2 years ago

Hi @yota9, my guess is that you cannot run the binary because of the dynamic dependencies. You should be able to see the difference in the binary output with and without your change though. Since you are just moving passes, I’d expect the binaries to be the same.

yota9 commented 2 years ago

Hello @maksfb ! I think I've found where most of the differences are came from. First of all we don't need to skip these optimizations for non-simple functions (like we didn't do it before). But mostly important I think we will need to activate fixDoubleJumps optimization pass. There are some cases like

BB1:
    < code >
    je BB4
BB2:
    nop
BB3:
    < code >
BB4:
    jmp BB3

So after NOP removal we will end up with empty BB. After fix-branches we will have double jmp BB1 -> BB2 -> BB3. Could you please check if the test works now adding -peepholes=double-jumps option? If yes I will refactor the code moving fix-branches to stand-alone pass and running it unconditionally, although I'm not sure will it fix the problem or not, since I don't really see it here

yota9 commented 2 years ago

I've updated the patch, now the diff in text is about ~200 bytes, probably due to double-jumps option :)

maksfb commented 2 years ago

Thanks, @yota9! Will give it a try.

yota9 commented 2 years ago

@maksfb I've checked, the last bytes are changed by the fixBranches call, it inverts the cond jumps in a few places, I don't think it's a big deal. Also we should probably preserve branch information in replaceSuccessor call in fixDoubleJumps. Anyway I hope it will work now

maksfb commented 2 years ago

@yota9, still got a crash.

How do you use the test case I've uploaded? You have to add llvm-bolt to your PATH and then run make.

It seems to be the test is incomplete. The makefile is wrong, e.g. to build the object file it needs to build bolt file first.

The object file is going to be built using the BOLTed compiler cc1plus.bolt. %.bolt rule in the Makefile will produce that compiler. cc1plus.bolt is the one that is crashing for me with the change.

I've fixed that but I'm unable to build it, there are some compiler errors + there is no main function. I'm tried to bolt the executable that was in the archive and it is finished fine, but test seems to be want to read smth from stdio, so I'm not sure how to deal with it :)

Oh. I guess you were trying to run BOLT on bf_gcc? bf_gcc.cpp should be passed to cc1plus.bolt using the options in the Makefile.

yota9 commented 2 years ago

@maksfb Oh I see, thanks for your explanation! I'm having Segmentation fault now too

maksfb commented 2 years ago

Hi @yota9, thank you for the investigation and for the fix. @rafaelauler, could you comment on the shrink-wrapping pass?

Do you know what is different about the function where the bug surfaced? Is it a different CFG or a different profile with your change?

I believe we have to change the profile-reading logic starting here: https://github.com/facebookincubator/BOLT/blob/61a2062542cd7f015cde0e0ff06abe305993943d/bolt/lib/Profile/DataReader.cpp#L744-L748

yota9 commented 2 years ago

@maksfb I've updated the last patch, it is the stable_sort the place where the error occurs I think. We need to place stores first, and the conditions for A and B were messed as I see. So I've removed the previous comment. But I still have a question about the fixDobleJmp pass, how do you think could be enable it by default?

Do you know what is different about the function where the bug surfaced? Is it a different CFG or a different profile with your change?

Sorry, currently I didn't have time to check why this situation occured

@rafaelauler Could there be situation with multiple pushes by the way? Should we sort them too in mirrored order like with pops?

maksfb commented 2 years ago

With the introduction of nop basic blocks, it makes sense for me to run fixDoubleJumps right after removeNops. I don't think you need to invoke fixBranches(). What was the reason you've added the call to it?

I don't have the details, but we've run into a problem with fixDoubleJumps() on AArch64 in the past. It was something related to tail calls.

rafaelauler commented 2 years ago

Thanks @yota9 for fixing the bug in shrink wrapping! Changes on the shrink-wrapping sorting look good to me.

I've created a testcase for your change, could you please put it together with your change? Here's the testcase: https://pastebin.com/nfTZKCpQ

Regarding your question, you're right, I think we would need to reorder the stores as well, but I haven't been able to create a test case that triggers a bug with that yet. I'll take a look at that.

Also, once this is ready, could you squash all your changes in a single commit, or put each commit in a separate PR? Thanks!

yota9 commented 2 years ago

@rafaelauler Thank you for the test case! It's in #251 @maksfb I didn't check it, I just done it due to the jump instruction should change it's target, I have another idea of fixing this anyway. But back to this patch, it seems to be that the problem has been solved, I'm not sure if I need to change DataReader, it seems to be working fine. As I saw the changes in that function are caused by not removed double jmp for some reason, however I'm not 100% sure to be honest, but the basic blocks stat seems to be right. So I hope that after refactoring it will be ready to go.

yota9 commented 2 years ago

Hello @maksfb @rafaelauler ! I've updated the nop removal pass, checking that BB is empty we may change the successor in the predecessors, than the EliminateUnreachableBlocks pass will remove such BBs and fixBranches will fix the instructions. UPD: The problem with bb_with_two_tail_calls is that we are removing empty BB now, and the const bool HasFallthrough = (NextBlock && PredSucc == NextBlock); variable becomes true, so we won't insert JMP instruction and eliminate double jmp further to unite 2 BBs. So I just need to replace NOP instruction with other one and the test will pass. So now everything seems to be fine UPD2: I forgot to run tests after rebase. The BOLT :: X86/peepholes.test is easy to fix. I will take a closer look to the BOLT :: X86/fallthrough-to-noop.test in the nearest time. The BOLT :: runtime/meta-merge-fdata.test seems to be failing constantly in CI.

maksfb commented 2 years ago

Hi @yota, any update on fallthrough-to-noop.test failure?

yota9 commented 2 years ago

Hello @maksfb. Sorry I didn't have time to check it yesterday, I will back to this soon :)

maksfb commented 2 years ago

@yota9, we need to provide "backwards compatibility" for .fdata produced before this change. fallthrough-to-noop.test fails because the destination for the fall-through is in the middle of the basic block. We can check that it's right after the nop, but then we need the Size annotation.

yota9 commented 2 years ago

@maksfb The size for the noop Instruction you mean? The golang pass already adds it buy the way :)

maksfb commented 2 years ago

Yes, that's what I meant since you've mentioned it earlier.

yota9 commented 2 years ago

Hello @maksfb ! Thank you for your comments, I've fixed the test, although as you said the problem was only with "backwards compatibility", that is why I didn't face this issue before, since we don't have such needs :) Anyway seems to be everything works now as expected

maksfb commented 2 years ago

Hi @yota9, thank you for the quick fix. We have more .fdata compatibility issues and will post a test case soon.

yota9 commented 2 years ago

@maksfb Is it the test problem or you need that for your production? To be honest I'm not really like the code I've added today, since new profile will work just fine. It is a question of how many issues and extra code we will need to add to support backward compatibility. I mean if we are speaking about production env, than probably you are getting the profile each time the software is changed, since the old profile is no longer valid obviously.. Do you have some services that are rarely changed and you use the same profile each time? Maybe we should add profile versioning or something to raise an error/warning in case the version mismatch like the PGO does, what do you think?

rafaelauler commented 2 years ago

@yota9 Can you try these tests with your patch? https://github.com/rafaelauler/bolt-tests

I've put some instructions in the README regarding how to run these tests. Let me know if it works for you.

yota9 commented 2 years ago

@rafaelauler I've managed to reproduce it, thank you! It will need some time to figure out for all 8 tests :(( As I understand currently you have some problems only with aarc64 and the x86 is OK?

rafaelauler commented 2 years ago

@yota9 we have some large x86 tests failing too, I'll try to add them to the repo when I have some time.

yota9 commented 2 years ago

Hello @maksfb @rafaelauler ! After I've replaced ReorderBasicBlocks to be after EliminateUnreachableBlocks 2 tests from the repo above are almost works - only the "executed instructions" counter is a little bit differs and the "UCE removed 0 blocks and 0 bytes of code" should be changed/removed since we are removing empty BBs now. As I see most of the tests are using non-LBR profiles. And as I understand readSampleData doesn't use offset annotation. So when previosly we removed the nops the offsets were shifted and the samples might started point out on the wrong BBs. Now when the nops are in place the statistics seems to be calculated right, and the rest of the changes in binaries are came from BB reordering and hot-cold splitting (e.g. I see that some of the blocks previously was put in the cold section, now it is in hot and vice-versa). What do you think about it? Thank you!

yota9 commented 2 years ago

Hello @maksfb @rafaelauler ! Any thoughts on the investigation above? Thank you!

maksfb commented 2 years ago

@yota9, apologies for the delay. Will take a look later today.

maksfb commented 2 years ago

As I see most of the tests are using non-LBR profiles.

Which tests? I believe most of our tests should be using LBRs.

maksfb commented 2 years ago

As I see most of the tests are using non-LBR profiles.

Which tests? I believe most of our tests should be using LBRs.

Oh, AArch64?

yota9 commented 2 years ago

@maksfb Yes, take a look at the tests @rafaelauler gave above for aarch64. The profiles are mostly (all?) non-LBR

maksfb commented 2 years ago

@maksfb Yes, take a look at the tests @rafaelauler gave about for aarch64. The profiles are mostly (all?) non-LBR

Right, for AArch64 that's all we have. Please add LBR tests for AArch64 if you have them.

We have x86-64 tests failing as well, they are just not in the repo yet. They use LBR, but likely the offsets are off as well.

Sadly, we need to keep backwards profile compatibility at least for some time, and then we will have to update all our tests if we are going to remove it.

Did you think about removing empty blocks and fixing double jumps right after RemoveNops pass? Or is there a reason not to do it for golang?

yota9 commented 2 years ago

@maksfb I've updated the pass, please take a look, it is now making empty BBs unreachable and they will be removed by EliminateUnreachableBlocks pass later, do you think we need to elliminate them right in the pass?

Right, for AArch64 that's all we have. Please add LBR tests for AArch64 if you have them.

Sadly, we need to keep backwards profile compatibility at least for some time

If I understand it correctly the problem is that for non-LBR tests we won't be able to make absolute compatibility. As I said in non-LBR mode the offset annotations are ignored, so previously when the nops were removed the offsets in fdata samles might started to point on the wrong BB, which is located furher, since the offset was shifted. So now with NOPs it is pointing to the right BBs, but the BB reordering and hot-cold splitting might result in the other binary layout, since some of the previously HOT bbs might be cold now due to the nops. That is why (I think) the tests are failing now. But 2 tests are almost work, only the "executed instructions" counter is a little bit differs (due to the nops (?) and the "UCE removed 0 blocks and 0 bytes of code" should be changed/removed since we are removing empty BBs now.

maksfb commented 2 years ago

@maksfb I've updated the pass, please take a look, it is now making empty BBs unreachable and they will be removed by EliminateUnreachableBlocks pass later, do you think we need to elliminate them right in the pass?

The sooner in the pass manager we eliminate empty BBs the better. It will put less burden on later optimization passes. We only need nops and containing BBs for accurate byte-level representation of the input and for special cases where nops should be preserved. Right?

If I understand it correctly the problem is that for non-LBR tests we won't be able to make absolute compatibility.

Yes, sorry I wasn't clear, the backwards compatibility is needed for LBR profiles only. For AArch64 we need to update tests; and I cannot do it since I have limited access to such hardware. If you can update them or update with LBR profiles that's all we need. If you don't do this in this PR we'll have broken AArch64 tests.

yota9 commented 2 years ago

We only need nops and containing BBs for accurate byte-level representation of the input and for special cases where nops should be preserved. Right?

Sure. Although I don't think the empty unreachable BBs will make much of the difference but it is not a big deal to remove them right there.

If you don't do this in this PR we'll have broken AArch64 tests.

This is FBs internal tests, they are not in the repo, @rafaelauler loaded them in separate one :)

Currently all the X86 LBR tests I have are passed

maksfb commented 2 years ago

This is FBs internal tests, they are not in the repo, @rafaelauler loaded them in separate one :)

The reason we didn't put them in the main repo is that we don't want medium/large binaries in the main repo. The external repo will contain binary tests.

Currently all the X86 LBR tests I have are passed

I see a couple of test failures in CI (other than the meta test).

yota9 commented 2 years ago

@maksfb Sorry, my fault. Updated + erase invalid BBs in remove-nops pass.

yota9 commented 2 years ago

@maksfb All the tests (except meta) are passed

maksfb commented 2 years ago

Great. Thank you, @yota9. I will likely have to add more profile backwards compatibility fixes for our internal tests before we add these tests to the external binary repo.

Meanwhile, think about adding AArch64 LBR tests. The repo added by @rafaelauler is at a temporary location. Once we integrate into the monorepo, we would prefer moving this binary repo under llvm umbrella (similar to llvm-test-suite repo).

yota9 commented 2 years ago

@maksfb Great, I hope we will be able to submit this in the nearest time :)

Meanwhile, think about adding AArch64 LBR tests. The repo added by @rafaelauler is at a temporary location.

OK, we will add the task to replace the fdata in these tests to LBR ones to our backlog, we will handle this later.

yota9 commented 2 years ago

Hello @maksfb @rafaelauler ! I'm aware about aarch64 tests, our team will handle it, is there something else I can help you to merge this PR? Thank you! P.S. And other PRs too :)

maksfb commented 2 years ago

Hi @yota9, I'm currently working on fixing crashes related to this PR, adding backwards compatibility, and making sure all nops are removed.

yota9 commented 2 years ago

Thank you @maksfb ! I hope it will go smoothly :)

yota9 commented 2 years ago

Hello @maksfb @rafaelauler ! @ElvinaYakubova tried to intrument the only one left arm test from the repo above to get the LBR profile but the test is compiled without emit-relocs. Also the inputs for the test is needed to get the new profile. Thank you!

yota9 commented 2 years ago

Hello @maksfb @rafaelauler ! Any updates on this? Thank you

maksfb commented 2 years ago

@yota9, I think I resolved all issues but one that I'm currently tackling.

maksfb commented 2 years ago

This PR should be integrated after the next sync. I made some changes to the original PR code, let me know if they look good you. Thanks again.

facebookarchive / BOLT

Move disassemble optimizations to optimization passes #231