facebookarchive / BOLT

Binary Optimization and Layout Tool - A linux command-line utility used for optimizing performance of binaries
2.51k stars 176 forks source link

Is there a reason for instrumentation pass to be one threaded? #185

Closed yota9 closed 2 years ago

yota9 commented 3 years ago

Hello! I see that instrumentation pass is prepared to be multithreaded, but the ForceSequential flag is set. Is there any reason for that? I'm asking because I've got pass that I've prepared to be multithreaded too, but I've out that I've got couple of issues. First of all my pass could create some MCInst annotations, so I need to use runOnEachFunctionWithUniqueAllocId. Also I'm using BC.computeInstructionSize() inside, so I also need to create Emitter to be able to use it in parallel. So now sometimes I see assertion while emitFunction is running, MCELFStreamer::EmitSymbolAttribute to be precise while it tries to cast MCSymbol to MCSymbolElf. I don't know what it the reason here, probable I need to investigate it feather one day. But anyway did you observed such issues when you were preparing Instrumentation pass to be multithreaded or there are another reasons why it is currently using only one thread?

Also I see that runOnEachFunctionWithUniqueAllocId function has this code inside: if (!BC.MIB->checkAllocatorExists(AllocId)) { MCPlusBuilder::AllocatorIdTy Id = BC.MIB->initializeNewAnnotationAllocator(); assert(AllocId == Id && "unexpected allocator id created"); }

Could I refactor it a bit and add mutex to MCPlusBiulder class and lock it on initializeNewAnnotationAllocator, and than just use the return value in runOnEachFunctionWithUniqueAllocId? Also why is it necessary to postpone task execution? The runOnEachFunction does not do it..

Bonus questions :) Speaking of BC.computeInstructionSize() function - did you observed any issues with it? We've got couple, first of all it seems to be that while using it it does not clear some of the allocated memory after it. We currently didn't investigate this issue, but it seems to be some memory leaks are presented, maybe you know the reason.. The second issue (at least on master branch with older LLVM) - it does not return right size for multi-byte x86 NOP instruction, although BC.DisAsm->getInstruction() returns the correct size.

Thank you!

rafaelauler commented 3 years ago

It's not always the case that a parallel pass will run faster, it depends a lot on how much synchronization you are performing. I think in instrumentation I gave up on the idea when I found the upside wasn't too attractive, but I left the code there in case this needs to be re-evaluated. Maybe it was a bad idea, since it is indeed weird to have that with a force sequential option.

Like you observed, it's far from trivial to get it running without any issues and sometimes the pass is just too fast to be worth it. In general, anything that deals with MCContext needs to be synced. Also, I took a look and it appears that calling emitSymbolAttribute in turn calls MCAssembler::registerSymbol which uses "push_back" on an internal data structure shared among all threads. I imagine that will cause a race if running in parallel. When dealing with that, just imagine that anything dealing with LLVM's assembler/streamer interfaces will need to be either synced or duplicated (such as is the case with computeInstrucitonSize). Some passes need to access the streamer so frequently that it defeats the purpose of running in parallel, because threads spend most of their time waiting on locks.

On the refactoring bit: Sounds reasonable, I don't think it is strictly necessary to postpone, it's just buying time to initialize the allocators.

On computeInstructionSize() leaking memory: I'm interested on that, we run some sanitizers, but they are not warning us about leaked memory. I'll see if I can repro.

On multi-byte NOP instruction: we usually ignore NOPs, but that's good to know. Perhaps we can add a warning comment in the code about this current limitation of computeInstructionSize.

yota9 commented 3 years ago

@rafaelauler Thank you for your unswer!

Like you observed, it's far from trivial to get it running without any issues

Yes, it is true.. For now I will leave it as-is, if I will have a time I will try to find out what is going on..

Sounds reasonable, I don't think it is strictly necessary to postpone, it's just buying time to initialize the allocators

I will do this soon.

On computeInstructionSize() leaking memory: I'm interested on that, we run some

You can create a dummy pass and run through all instructions with computeInstructionSize() and see if it will increase your memory consumption. It has something to do with emitter, so do it in one thread and use the just use the default emitter..

On multi-byte NOP instruction: we usually ignore NOPs, but that's good to know.

Well, small spoiler, this is one one the moments that golang support needs. We need to see the whole code as it was in the binary, so the nop removal and shortenInstruction must be moved to separate optimization passes. Anyway I think it is good to do it from the logical standpoint, so we know that the current BF is exactly the same, as in the executable before optimizations are applied. For now I've just done quick hack to solve the issue - store the size for multi-byte nops in instruction annotation :)

aaupov commented 3 years ago

@yota9: I don't know if it's related but I ran into a similar issue with computeInstructionSize in PatchEntries::runOnFunctions. My solution was to relax the instruction before calling computeInstructionSize (not pushed yet):

--- a/bolt/src/Passes/PatchEntries.cpp
+++ b/bolt/src/Passes/PatchEntries.cpp
@@ -60,6 +60,8 @@ void PatchEntries::runOnFunctions(BinaryContext &BC) {
     MCInst TailCallInst;
     BC.MIB->createTailCall(TailCallInst, BC.Ctx->createTempSymbol(),
                            BC.Ctx.get());
+    // Conservative estimate for instruction size.
+    BC.MAB->relaxInstruction(TailCallInst, *(BC.STI));
     PatchSize = BC.computeInstructionSize(TailCallInst);
   }
yota9 commented 3 years ago

@aaupov Thanks! But as I see it does not influence on NOP instructions. Also please be aware, I need to know exactly the original size of the instruction that was in the binary for Golang support. So this solution will probably break it..

aaupov commented 3 years ago

Yes, I think keeping the original size as an annotation if fine. I was thinking it might be possible to deduce it from subsequent instruction offset, but in some cases it's tricky, unreliable or impossible. So I'd also stick with an annotation.

yota9 commented 3 years ago

@aaupov Great! I also tried to use offset annotation for this purpose, but it doesn't really cover all the cases..