Open 0xBEEEF opened 6 years ago
This is a good question. I should probably write a few words about it on the wiki.
Currently? None.
Capstone2llvmir currently translates instructions in these modes:
PMULHUW) gets an LLVM IR function call to a declaration-only function (no body). This call includes parameters (inputs) and return values (outputs). This is important for data-flow analysis: it shows which objects (registers, stacks, globals, etc.) are used and potentially changed. Example of such a call in the output C: __asm_PMULHUW(mm1, mm2). These translations are currently hand-written for only a few instructions (I think PMULHUW is not actually among them).
What I would like to do next in #115, and discussion about future extensions:
__asm_<instr_name>() call if there is no translation routine for instruction instr_name.
Decompilation is one thing, but someone who wants to use the RetDec framework for other purposes might need semantics for even very complex instructions. Right now, I would say that projects like QEMU or McSema are a better alternative in such a case. However, someone might add complex semantics to Capstone2llvmir on their own; we currently have no such plans. This would not be easy, but with good groundwork prepared, it might not be so bad. After all, someone had to hand-write these things in QEMU as well. Even if this happens, it would not be beneficial for decompilation (as already explained). So we would either have to keep these Capstone2llvmir translators separate, or have it all in one translator and be able to tell it what should and should not be translated, or which mode to use for which instructions.
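The fallback to an __asm_<instr_name>() call and the idea of choosing a mode per instruction can be sketched roughly as follows. This is only an illustration in Python, not Capstone2llvmir code: the names Mode, Insn, ROUTINES and translate are hypothetical, the output is a simplified textual pseudo-IR rather than real LLVM IR, and treating the first operand as the destination is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FULL = "full"           # use a hand-written translation routine
    PSEUDO_CALL = "pseudo"  # emit a call to a declaration-only function


@dataclass
class Insn:
    name: str
    operands: tuple


def translate_add(insn):
    # Hypothetical hand-written routine for a simple instruction.
    dst, src = insn.operands
    return f"{dst} = add {dst}, {src}"


# Hypothetical table of hand-written translation routines.
ROUTINES = {"ADD": translate_add}


def translate(insn, modes=None):
    """Return (pseudo_ir_line, declarations) for one instruction."""
    modes = modes or {}
    routine = ROUTINES.get(insn.name)
    default = Mode.FULL if routine else Mode.PSEUDO_CALL
    mode = modes.get(insn.name, default)

    if mode is Mode.FULL and routine is not None:
        return routine(insn), []

    # No routine (or pseudo mode requested): call a function with no body.
    # The operands stay visible, so data-flow analysis can still see
    # which objects are read and which one is potentially changed.
    fn = f"__asm_{insn.name}"
    call = f"call {fn}({', '.join(insn.operands)})"
    if insn.operands:
        # Assumption: the first operand is also the destination.
        call = f"{insn.operands[0]} = {call}"
    return call, [f"declare {fn}(...)"]


if __name__ == "__main__":
    print(translate(Insn("ADD", ("eax", "ebx"))))
    print(translate(Insn("PMULHUW", ("mm1", "mm2"))))
    # Force the pseudo-call mode even though a routine exists.
    print(translate(Insn("ADD", ("eax", "ebx")), {"ADD": Mode.PSEUDO_CALL}))
```

Keeping the mode in a per-instruction table is one way a single translator could serve both decompilation (pseudo calls) and other uses (full semantics) without maintaining two separate translators.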
I will create a wiki page once #115 is solved (todo).
@PeterMatula Thank you for your very detailed answer! I wasn't aware that such a simple question would entail so much. You mentioned two other frameworks that seem to be able to translate executable files to LLVM. I was only aware of QEMU, but not McSema. I find this one very interesting. Would it be possible to combine the two programs? That is, one of the two frameworks would translate a program to LLVM, and your program would then decompile it? Then you wouldn't have to implement all of these complex topics yourself, but could fall back on existing work. Wouldn't it also be an advantage for you to have less maintenance effort? McSema still seems to depend heavily on IDA. You've already made significant progress. Wouldn't McSema also benefit from your tools, so that fewer dependencies are needed just to translate existing programs to LLVM? Have you ever thought about joining the projects? After all, everyone would ultimately benefit from this.
QEMU does not produce LLVM IR; it has its own intermediate language. However, as I understand it (which is not all that much, so I might be wrong), matters there are even more complicated: not all instructions are modeled directly in TCG. More complicated instructions (basically everything you asked about) are modeled as routines in C that somehow get compiled and used; I'm not really sure how. If I'm wrong, and someone with more insight is reading this, please correct me. I would be interested to know more.
The rev.ng project uses QEMU to produce LLVM IR.
McSema does produce LLVM IR. It also translates some of the extensions you asked about. However, as I understand it, it is more focused on QEMU-like use cases than on human-readable decompilation. Again, I might be wrong about this. But like I said, translating these sets is not really beneficial to decompilation output quality. Just look at how McSema handles the x86 FPU. There are benefits to this approach if you want to emulate programs, or check them with tools like KLEE. But C produced from it would look terrible. Moreover, they are using IDA for control flow recovery, so it is unclear whether this is only for convenience, or whether it would be hard/impossible to write a recursive traversal disassembler on top of the LLVM IR they produce. The LLVM IR produced by our capstone2llvmir is designed with this in mind.
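To make the readability point concrete, here is a rough sketch of what fully modeled semantics for a single instruction involve, using the 64-bit MMX form of PMULHUW (multiply packed unsigned 16-bit words, keep the high 16 bits of each product). It is written in Python purely for illustration; it is not output of RetDec or McSema, and the function name pmulhuw is made up for this example.

```python
MASK16 = 0xFFFF


def pmulhuw(a: int, b: int) -> int:
    """Model PMULHUW on two 64-bit values holding four unsigned 16-bit lanes."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        x = (a >> shift) & MASK16  # extract lane from first operand
        y = (b >> shift) & MASK16  # extract lane from second operand
        high = (x * y) >> 16       # keep the high 16 bits of the 32-bit product
        result |= high << shift    # write the lane back
    return result


if __name__ == "__main__":
    print(hex(pmulhuw(0xFFFF_0002_8000_0001, 0xFFFF_8000_0004_FFFF)))
```

A decompiler that expanded every such instruction inline would emit this kind of lane-by-lane shifting and masking, whereas a single call such as __asm_PMULHUW(mm1, mm2) tells a human reader the same thing in one line.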
To conclude:
Now it's all clear to me! Thanks again for this detailed explanation. If even IDA does not support these instruction sets, then it makes sense. And you're right about floating point operations!
A few notes on how IDA does this: https://www.hex-rays.com/products/decompiler/manual/intrinsics.shtml
We should look into this and come up with a solution that will let us deal with these instructions without cluttering the output, but in a way that provides enough information on what is going on to RetDec analyses and human users.
This issue might get solved as part of a bachelor thesis - see milestone.
First of all, yes, I know that this is not a forum. But I still have a stupid question. What about instruction set extensions in general? Which ones are already supported and which ones should be supported? I have mainly thought of the following ones:
Many programs and modern compilers use them automatically to speed up certain operations. I don't have a compiler that covers all the options. Maybe a wiki page wouldn't be wrong, because this question will surely come up again and again.