0xBEEEF commented 6 years ago

First of all, yes, I know this is not a forum. But I still have a stupid question. What about instruction set extensions in general? Which ones are already supported and which ones should be supported? I am mainly thinking of the following ones:

MMX
SSE
SSE2
SSE3
SSSE3
SSE4
SSE4a
SSE5
F16C
AVX
CLMUL
AES
FMA
TSX
BMI
MPX
SGX

Many programs and modern compilers use them automatically to speed up certain operations. I don't have a compiler that covers all the options. Maybe a wiki page wouldn't be wrong, because this question will surely come up again and again.

PeterMatula commented 6 years ago

This is a good question. I should probably write a few words about it on the wiki.

Currently? None.

Capstone2llvmir currently translates instructions in these modes:

1. Full semantic translation. The assembly instruction gets a sequence of LLVM IR instructions that should ideally capture its full semantics.
2. Pseudo call translation with inputs/outputs. The assembly instruction (e.g. PMULHUW) gets an LLVM IR call to a declaration-only function (no body). The call includes parameters (inputs) and return values (outputs). This is important for data-flow analysis: it shows which objects (registers, stacks, globals, etc.) are used and potentially changed. Example of such a call in the output C: _asm_PMULHUW(mm1, mm2). These translations are currently hand-written for only a few instructions (I think PMULHUW is not actually among them).
3. Pseudo call translation without inputs/outputs. Same as before, but without any info about inputs/outputs. This is easier and faster for me to write, but data-flow analysis has no idea what is changed, which might negatively impact output quality. I still had to manually enable this translation for every instruction that currently supports it.
4. No translation. Every other instruction that was not manually modeled in one of the ways above is not translated at all. I think all instructions from all the sets you mentioned are currently in this category.
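The four modes can be pictured as a dispatch over per-instruction tables. This is a minimal Python sketch, not Capstone2llvmir's actual code: the `Mode` enum, the three tables, the `translate` function, and the IR-like output text (which is deliberately simplified and not valid LLVM IR) are all hypothetical. Placing `pmulhuw` in the inputs/outputs table is also only for illustration.

```python
from enum import Enum


class Mode(Enum):
    FULL_SEMANTICS = 1    # mode 1: real IR instructions
    PSEUDO_CALL_IO = 2    # mode 2: declaration-only call with inputs/outputs
    PSEUDO_CALL_BARE = 3  # mode 3: declaration-only call, no operands
    NONE = 4              # mode 4: instruction is not translated


# Hypothetical hand-maintained tables; the real sets are different.
FULL = {"add": "{dst} = add i32 {dst}, {src}"}
WITH_IO = {"pmulhuw"}
BARE = {"cpuid"}


def mode_for(mnemonic):
    """Pick the translation mode for an instruction mnemonic."""
    if mnemonic in FULL:
        return Mode.FULL_SEMANTICS
    if mnemonic in WITH_IO:
        return Mode.PSEUDO_CALL_IO
    if mnemonic in BARE:
        return Mode.PSEUDO_CALL_BARE
    return Mode.NONE


def translate(mnemonic, dst, src):
    """Return simplified IR-like text for a two-operand instruction, or None."""
    mode = mode_for(mnemonic)
    if mode is Mode.FULL_SEMANTICS:
        return FULL[mnemonic].format(dst=dst, src=src)
    if mode is Mode.PSEUDO_CALL_IO:
        # Data-flow analysis can see that dst and src are read and dst is written.
        return f"{dst} = call @__asm_{mnemonic}({dst}, {src})"
    if mode is Mode.PSEUDO_CALL_BARE:
        # The call marks the instruction, but says nothing about data flow.
        return f"call @__asm_{mnemonic}()"
    return None
```

Only modes 1 and 2 give later analyses any information about which objects are read or written; mode 3 merely keeps the instruction visible in the output.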

What I would like to do next in #115, and discussion about future extensions:

There probably will not be that many new instructions that get full semantic translation (1.). As I was rewriting our old implementation to Capstone2llvmir, I realized that for decompilation purposes, we need to fully translate only those instructions that can be reasonably represented in C. See pages 25/51 and 26/51 in our slides. Decompiling instructions from complex extensions to their true semantics would create totally unreadable code. This was further confirmed by a recent Hex-Rays presentation. They have been doing the same thing with good results (slide 26). They even dropped rotations (slide 21). This does not mean that we will never add any such instruction model. It is nearly certain that there are many instructions we do not have right now that could be easily represented in C. But this is definitely not the case for most/all the extensions you enumerated.
It should not be hard to translate all the currently unhandled instructions (4.) to at least pseudo calls without inputs/outputs (3.). I will simply modify Capstone2llvmir to generate a __asm_<instr_name>() call if there is no translation routine for instruction instr_name.
Capstone's representation usually has something like registers read/written for every disassembled instruction. So it also should not be that difficult to take it a step further and automatically generate pseudo calls with inputs/outputs (2.). How accurate this information is depends on the particular architecture. It is often present, but from my experience I can say that it is by no means complete or 100% reliable. There are two solutions:
- Fix this (add this kind of info) on our end in Capstone2llvmir - easier and faster for us.
- Improve Capstone implementation - slower and harder, but beneficial for all Capstone users.
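The fallback and the automatic inputs/outputs described above could look roughly like the following Python sketch. Everything in it is an assumption for illustration: `Insn` is a stand-in for a disassembled instruction (not Capstone's own type), `ROUTINES` stands for the hand-written translators, and `OVERRIDES` models the first solution, i.e. correcting incomplete register info on the translator's side.

```python
from dataclasses import dataclass, field


@dataclass
class Insn:
    """Hypothetical stand-in for a disassembled instruction."""
    mnemonic: str
    regs_read: list = field(default_factory=list)
    regs_written: list = field(default_factory=list)


# Hypothetical table of hand-written translation routines.
ROUTINES = {
    "mov": lambda insn: f"{insn.regs_written[0]} = {insn.regs_read[0]}",
}

# Hypothetical local fixes for instructions whose register info is missing.
# CPUID reads EAX (and ECX for some leaves) and writes EAX, EBX, ECX, EDX.
OVERRIDES = {
    "cpuid": (["eax", "ecx"], ["eax", "ebx", "ecx", "edx"]),
}


def pseudo_call(insn):
    """Build a pseudo call, with inputs/outputs when they are known."""
    reads, writes = OVERRIDES.get(
        insn.mnemonic, (insn.regs_read, insn.regs_written)
    )
    call = f"__asm_{insn.mnemonic}({', '.join(reads)})"
    if not writes:
        return call
    return f"{', '.join(writes)} = {call}"


def translate(insn):
    """Use a hand-written routine if one exists, else fall back to a pseudo call."""
    routine = ROUTINES.get(insn.mnemonic)
    return routine(insn) if routine else pseudo_call(insn)
```

With this shape, an instruction that has no routine and no register info still produces a bare `__asm_<instr_name>()` call, so nothing is silently dropped.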

Decompilation is one thing, but if someone would like to use the RetDec framework for other purposes, they might want to have semantics for even very complex instructions. Right now, I would say that projects like QEMU or McSema are a better alternative in such a case. However, it might happen that someone will add complex semantics to Capstone2llvmir on their own - we currently have no such plans. This would not be easy, but if good groundwork is prepared, it might not be so bad. After all, someone had to hand-write these things in QEMU as well. Even if this happens, it would not be beneficial for decompilation (as already explained). So we would either have to keep these Capstone2llvmir translators separate, or have it all in one translator but be able to tell it what should and should not be translated - or which mode to use for which instructions.
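The "one translator, selectable mode" option could be expressed as a per-use-case policy. The sketch below is purely hypothetical: the profile names, their layout, and `select_mode` are invented here and do not correspond to any existing RetDec option.

```python
# Hypothetical profiles: a default mode plus per-instruction overrides.
DECOMPILATION = {
    "default": "pseudo_call",
    "overrides": {"add": "full", "sub": "full"},
}
EMULATION = {
    "default": "full",
    "overrides": {},
}


def select_mode(profile, mnemonic, has_full_model):
    """Choose a mode for one instruction under the given profile.

    has_full_model says whether a full semantic model exists at all;
    if it does not, the translator can only emit a pseudo call.
    """
    wanted = profile["overrides"].get(mnemonic, profile["default"])
    if wanted == "full" and not has_full_model:
        return "pseudo_call"
    return wanted
```

Under such a policy, the same complex model for an instruction like `aesenc` would be used when emulating but skipped when decompiling, where it would only hurt readability.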

PeterMatula commented 6 years ago

I will create a wiki page once #115 is solved (todo).

0xBEEEF commented 6 years ago

@PeterMatula Thank you for your very detailed answer! I wasn't aware that such a simple question would entail so much. You just mentioned the two other frameworks that seem to be able to translate executable files to LLVM. I was only aware of QEMU, but not McSema. I find this one very interesting. Would it be possible to combine the two programs with each other? That is, one of the two frameworks would translate a program to LLVM and your program would decompile it? Then you wouldn't have to model all of these complex instructions yourself, but could fall back on existing work. Wouldn't it be an advantage even for you, since you would then have less maintenance effort? McSema still has a lot of dependencies on IDA, it seems. You've already made significant progress. Wouldn't McSema also benefit from your tools, so that there aren't so many dependencies just to translate existing programs to LLVM? Have you ever thought about joining the projects with each other? After all, everyone would ultimately benefit from this.

PeterMatula commented 6 years ago

QEMU is not producing LLVM IR, they have their own intermediate language. However, as I understand it (which is not all that much, so I might be wrong), matters there are even more complicated - they do not model all instructions directly in TCG. More complicated instructions (basically everything you asked about) are modeled as routines in C that get somehow compiled and used - I'm not really sure how. If I'm wrong, and there is someone with more insight reading this, please correct me. I would be interested to know more.

rev.ng project is using QEMU to produce LLVM IR.

McSema is producing LLVM IR. It is also translating some of those extensions you asked about. However, as I understand it, it is more focused on QEMU-like stuff than human-readable decompilation. Again, I might be wrong about this. But like I said, translating these sets is not really beneficial to decompilation output quality. Just look at how McSema handles x86 FPU. There are benefits to this approach if you want to emulate programs, or check them with tools like klee. But C produced from it would look terrible. Moreover, they are using IDA for control flow recovery, so it is questionable whether this is only for convenience, or whether it would be hard/impossible to write a recursive traversal disassembler on top of the LLVM IR they produce. LLVM IR produced by our capstone2llvmir is designed with this in mind.
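For readers unfamiliar with the term, recursive traversal disassembly follows control flow from an entry point instead of decoding bytes linearly. The Python sketch below shows the general technique on a toy instruction table; it is not RetDec's or McSema's implementation, and the `program` layout (address mapped to mnemonic, size, and branch target) is an assumption made for the example.

```python
def recursive_traversal(program, entry):
    """Return the sorted addresses of instructions reachable from entry.

    program maps an address to (mnemonic, size, target), where target is
    the branch/call destination or None. This toy encoding is hypothetical.
    """
    visited = set()
    worklist = [entry]
    while worklist:
        addr = worklist.pop()
        # Follow the fall-through path until it ends or rejoins known code.
        while addr in program and addr not in visited:
            visited.add(addr)
            mnemonic, size, target = program[addr]
            if mnemonic == "ret":
                break                    # path ends here
            if mnemonic == "jmp":
                addr = target            # unconditional: only the target follows
                continue
            if mnemonic in ("jcc", "call"):
                worklist.append(target)  # explore the target later
            addr += size                 # fall through to the next instruction
    return sorted(visited)


# Toy program: addresses 5 and 9 hold data that is never reached.
PROGRAM = {
    0: ("jcc", 2, 6),
    2: ("mov", 2, None),
    4: ("ret", 1, None),
    5: ("db", 1, None),
    6: ("call", 2, 10),
    8: ("ret", 1, None),
    9: ("db", 1, None),
    10: ("jmp", 2, 2),
}
```

The traversal depends on knowing each instruction's branch targets and fall-through behaviour, which is why the shape of the produced IR matters for control flow recovery.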

To conclude:

- As I said, we don't see much benefit in translating these instructions for decompilation. Therefore, there is not much benefit in using QEMU or McSema either.
- We don't really want to use QEMU, or related projects, because of the GPL license. It would infect all of our code base. Also, we don't want to drag such a giant, complex, and often not nicely written thing into our infrastructure.
- You could try to decompile LLVM IR produced by McSema with our llvmir2hll tool. LLVM infrastructure is very modular, so it would even be possible to run our bin2llvmir with selected passes on such LLVM IR.
- Hooking McSema to our entire framework as an alternative LLVM IR producer would be very hard. Even though our tools try to be as agnostic as possible, there is a lot of stuff that would be missing.
- This is not a priority right now, but if we ever get to a point where we want to hook klee (or another such tool) to our LLVM IR, then maybe we would look into the possibility of using McSema models.

0xBEEEF commented 6 years ago

Now it's all clear to me! Thanks again for this detailed answer. If even IDA does not support these instruction sets, then I understand. And you're right about floating point operations!

PeterMatula commented 6 years ago

A few notes on how IDA does this: https://www.hex-rays.com/products/decompiler/manual/intrinsics.shtml

We should look into this and come up with a solution that will let us deal with these instructions without cluttering the output, but in a way that provides enough information on what is going on to RetDec analyses and human users.
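One possible shape for an intrinsic-style output, sketched in Python under stated assumptions: the three entries are the standard Intel C intrinsics for the 128-bit (XMM) forms of these instructions, while the `XMM_INTRINSICS` table, the `render` function, and the fallback format are hypothetical and not something RetDec or IDA is confirmed to do in this way.

```python
# Intel C intrinsic names for the XMM-register forms of a few instructions.
XMM_INTRINSICS = {
    "pmulhuw": "_mm_mulhi_epu16",
    "paddb": "_mm_add_epi8",
    "pxor": "_mm_xor_si128",
}


def render(mnemonic, dst, src):
    """Render a two-operand instruction as a line of C-like output."""
    name = XMM_INTRINSICS.get(mnemonic)
    if name is None:
        # No known intrinsic: keep a generic pseudo call.
        return f"__asm_{mnemonic}({dst}, {src});"
    return f"{dst} = {name}({dst}, {src});"
```

A familiar intrinsic name tells the reader what the instruction does without expanding its full semantics into many lines of C.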

This issue might get solved as part of a bachelor thesis - see milestone.

PeterMatula commented 6 years ago


avast / retdec

How good is the support of command set extensions? (MMX, SSE, SSE2, SSE3...) #193

#115 has been closed. Now we are generating assembly pseudo calls for all unhandled instructions. Further improvements using intrinsics or full semantic models are possible. See https://github.com/avast-tl/retdec/wiki/Capstone2LlvmIr.