Closed PAX523 closed 6 years ago
The call instruction E8 05000000 is the problem: https://stackoverflow.com/questions/10376787/need-help-understanding-e8-asm-call-instruction-x86
It jumps over the inline string.
0xC 0x16
6A00 E805000000 7465787400 E805000000 7465787400 6A00 FF 15 3C 20 40 00
push 0 call 0xC "text" call 0x16 "text" push 0 ...
This is reasonable because the final invocation of MessageBoxA expects 4 arguments on the stack. The call instruction jumps over the inline string area and pushes the address of the byte following itself onto the stack. That is exactly where the inline string starts. Later, MessageBoxA reads that address from the stack as one of its arguments. This is a very tricky and fascinating approach by Flat Assembler.
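To make the arithmetic concrete, here is a minimal sketch in Python that decodes the two E8 calls from the byte sequence above. It assumes offsets are relative to the start of the code section; the helper names `call_rel32` and `c_string` are made up for this illustration and are not part of Capstone.

```python
import struct

# Bytes quoted in this thread; offset 0 is assumed to be the start of the code section.
CODE = bytes.fromhex("6A00E8050000007465787400E80500000074657874006A00FF153C204000")

def call_rel32(code, offset):
    """Decode an E8 rel32 call at `offset`; return (return_address, target)."""
    assert code[offset] == 0xE8, "not a near relative call"
    (rel32,) = struct.unpack_from("<i", code, offset + 1)  # signed little-endian displacement
    return_address = offset + 5       # byte right after the call; this is what gets pushed
    target = return_address + rel32   # where execution continues
    return return_address, target

def c_string(code, offset):
    """Read the NUL-terminated ASCII string starting at `offset`."""
    end = code.index(b"\x00", offset)
    return code[offset:end].decode("ascii")

for call_offset in (0x02, 0x0C):
    pushed, target = call_rel32(CODE, call_offset)
    print(f"call at {call_offset:#x}: pushes {pushed:#x} ({c_string(CODE, pushed)!r}), continues at {target:#x}")
```

The pushed return address is the string pointer, and the call target is simply the next real instruction, which matches the 0xC and 0x16 labels above.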
The disassembling result seems not to be reasonable
It looks like you expect to get a meaningful assembly out of Capstone for the input that intermixes executable code and program data.
I'm afraid you're out of luck with Capstone because it's merely a "flat" disassembling library that will stupidly turn any sequence of bytes you give it into assembly, assuming there is a mapping between that sequence of bytes and a command in the specified target CPU architecture.
The correct labeling of the bytes in your precise example as code/strings would require a virtual understanding of the underlying program, in other words, it's the subject of AI.
It's worth mentioning that resolving the data/code ambiguity, typical of the von Neumann architecture, is considered a hard problem. It makes static program analysis very difficult. That's why some advanced tools like IDA Pro use a bag of tricks (i.e. undocumented heuristics) to label sequences of bytes. And because this does not always work well, interactive features are offered so humans can step in and assist the disassembler. A fully automatic solution doesn't exist and most likely never will...
Addendum: your precise example makes use of a machine code idiom that replaces the following code:
push pText1; // pointer to a string stored in the data section
push pText2; // pointer to another string stored in the data section
call MessageBoxA
Recognition of such idioms is another hard problem. I haven't seen any disassembler that attempts it, because such a task is usually reserved for decompilers. Moreover, as far as I know, no public effort to collect and describe the various compiler idioms has been made so far...
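For this one idiom, a client-side heuristic could look roughly like the sketch below. It scans for an E8 call whose skipped bytes form a printable, NUL-terminated ASCII string. The function name `find_inline_strings` and the printable-ASCII rule are assumptions made for this illustration; a plain byte scan like this can produce false positives on real binaries.

```python
import struct

# Bytes quoted in this thread; offset 0 is assumed to be the start of the code section.
CODE = bytes.fromhex("6A00E8050000007465787400E80500000074657874006A00FF153C204000")

def find_inline_strings(code, min_len=1):
    """Yield (call_offset, string, resume_offset) for each call that skips a C string."""
    offset = 0
    while offset + 5 <= len(code):
        if code[offset] == 0xE8:
            (rel32,) = struct.unpack_from("<i", code, offset + 1)
            start = offset + 5            # first byte after the call
            end = start + rel32           # call target
            skipped = code[start:end] if 0 < rel32 and end <= len(code) else b""
            body = skipped[:-1]           # candidate string without its terminator
            if (len(body) >= min_len and skipped.endswith(b"\x00")
                    and all(0x20 <= b < 0x7F for b in body)):
                yield offset, body.decode("ascii"), end
                offset = end              # resume scanning at the call target
                continue
        offset += 1                       # heuristic: no match, try the next byte

for call_offset, text, resume in find_inline_strings(CODE):
    print(f"{call_offset:#x}: call over inline string {text!r}, code resumes at {resume:#x}")
```

A client could use the reported ranges to feed only the non-string bytes to the disassembler.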
I am with you. I have looked into this and similar issues and came to the conclusion that there are unlimited possibilities to confuse static disassemblers this way. To cope with it, you have to put additional analysis effort into helping the disassembler, but you can never be sure that the result is absolutely correct.
I believe it's not worth assigning these analysis tasks to a static disassembler such as Capstone. Capstone should stay as fast as possible. All additional logic must be done by the client that uses Capstone. This seems to be time-consuming, and it requires specific thresholds to limit the effort, which in turn decreases the accuracy.
Nevertheless, none of this is Capstone's business. But this discussion was necessary because at any time there may be other users of Capstone who wonder why some disassembly results aren't correct. Now the reason is documented here.
If you want better disassembly results out of the box then you need a dynamic disassembler which simulates instruction execution and follows the program flow.
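As a rough illustration of "following the program flow", here is a minimal Python sketch that traces the bytes from this thread. It is not an emulator: it models no stack or registers, decodes only the three opcodes that occur in the example, and assumes the indirect call (to MessageBoxA) returns to the next instruction. The function name `trace` is made up for this sketch.

```python
import struct

# Bytes quoted in this thread; offset 0 is assumed to be the entry point.
CODE = bytes.fromhex("6A00E8050000007465787400E80500000074657874006A00FF153C204000")

def trace(code, entry=0):
    """Follow control flow from `entry`; return (listing, offsets reached as code)."""
    listing, reached, pc = [], set(), entry
    while 0 <= pc < len(code) and pc not in reached:
        if code[pc] == 0x6A and pc + 2 <= len(code):                  # push imm8
            length, text, next_pc = 2, f"push {code[pc + 1]:#x}", pc + 2
        elif code[pc] == 0xE8 and pc + 5 <= len(code):                # call rel32
            (rel32,) = struct.unpack_from("<i", code, pc + 1)
            next_pc = pc + 5 + rel32                                   # continue at the target
            length, text = 5, f"call {next_pc:#x}"
        elif code[pc:pc + 2] == b"\xFF\x15" and pc + 6 <= len(code):  # call dword ptr [disp32]
            (addr,) = struct.unpack_from("<I", code, pc + 2)
            length, text, next_pc = 6, f"call dword ptr [{addr:#x}]", pc + 6
        else:
            break                                                      # opcode outside this sketch
        reached.update(range(pc, pc + length))
        listing.append((pc, text))
        pc = next_pc
    return listing, reached

listing, reached = trace(CODE)
for offset, text in listing:
    print(f"{offset:#04x}  {text}")
data = sorted(set(range(len(CODE))) - reached)
print("never executed (likely data):", [hex(o) for o in data])
```

Bytes that are never reached as code (here the two inline strings) can then be labeled as data instead of being decoded linearly.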
If you want better disassembly results out of the box then you need a dynamic disassembler which simulates instruction execution and follows the program flow.
Yes, but such a dynamic disassembler is extremely hard to find. So far, I have never seen anything working like this in the real world. Feel free to correct me...
An alternative approach utilized nowadays is to use the static analysis augmented with some common heuristics like recognition of well-known code patterns (function prolog/epilog, switch constructs, vtables, runtime libraries etc). It's capable of achieving pretty good results that can be verified and refined by the user. Unfortunately, tools with these capabilities are rare and mostly proprietary...
What about Unicorn Engine?
You can invest such effort into static analysis as long as you assume that you are processing "usual" machine code sequences. But what if you want to disassemble opcodes from malicious executables that intentionally try to confuse your disassembling tool? In my opinion, it's "tilting at windmills": your efforts can never cover all cases.
What about Unicorn Engine?
Yeah, it's great as long as the CPU architecture you need is supported. If it isn't, you're hosed. Like me, because I need PowerPC. Despite being announced in 2015 and available in QEMU, the PowerPC arch still remains unsupported in Unicorn, see issue 348. 😒
R2 (radare2) supports PowerPC emulation with ESIL, and uses Capstone as a disassembler.
On 19 Apr 2018, at 17:15, Maxim Poliakovski notifications@github.com wrote:
What about Unicorn Engine?
Yeah, it's great until it supports the CPU architecture you need. If it doesn't, you're hosed. Like me because I need PowerPC. Despite being announced in 2015 and available in QEmu, the PowerPC arch remains unsupported in Unicorn, see issue 348. 😒
I created a simple PE with Flat Assembler which inlines the strings into code section:
The assembling result is:
7465787400 seems to be the strings. The disassembling result seems not to be reasonable:
Even OllyDbg fails. But the PE is executed correctly.
AddressOfEntryPoint is equal to BaseOfCode in IMAGE_OPTIONAL_HEADER.