c3rb3ru5d3d53c / binlex

A Binary Genetic Traits Lexer Framework
The Unlicense

Decompiler Core Improvements #42

Open c3rb3ru5d3d53c opened 2 years ago

c3rb3ru5d3d53c commented 2 years ago
herrcore commented 2 years ago

Discovery

Before attempting to add additional disassembly passes we need to do a bit of discovery and documentation first. Mainly we want documentation that explains exactly what coverage we expect, how the basic blocks are separated into a control flow graph (CFG), and how the wildcarding is applied. Changes to the coverage, the CFG, or the wildcarding will make all future traits incompatible with previously generated traits, so it is important that these not be changed often and that they be well documented.

General Questions To Answer

Coverage

Test Methodology

In addition to understanding what we expect in coverage, we will also want to generate some test data so we can compare before/after. One possible approach is to add some code that gives us a map (JSON dictionary) of the identified functions and the basic blocks contained within each. This could then be used to generate a "current coverage" data set that could be compared both against other disassembly tools and against our final changes.
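A rough sketch of what that could look like (the CoverageMap type and DumpCoverage function are made-up names for illustration, not existing binlex code):

#include <cstdint>
#include <cstdio>
#include <map>
#include <set>

// Hypothetical coverage map: function start address -> basic block addresses.
typedef std::map<uint64_t, std::set<uint64_t>> CoverageMap;

// Dump the map as JSON so a "current coverage" data set can be diffed
// against other disassemblers or against later changes.
void DumpCoverage(const CoverageMap &coverage) {
    printf("{\"functions\":{");
    bool first_function = true;
    for (const auto &function : coverage) {
        printf("%s\"0x%llx\":[", first_function ? "" : ",", (unsigned long long)function.first);
        bool first_block = true;
        for (uint64_t block : function.second) {
            printf("%s\"0x%llx\"", first_block ? "" : ",", (unsigned long long)block);
            first_block = false;
        }
        printf("]");
        first_function = false;
    }
    printf("}}\n");
}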

Research

The following are some open source implementations of disassemblers with decent coverage.

c3rb3ru5d3d53c commented 2 years ago

Q: What sections of a file are submitted for disassembly? Does this depend on the execution permissions?
A: Right now only sections marked as executable are extracted and submitted for disassembly.
Reference: https://github.com/c3rb3ru5d3d53c/binlex/blob/0f847575eb53ccc158ce33dfc9480a6caf61e2b0/src/pe.cpp#L63

Q: Should the disassembler be responsible for detecting non-native code or should this be handled elsewhere? How do we determine if this is VB6?
A1: Non-native code should be handled with a different decompiler; if it is not supported by capstone we make another decompiler/lexer, which will provide us the functionality we need, similar to cil.cpp. However, this is a lot of extra custom work and likely out of scope for the limited time we have.
A2: This may be out of scope for this GeekWeek; however, we can provide a yara signature to end users.

Q: How is wildcarding applied? Can an argument be added to control this?
A1: Right now it is applied only to memory references; however, I do believe we should also add support for immediates.
A2: Yes, additional code can be added in a specific function to add this functionality, I can point you to it :)
Reference: https://github.com/c3rb3ru5d3d53c/binlex/blob/0f847575eb53ccc158ce33dfc9480a6caf61e2b0/src/decompiler.cpp#L565-L583
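To illustrate the idea (a sketch only; ShouldWildcardOperand and the wildcard_imms flag are hypothetical names, the real logic lives in the decompiler.cpp lines referenced above), the operand check could look something like this using capstone's x86 operand types:

#include <capstone/capstone.h>

// Sketch: decide whether an x86 operand should be wildcarded in a trait.
static bool ShouldWildcardOperand(const cs_x86_op *op, bool wildcard_imms) {
    switch (op->type) {
        case X86_OP_MEM:
            return true;          // memory references are already wildcarded today
        case X86_OP_IMM:
            return wildcard_imms; // proposed: optionally wildcard immediates too
        default:
            return false;
    }
}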

Q: Is it possible to specify new disassembly start points? Possibly extend this to allow for PDB data later?
A1: Yes, the initial disassembly starting points are all based on a C++ queue, and this queue is seeded by the parser for each file format. Collecting DLL or shared library functions is already started in the code.
A2: It is possible to extend this for PDB data later; however, I think it may be out of scope for this GeekWeek.
Reference: https://github.com/c3rb3ru5d3d53c/binlex/blob/0f847575eb53ccc158ce33dfc9480a6caf61e2b0/src/pe.cpp#L59-L84

Q: What sort of coverage is provided for compiled languages that result in non-referenced functions (i.e. vtables)?
A: Currently we do not have any code that specifically looks for vtables. It could certainly be additional logic added to the initial parsing of the file, which would then populate more offsets with their associated sections to fill the decompile queue.
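A rough sketch of that idea (the function name, parameters, and pointer width are assumptions for illustration, not existing binlex API): scan data sections for pointer-sized values that land inside an executable section and push them onto the decompile queue as candidate code addresses:

#include <cstdint>
#include <cstring>
#include <queue>
#include <vector>

// Hypothetical: scan a data section for 32-bit values that point into the
// executable range and queue them as candidate code addresses (vtable
// entries, callbacks). The disassembler would still need to validate them.
void QueueDataPointers(const std::vector<uint8_t> &data,
                       uint64_t exec_start, uint64_t exec_end,
                       std::queue<uint64_t> &discovered) {
    for (size_t i = 0; i + sizeof(uint32_t) <= data.size(); i += sizeof(uint32_t)) {
        uint32_t value;
        memcpy(&value, &data[i], sizeof(value));
        if (value >= exec_start && value < exec_end) {
            discovered.push(value);
        }
    }
}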

herrcore commented 2 years ago

Basic blocks are not normalized (they overlap):

Simple example: pe.emotet.x86

BB at offset 121979 overlaps with BB at offset 120533. We don't need to normalize this, but it might be nice to have a flag to make it an option, as it will be a large performance hit.
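If we do add that flag, a minimal sketch of the normalization (assuming blocks are tracked as start address -> end address, which is an assumption about the data layout rather than how binlex stores them) could be:

#include <cstdint>
#include <iterator>
#include <map>

// Hypothetical normalization pass: if another block starts inside a block,
// truncate the enclosing block at that address so blocks no longer overlap.
void NormalizeBlocks(std::map<uint64_t, uint64_t> &blocks) {
    for (auto it = blocks.begin(); it != blocks.end(); ++it) {
        auto next = std::next(it);
        if (next != blocks.end() && next->first < it->second) {
            it->second = next->first; // split at the inner block's start address
        }
    }
}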

herrcore commented 2 years ago

Added the pe and elf entry points to the functions to be processed, instead of just pushing the address of the start of the function: b86a709cb83490064629b8c32b29999f455602ed

Improved coverage for pe.x86

Function coverage: 1.24% ( Missing 476 from total 482 ) -> 54.98% ( Missing 217 from total 482 )
Basic block coverage: 1.68% ( Missing 2979 from total 3030 ) -> 63.17% ( Missing 1116 from total 3030 )

c3rb3ru5d3d53c commented 2 years ago

Leaving this one open, just changing the title to Improvements so we can track it.

c3rb3ru5d3d53c commented 2 years ago

Having a look into this now, hoping to make some improvements to the linear disassembler pass. Reading the paper; it may be important to note that the most useful reading is in the implementation section :)

c3rb3ru5d3d53c commented 2 years ago

[image]

// collect call instructions only once at least one valid block has been seen
if (IsCallInsn(cs_ins) == true && valid_block_count >= 1){
    CollectInsn(cs_ins, sections, index);
}

I got some improvement in coverage here doing this, not much but some.

Continuing to read the implementation.

c3rb3ru5d3d53c commented 2 years ago

[image]

This one appears to be done and implemented as the method IsSemanticNopInsn.

c3rb3ru5d3d53c commented 2 years ago

[image]

We would need to implement an IsReachable method for this.
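A minimal sketch of what an IsReachable method could look like (the edge map and entry-point set are assumptions about how the collected CFG data might be exposed, not existing binlex structures):

#include <cstdint>
#include <map>
#include <queue>
#include <set>
#include <vector>

// Hypothetical IsReachable: breadth-first walk of the collected CFG edges
// from the known entry points, checking whether the target block is reached.
bool IsReachable(uint64_t target,
                 const std::set<uint64_t> &entry_points,
                 const std::map<uint64_t, std::vector<uint64_t>> &edges) {
    std::set<uint64_t> visited;
    std::queue<uint64_t> worklist;
    for (uint64_t entry : entry_points) worklist.push(entry);
    while (!worklist.empty()) {
        uint64_t address = worklist.front();
        worklist.pop();
        if (address == target) return true;
        if (!visited.insert(address).second) continue; // already expanded
        auto it = edges.find(address);
        if (it == edges.end()) continue;
        for (uint64_t successor : it->second) worklist.push(successor);
    }
    return false;
}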

c3rb3ru5d3d53c commented 2 years ago

I created disassembler.cpp and disassembler.h; this should ensure there is no confusion that we are doing disassembly and not decompilation. Once ready, we should be able to replace the decompiler.cpp and decompiler.h files with these.

I've also made an improvement where the CollectOperands method now adds the next instruction to the queue as a block.

Function coverage: 49.59%  ( Missing 243 from total 482 )
Basic block coverage: 62.08%  ( Missing 1149 from total 3030 )
Basic block errors: 0
Extra blocks from binlex: 1619

Still having a decent number of false positives.

In the linear pass, jmp_address_1 and jmp_address_2 should also track their next instruction address and add it to the queue; I suspect this is why we are still missing basic blocks.
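The idea, roughly (a sketch only; the method name and call site are hypothetical, AddDiscoveredBlock is the existing helper shown further down in this thread, and X86_REL_ADDR is capstone's macro for resolving relative branch targets):

// Sketch: when the linear pass hits a conditional jump, queue both the jump
// target and the fall-through address so the block after the jump is kept.
void Disassembler::QueueConditionalJump(cs_insn *cs_ins, struct Section *sections, uint index) {
    uint64_t jmp_address = X86_REL_ADDR(*cs_ins);           // relative jump target (capstone macro)
    uint64_t fallthrough = cs_ins->address + cs_ins->size;  // address of the next instruction
    AddDiscoveredBlock(jmp_address, sections, index);
    AddDiscoveredBlock(fallthrough, sections, index);
}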

c3rb3ru5d3d53c commented 2 years ago

Added the method IsInvalidNopInsn, as in Nucleus, because Visual Studio sometimes places semantic nops at the start of a function and uses int3 for padding. To support this, I added the enum BINARY_TYPE. I also changed the places where we were using typedef enum to take the appropriate enum as parameters.

bool Disassembler::IsInvalidNopInsn(cs_insn *insn){
    // plain nops are always padding; semantic nops count as padding except in
    // PE files, where Visual Studio can emit them at a function start; int3
    // traps count as padding only in PE files, where Visual Studio uses them
    return IsNopInsn(insn) ||
        (IsSemanticNopInsn(insn) && (file_reference.binary_type != BINARY_TYPE_PE)) ||
        (IsTrapInsn(insn) && (file_reference.binary_type == BINARY_TYPE_PE));
}

We still get false positives with this, so I'm continuing to investigate.

c3rb3ru5d3d53c commented 2 years ago

Well we are here now:

Function coverage: 50.41%  ( Missing 239 from total 482 )
Basic block coverage: 61.75%  ( Missing 1159 from total 3030 )
Basic block errors: 0
Extra blocks from binlex: 595

False positives are down a lot now, after removing the extra else statement from LinearDisassemble.

c3rb3ru5d3d53c commented 2 years ago

I added a check for IsFunction as part of AddDiscoveredBlock; since the collection of traits will already collect blocks from functions, it makes no sense to have duplicates between blocks and functions. This should result in a small performance gain.

void Disassembler::AddDiscoveredBlock(uint64_t address, struct Section *sections, uint index) {
    // skip addresses that were already visited, fall outside the section data,
    // or are already known function starts (those blocks are collected with the function)
    if (IsVisited(sections[index].visited, address) == false &&
        address < sections[index].data_size &&
        IsFunction(sections[index].functions, address) == false) {
        // record the block once, mark it as queued, and push it for disassembly
        if (sections[index].blocks.insert(address).second == true){
            sections[index].visited[address] = DISASSEMBLER_VISITED_QUEUED;
            sections[index].discovered.push(address);
        }
    }
}

c3rb3ru5d3d53c commented 2 years ago

[image]

I'll start looking at detecting switch jump tables. We can probably add this to the CollectOperands method, so instead of doing multiple passes we should be able to make it part of our continual collection of blocks and functions.
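As a starting point for the pattern check (sketch only; the helper name is hypothetical, and reading the table entries out of the section data and queueing them would still be needed):

#include <capstone/capstone.h>

// Hypothetical check that an indirect jmp looks like a jump-table dispatch,
// e.g. jmp [table + reg*4]; the table entries would then be read from the
// section data and queued as block starts.
bool IsLikelyJumpTableJmp(cs_insn *insn) {
    if (insn->id != X86_INS_JMP) return false;
    cs_x86 *x86 = &insn->detail->x86;
    if (x86->op_count != 1) return false;
    cs_x86_op *op = &x86->operands[0];
    return op->type == X86_OP_MEM &&
           op->mem.index != X86_REG_INVALID &&          // scaled index register present
           (op->mem.scale == 4 || op->mem.scale == 8) &&
           op->mem.disp != 0;                           // displacement points at the table
}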

jbx81-1337 commented 2 years ago

Apply the wildcards of already-known functions to other blocks: after we extract all the "easy" functions, we can generate the epilogue/prologue wildcards for these and apply them to all the other blocks. If we find a match we can tag a block as "epilogue" or "prologue" and use it to identify other functions.
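A minimal sketch of that matching step (the pattern representation here is an assumption for illustration, not binlex's trait format): mark which byte positions of a known prologue/epilogue are wildcarded and compare the non-wildcarded bytes against the start of each candidate block.

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wildcard pattern: byte values plus a per-position wildcard mask.
struct WildcardPattern {
    std::vector<uint8_t> bytes;  // expected bytes from a known prologue/epilogue
    std::vector<bool> wildcard;  // true = position is wildcarded ("??")
};

// Tag a block if its first bytes match the pattern at every non-wildcarded position.
bool MatchesPattern(const uint8_t *block, size_t block_size,
                    const WildcardPattern &pattern) {
    if (block_size < pattern.bytes.size()) return false;
    for (size_t i = 0; i < pattern.bytes.size(); i++) {
        if (!pattern.wildcard[i] && block[i] != pattern.bytes[i]) return false;
    }
    return true;
}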