Utils: Write a Dictionary Encoder

After lifting our binary to MLIL instructions, before we can pass it to our parser, we need to do some preprocessing. Instructions produced from MLIL can contain many tokens which may make parsing difficult and are not needed for analysis. The goal of the encoder is to write a dictionary coder which takes a MediumLevelILOperation.MLIL_IF instruction and preprocesses it by encoding it to a representation that can be more easily parsed by our parser into an AST. An example encoding is shown below.

if ([ebp_1 + 0x14].b == 0 || [ebp_1 + 0x14].b != 0) then 387 @ 0x8040da8 else 388 @ 0x8040d8b -> A | !A

Notice a couple things about this example. One, each component of a boolean statement is encoded to a unique symbol (i.e. every generated symbol should correspond to an expression that can be evaluated to true or false). Two, duplicate expressions in the above example should reuse the same symbol. This is true not only for booleans within the same expression but also across multiple statements (i.e. some expressions may be used throughout multiple if statements within an obfuscated program). Three, encode the NOT expression outside of the statement. This is most likely one of the trickier parts of the encoder but the idea is to improve symbol reuse and make the encoded output more useful for the rest of our analysis.

Another quick note is to simplify our parser design, we will have all boolean symbols be one character. The supported operations are shown below.

AND : &
OR : |
NOT : !
grouping : ( and )

Action Items

Do all of the following inside utils/coding.py.

[x] Write a class, NameGenerator, with a constructor and a method, generateName which generates a new unique name. For our purposes, names will only contain characters belonging to string.ascii_uppercase (i.e. "A", "AB", "CAB" are all valid names). It may be useful to try and follow the factory design pattern (although this may not be strictly necessary). Some tips may involve figuring out some way to keep track of existing names (one suggestion is a count of the number of previously generated names).
[x] Write a dataclass for a boolean, Boolean. This should contain fields for information such as the raw string (e.g. if ([ebp_1 + 0x14].b == 0) then 387 @ 0x8040da8 else 388 @ 0x8040d8b) and the encoded form (e.g. A && B || C). This will be useful since internally for the dictionary encoder, you will most likely need to contain a mapping from the
[x] Write a DictionaryEncoder class with a constructor, an accessor method for an internal mapping of the encoded values, and a method encode which takes in a string from a MediumLevelILOperation.MLIL_IF instruction and returns another string which represents the encoded form of the boolean expression.
[x] Inside your encode function, make sure to preprocess the input (e.g. remove the if, then, else and the parenthesis). Ensure that you tackle all of the above mentioned edge cases. One hint is that the internal mapping should be a Python dictionary and the Boolean dataclass you wrote. The key can be the conditional of the raw string (e.g. for the raw string if ([ebp_1 + 0x14].b == 0) then 387 @ 0x8040da8 else 388 @ 0x8040d8b, the key in the mapping can be just [ebp_1 + 0x14].b == 0 for the encoded value of A). Remember that potentially, some boolean values may be reused multiple times within an expression or within a later boolean statement so you should check your dictionary to see if a mapping exists before generating a new symbol from the NameGenerator.
[x] Write some test cases for the above classes you wrote using the the builtin unittest Python module. This is a good practice for test-driven development (TDD). Create a new file called tests/test_coding.py which contains these unit tests.

Resources

Python Dictionaries: The main data structure that you will be using for this component is a dictionary which will act as a mapping between the raw and encoded values.

I can add more resources here as necessary so just add some questions depending on what you might need some resources to tackle this.

bliutech / mbased

Utils: Write a Dictionary Encoder #9

Action Items

Resources