AlgoProphet dev issues - Githubissues

shinmao commented 2 years ago

Introduction

Work in progress

Samples

DFT (Discrete Fourier Transform): a method that converts a sequence of complex numbers into a new sequence of complex numbers of the same length. Two clues can be used to identify this algorithm:
The first clue is the use of Euler's formula.
The second clue is the presence of complex numbers (which can be implemented with a struct).
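To make the two clues concrete, here is a minimal sketch of a naive DFT. It is only an illustration, not one of the AlgoProphet samples: the `Complex` dataclass and the `dft` function are hypothetical names, with `Complex` standing in for the struct a compiled program might use. Both clues are visible: the cos/sin pair from Euler's formula, and the real/imaginary fields being updated together.

```python
import math
from dataclasses import dataclass


@dataclass
class Complex:
    """Minimal complex number; a stand-in for the struct seen in a binary."""
    re: float
    im: float


def dft(samples):
    """Naive O(N^2) DFT: X[k] = sum over n of x[n] * e^(-2*pi*i*k*n/N)."""
    n_samples = len(samples)
    result = []
    for k in range(n_samples):
        acc = Complex(0.0, 0.0)
        for n, x in enumerate(samples):
            angle = -2.0 * math.pi * k * n / n_samples
            # Euler's formula: e^(i*angle) = cos(angle) + i*sin(angle)
            c, s = math.cos(angle), math.sin(angle)
            # Complex multiplication x * (c + i*s), accumulated into acc
            acc.re += x.re * c - x.im * s
            acc.im += x.re * s + x.im * c
        result.append(acc)
    return result
```

In a binary, the inner loop usually shows up as a paired sin/cos call (or a single sincos call) followed by four multiplications and two additions or subtractions per sample.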

shinmao commented 2 years ago

Literature review

Summaries of and feedback on related literature.
Some papers address algorithm recovery; others address type information recovery.

Recovery of High-Level IR of Algorithms from Binary code

Because existing IRs make it hard for users to figure out the algorithm, the authors designed a hierarchical high-level representation to help discover undocumented features of a program.

They only consider the relationship between the input and the result buffer. Whether that is enough depends on the case; the input is the only sensitive data the user can control.
They use dynamic analysis to collect runtime information such as indirect call targets.
They compute a backward slice from the result buffer.
Conditional and loop information is not shown in the representation.
Data type information is not shown in the representation.
This paper wants to generate a representation that is both human- and analysis-friendly. But I think the CFG format is not friendly for humans; humans are more familiar with a "linear form".
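As a reminder of what a backward slice from the result buffer means, here is a minimal sketch over a simplified def-use table. This is not the paper's implementation; the `defs` table, the variable names, and `backward_slice` are all hypothetical.

```python
def backward_slice(defs, target):
    """Return every variable that `target` transitively depends on.

    `defs` maps a variable to the variables used to compute it.
    Variables with no entry (e.g. inputs) are treated as leaves.
    """
    seen = set()
    worklist = [target]
    while worklist:
        var = worklist.pop()
        if var in seen:
            continue
        seen.add(var)
        worklist.extend(defs.get(var, []))
    return seen


# Hypothetical def-use table: "unrelated" never reaches "result".
defs = {
    "result": ["t2"],
    "t2": ["t1", "input"],
    "t1": ["input"],
    "unrelated": ["other"],
}
slice_vars = backward_slice(defs, "result")
```

Everything outside `slice_vars` can be dropped from the representation, which is how the slice narrows the analysis to the input/result relationship.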

OSPREY (S&P21)

Recovers variables and data structures from stripped binaries with probabilistic analysis, which means they synthesize a large collection of hints to guess the information. (I would like to create another comment to collect some useful hints, thanks to OSPREY.)

DIRTY (USENIX22)

Augments decompiler output with learned variable names and types: it recovers not only types but also developer-friendly names.
Input: decompiled function tokens (from IDA). Output: recommended types and names for all variables in the function.

// Transformer-based NN model
Code encoder: encodes each code piece, including operands and operators.
Data layout encoder: encodes location (register or stack), offset, and size; it is used to filter out impossible prediction results.

Working at the decompiled level helps compatibility across optimization levels (e.g., whether a register is used for access, or how an array is indexed).
Most recent papers about decompilation work on IDA.
Still limited to the training set (so how can it generalize to unseen types?).
It predicts structs better when focusing only on them, but worse when working on all types.
Online testing platform

shinmao commented 2 years ago

8/16 working progress

Currently there are more than 10 models in AlgoProphet, but all of them were generated manually. Some algorithms, such as the Fourier transform, might be difficult to identify with isomorphism alone; therefore, we might need to change the matching algorithm.
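For context on why exact isomorphism can be brittle, here is a minimal brute-force sketch of matching two labeled data-flow graphs. It is not AlgoProphet's matching code; the graph encoding (a dict of node labels plus a set of edges) and the function name are assumptions made for illustration.

```python
from itertools import permutations


def is_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force isomorphism check for labeled directed graphs.

    nodes_*: dict of node id -> operation label (e.g. "add", "mul")
    edges_*: set of (src_id, dst_id) pairs
    """
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    ids_a = list(nodes_a)
    for perm in permutations(nodes_b):
        mapping = dict(zip(ids_a, perm))
        # Operation labels must agree under the candidate mapping.
        if any(nodes_a[a] != nodes_b[b] for a, b in mapping.items()):
            continue
        # Every edge must map onto an edge of the other graph.
        if {(mapping[s], mapping[d]) for s, d in edges_a} == set(edges_b):
            return True
    return False


# Hypothetical model (mul feeding add) and a candidate from a binary.
model_nodes = {0: "mul", 1: "add"}
model_edges = {(0, 1)}
cand_nodes = {"n7": "add", "n3": "mul"}
cand_edges = {("n3", "n7")}
matched = is_isomorphic(model_nodes, model_edges, cand_nodes, cand_edges)
```

The check is all-or-nothing: one extra or reordered operation in the candidate makes it fail, and the permutation search grows factorially with graph size, so it is only suitable for tiny illustrative graphs.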

Next plans

[x] Define a recursive function to get the base type of a pointer
[x] Filter out the DFG only from input parameters
    SSA variables used within loops are also shown in the graph
[x] Figure out how to generate the DFG automatically
    Get the user's specified instruction and filter out the DFG
    Might need to clean the graph each time within a single session
[x] Handle sincos cases
    This would be required to identify the DFT algorithm
[x] Test models on the amd64 architecture
[ ] Isomorphism might make the problem complex; figure out an ML/DL model to do the matching work
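The first item in the list above (getting the base type of a pointer) can be sketched as follows. This is a minimal sketch with a hypothetical `TypeInfo` class standing in for the real type objects; it deliberately does not use the Binary Ninja API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TypeInfo:
    """Hypothetical stand-in for a type object; not the Binary Ninja API."""
    name: str
    target: Optional["TypeInfo"] = None  # set only for pointer types


def base_type(t):
    """Recursively strip pointer levels and return the underlying base type."""
    if t.target is None:
        return t
    return base_type(t.target)


# float** resolves to float
float_pp = TypeInfo("float**", TypeInfo("float*", TypeInfo("float")))
resolved = base_type(float_pp)
```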

shinmao commented 2 years ago

8/30 - 9/13 working progress

[x] Right-click command to generate models based on consecutive instructions
    In this screenshot, we can highlight consecutive instructions (only the data flow used in the highlighted instructions is considered) and build a model based on them
[x] Right-click to match existing models in a single function
    Single-function version of "match algos" in the command palette
    In this screenshot, we can click anywhere in the function to match it against existing models
[x] Right-click on SSA variables or constants to adjust models
    We can use UIActionContext to capture the selected variables
    The right-click menu needs PluginCommand, but PluginCommand cannot get UIActionContext
    Solution: we can also directly import UIContext to get UIActionContext
    Adding attributes of the related operation in the DFG? No! Due to normalization, the operation node might change until the final graph is generated
    Solution: do a graph traversal to find the closest operation node
    Challenge: the selected token sometimes doesn't appear in the graph view, so we might need to track data flow
    In the screenshot, we right-click on x0#3 and can remove its related operation
[x] Rename variables after matching models
    Will need to add the attribute "output" to the node
    Label the nodes with zero out-degree
    The idx of the nodes should be the instruction with the left values of the formula
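Two of the graph steps in the list above, finding the closest operation node and labeling nodes with zero out-degree as outputs, can be sketched as follows. This is only an illustration under assumptions: the graph encoding, the node kinds ("var", "const", "op"), and the choice to ignore edge direction during the search are all hypothetical, not the plugin's actual code.

```python
from collections import deque


def output_nodes(nodes, edges):
    """Nodes with zero out-degree: candidates for the "output" attribute."""
    has_successor = {src for src, _ in edges}
    return [n for n in nodes if n not in has_successor]


def closest_operation(start, nodes, edges):
    """Breadth-first search, ignoring edge direction, for the nearest "op" node."""
    neighbors = {n: [] for n in nodes}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)
    queue, seen = deque([start]), {start}
    while queue:
        node = queue.popleft()
        if nodes[node] == "op":
            return node
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


# Hypothetical DFG: x0#3 and c1 feed a multiply, which feeds an add.
nodes = {"x0#3": "var", "c1": "const", "mul": "op", "add": "op"}
edges = [("x0#3", "mul"), ("c1", "mul"), ("mul", "add")]
outputs = output_nodes(nodes, edges)
nearest = closest_operation("x0#3", nodes, edges)
```

Searching from the selected token instead of storing attributes on the operation node avoids depending on node identities that normalization may change.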

shinmao commented 2 years ago

Hypotheses

This month, we are exploring and developing a friendlier interface for users to generate and adjust their models. To generate a model, users select with the mouse the consecutive instructions in the BinaryView that they consider important for the algorithm. After generating a model, users can adjust it by interacting with the BinaryView. To make the graph matching algorithm, which finds existing algorithms in binaries, more efficient, users can right-click in the BinaryView to prune operation nodes, SSAVariables, or constants from previously generated models. Compared with existing methodologies that use function signatures to match algorithms, our method provides more flexibility and possibilities. It also makes it easier for users to figure out why their models don’t work in some cases and to adjust them interactively.

Plans for Next Month

Try more cases and test on different architectures or optimization levels
Fix the sin and cos functions and try matching Euler's formula

galenbwill commented 2 years ago

8/30 - 9/13 working progress

[x] Rename variables after matching models
    Will need to add the attribute "output" to the node
    Label the nodes with zero out-degree
    The idx of the nodes should be the instruction with the left values of the formula

I think renaming variables should not be automatically applied -- instead wrap it in commands:

"AlgoProphet -- Rename all matched variables". Global command that does not appear in the right-click menu.
"AlgoProphet -- Rename matched variables". Per-function command that only applies to current function, and does appear in the right-click menu.

Vector35 / AlgoProphet

AlgoProphet dev issues #1
