Just a quick review of some GNN background. https://www.youtube.com/watch?v=eybCCtNKwzA
Given features (Hi) from all layers:
Aggregation: mean, max pool, or LSTM;
LSTM is inherently order-sensitive, so to approximate order invariance, we feed the sampled neighbors to it in a random order.
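A minimal sketch of these aggregators in plain NumPy (function names are mine; the learned transforms and the LSTM itself are omitted, only the shuffling trick is shown):

```python
import numpy as np

def aggregate(neighbor_feats, how="mean", rng=None):
    """Combine neighbor feature vectors into one (GraphSAGE-style).

    neighbor_feats: array of shape (num_neighbors, dim).
    """
    if how == "mean":
        return neighbor_feats.mean(axis=0)
    if how == "maxpool":
        # element-wise max (a learned per-neighbor transform would precede this)
        return neighbor_feats.max(axis=0)
    if how == "lstm":
        # LSTM is order-sensitive: shuffle neighbors to approximate
        # permutation invariance, then feed them to an LSTM (not shown)
        rng = rng or np.random.default_rng(0)
        return neighbor_feats[rng.permutation(len(neighbor_feats))]
    raise ValueError(how)

h = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
print(aggregate(h, "mean"))     # [3. 2.]
print(aggregate(h, "maxpool"))  # [5. 4.]
```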
Apply attention over neighbors;
Calculate an "energy" e_ij for each pair of neighboring nodes (a directed edge weight); after normalization, this energy is used as the weight when combining the features of the node itself and its neighbors (possibly from different layers).
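The energy/weighting step can be sketched GAT-style in NumPy (my own simplified single-head version; `W` and `a` stand in for learned parameters):

```python
import numpy as np

def energy(h_i, h_j, W, a):
    """GAT-style energy for directed edge i->j:
    e_ij = LeakyReLU(a^T [W h_i || W h_j])."""
    z = np.concatenate([W @ h_i, W @ h_j])
    e = a @ z
    return np.where(e > 0, e, 0.2 * e)  # LeakyReLU

def attend(h_self, h_neighbors, W, a):
    # energies over self + neighbors, softmax-normalized into weights
    nodes = [h_self] + list(h_neighbors)
    e = np.array([energy(h_self, h, W, a) for h in nodes])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # weighted combination of transformed features
    return sum(w * (W @ h) for w, h in zip(alpha, nodes))

rng = np.random.default_rng(0)
W, a = rng.standard_normal((2, 2)), rng.standard_normal(4)
out = attend(rng.standard_normal(2), rng.standard_normal((3, 2)), W, a)
print(out.shape)  # (2,)
```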
- We show that this leads to improved performance over similar methods that do not use execution and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone.
- Instead, we argue that these tasks (control-flow and data-flow) jointly model the intermediate behavior of a program as it executes. During execution, there is a rich and informative set of features in intermediate memory states that models can learn to drive both prediction tasks. Additionally, since programs are highly structured objects, static program syntax can supplement dynamic information with additional context about the program’s execution.
Contributions:
An extensible graph neural network based representation of code that fuses static and dynamic information into one graph.
A binary representation for dynamic memory states that generalizes better than scalar or categorical representations.
The first unified representation for both control-flow and data-flow tasks.
State-of-the-art performance, improving branch prediction accuracy by 26% and address prefetching accuracy by 45%.
We show that NCF representations pre-trained on branch prediction are useful for transfer learning, achieving competitive performance on an algorithm classification task.
As the performance of computer systems stagnates due to the end of Moore's Law, there is a need for new models that can understand and optimize the execution of general-purpose code. Recent work has started to frame many canonical tasks in computer architecture as analogous prediction problems, and has shown that deep learning has the potential to outperform traditional heuristics. Moreover, NCF is orthogonal to existing history-based methods, which could easily be combined with the learned representations to potentially boost accuracy further. To our knowledge, NCF is the first instance of a single model that can learn simultaneously on control-flow and data-flow tasks, setting the stage for teaching neural network models to better understand how programs execute.
3 key components: 1) Gated GNN; 2) static representation via assembly code; 3) binary memory snapshots to supply dynamic information;
https://arxiv.org/pdf/1511.05493.pdf
Big idea: Like all RNN models, we transfer one state/input to the next state/output;
A variant of RNN; same motivation as LSTM; why GRU: it performs similarly to LSTM but is computationally cheaper;
Given input x and last hidden state h, we compute z and r based on concat(x, h); z is for "update"; r is for "reset";
Candidate state: h~ = tanh(W · concat(x, r ⊙ h_{t-1})), where ⊙ is the Hadamard product; the reset gate r controls how much of the old state enters the candidate ("forgetting and selective memory");
Update: h_t = (1 − z) ⊙ h_{t-1} + z ⊙ h~; the update gate z interpolates between keeping the old state and adopting the candidate;
The magic behind GRU is this ability to control "forget" and "update" of memory;
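The steps above as a single GRU cell in NumPy (a minimal sketch; bias terms omitted, weight names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, Wz, Wr, Wh):
    """One GRU step. Each W* maps concat(x, h) to the hidden dim."""
    xh = np.concatenate([x, h_prev])
    z = sigmoid(Wz @ xh)                      # update gate
    r = sigmoid(Wr @ xh)                      # reset gate
    # candidate state: reset gate scales how much old state leaks in
    h_cand = np.tanh(Wh @ np.concatenate([x, r * h_prev]))
    # interpolate between old state and candidate
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
Wz, Wr, Wh = (rng.standard_normal((d_h, d_in + d_h)) for _ in range(3))
h = gru_cell(rng.standard_normal(d_in), np.zeros(d_h), Wz, Wr, Wh)
print(h.shape)  # (4,)
```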
Input: a graph; output: a sequence; (previously, most GNNs only output a single value)
Typically, GGNN runs T propagation steps, extending the GRU's output sequence at each step, and trains with backpropagation through time;
So GGNN does aggregation using a GRU update; each aggregation step yields an output, and these per-step outputs form the sequence;
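A toy sketch of this propagation loop (my own simplification of Li et al.'s scheme: a node's GRU input is just the sum of its neighbors' states, with edge-type-specific transforms omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru(a, h, Wz, Wr, Wh):
    ah = np.concatenate([a, h])
    z, r = sigmoid(Wz @ ah), sigmoid(Wr @ ah)
    h_cand = np.tanh(Wh @ np.concatenate([a, r * h]))
    return (1 - z) * h + z * h_cand

def ggnn(A, H, params, T):
    """Run T propagation steps; collect node states after each step.

    A: (N, N) adjacency matrix; H: (N, d) node states.
    """
    outputs = []
    for _ in range(T):
        M = A @ H  # message aggregation: sum of neighbor states
        H = np.stack([gru(M[v], H[v], *params) for v in range(len(H))])
        outputs.append(H)  # one output per step -> a sequence
    return outputs

rng = np.random.default_rng(0)
N, d = 4, 3
params = [rng.standard_normal((d, 2 * d)) for _ in range(3)]
outs = ggnn((rng.random((N, N)) > 0.5).astype(float),
            rng.standard_normal((N, d)), params, T=5)
print(len(outs), outs[0].shape)  # 5 (4, 3)
```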
static: assembly code; dynamic: registers + recent memory (because the whole memory state is too large to process)
the base task is about branch prediction (control-flow) and prefetching (data-flow);
These values (binary patterns) are converted into initial node embeddings via a learned embedding layer.
Node types: task nodes (masked with 1) vs. non-task nodes (masked with 0);
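A sketch of how a binary value representation could feed the embedding layer (my own simplified scheme: one learned embedding per (bit position, bit value) pair, averaged into the initial node embedding; the paper's exact fusion may differ):

```python
import numpy as np

def value_to_bits(value, n_bits=64):
    """Represent a register/memory value as a bit vector (LSB first);
    binary patterns generalize better than scalar/categorical encodings."""
    return np.array([(value >> i) & 1 for i in range(n_bits)], dtype=np.int64)

def embed_value(value, emb_table, n_bits=64):
    """Look up an embedding per (bit position, bit value) and average
    into an initial node embedding."""
    bits = value_to_bits(value, n_bits)
    return emb_table[np.arange(n_bits), bits].mean(axis=0)

rng = np.random.default_rng(0)
emb_table = rng.standard_normal((64, 2, 8))  # (position, bit value, dim)
node_emb = embed_value(0xDEADBEEF, emb_table)
print(node_emb.shape)  # (8,)
```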
Scaling to large programs. For some large-scale programs, it is unrealistic to utilize the static graph built on the entire assembly file. For example, the gcc benchmark studied in this paper has more than 500K instructions. To handle large graph sizes, we only use a part of the graph that is close to the task node in a snapshot. Given a snapshot, we discard instruction nodes and their child nodes if an instruction node is over 100 steps to the task node. As in Li et al. (2015), the message propagation mechanism runs for a fixed number of steps and only nodes whose distances to the task node are within the total number of propagation steps will affect the prediction.
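The truncation described above amounts to a bounded BFS from the task node; a minimal sketch (assuming a plain dict-of-neighbors graph, which is my own simplification of the instruction-node structure):

```python
from collections import deque

def truncate_graph(adj, task_node, max_steps=100):
    """Keep only nodes within max_steps hops of the task node (BFS),
    mirroring how distant instruction nodes are discarded.

    adj: dict mapping node -> iterable of neighbor nodes.
    """
    dist = {task_node: 0}
    q = deque([task_node])
    while q:
        u = q.popleft()
        if dist[u] == max_steps:
            continue  # nodes beyond the hop budget are never reached
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return set(dist)

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(truncate_graph(adj, 0, max_steps=2))  # {0, 1, 2}
```

This also matches the propagation argument: nodes farther than the number of propagation steps cannot influence the task node's final state, so dropping them changes nothing.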
Amazing...
https://arxiv.org/abs/1906.07181