ganler / ResearchReading

General systems research reading notes (not limited to papers).

ICLR'20 | Learning Execution through Neural Code Fusion #50

Closed ganler closed 3 years ago

ganler commented 3 years ago

https://arxiv.org/abs/1906.07181

ganler commented 3 years ago

A quick review of some GNN background: https://www.youtube.com/watch?v=eybCCtNKwzA

NN4G

DCNN (Diffusion-Convolutional NN)

Node features are computed at each diffusion hop, and the features (H_k) from all hops/layers are combined to form the final representation; see the sketch below.

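A minimal numpy sketch of the diffusion idea (my own illustration, not the lecture's or paper's code): per-hop features are computed with a row-normalized adjacency matrix, and the features from every hop are kept and stacked.

```python
import numpy as np

def dcnn_features(A, X, num_hops=3):
    """A: (N, N) adjacency matrix, X: (N, F) node features.
    Returns the features from every diffusion hop, stacked as (num_hops, N, F)."""
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.maximum(deg, 1.0)      # row-normalized transition matrix, D^-1 A
    hops, H = [], X
    for _ in range(num_hops):
        H = P @ H                     # diffuse the features one more hop
        hops.append(H)
    return np.stack(hops)             # "features (H_k) from all layers"

A = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
X = np.eye(3)
print(dcnn_features(A, X).shape)      # (3, 3, 3)
```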

MoNet

GraphSAGE

Aggregation: mean, max pool, or LSTM;

LSTM is a sequential (order-sensitive) model, so to make the aggregation order-invariant, neighbors are fed to it in a random order (random sampling/permutation); see the sketch below.
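A minimal sketch of one GraphSAGE-style layer with mean aggregation (the weights, nonlinearity, and function names are illustrative assumptions, not the paper's code); for the LSTM aggregator one would shuffle each neighbor list before feeding it in, as noted above.

```python
import numpy as np

def graphsage_layer(H, neighbors, W_self, W_neigh):
    """H: (N, F) node features; neighbors[v]: neighbor indices of node v.
    W_self, W_neigh: (F, F') weight matrices."""
    out = []
    for v, nbrs in enumerate(neighbors):
        # mean aggregation; for an LSTM aggregator, shuffle nbrs first (order-invariance)
        agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(H.shape[1])
        out.append(np.tanh(H[v] @ W_self + agg @ W_neigh))
    return np.stack(out)              # updated node representations, (N, F')

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
out = graphsage_layer(H, [[1, 2], [0], [0]],
                      rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(out.shape)                      # (3, 8)
```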

GAT (Graph Attention Network)

Applies attention over a node's neighbors;

For each pair of neighboring nodes (i.e., each directed edge), compute an attention score ("energy"); this score is used as the edge weight when aggregating.
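A minimal single-head sketch of this attention computation (illustrative; the projection W and attention vector a are assumptions): the per-edge "energy" is normalized over the neighbors and used as edge weights.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_node_update(h_i, h_nbrs, W, a):
    """h_i: (F,) target node; h_nbrs: (K, F) neighbors; W: (F, Fp); a: (2*Fp,)."""
    z_i, z_nbrs = h_i @ W, h_nbrs @ W
    pairs = np.concatenate([np.tile(z_i, (len(z_nbrs), 1)), z_nbrs], axis=1)
    e = leaky_relu(pairs @ a)                          # per-edge "energy"
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # softmax over neighbors -> weights
    return np.tanh((alpha[:, None] * z_nbrs).sum(axis=0))

rng = np.random.default_rng(0)
out = gat_node_update(rng.normal(size=4), rng.normal(size=(3, 4)),
                      rng.normal(size=(4, 8)), rng.normal(size=16))
print(out.shape)                                       # (8,)
```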

GIN (Graph Isomorphism Network, ICLR 2019)

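The GIN update can be summarized as h_v ← MLP((1 + ε) · h_v + Σ_{u ∈ N(v)} h_u); a minimal sketch (the MLP and ε are placeholders):

```python
import numpy as np

def gin_layer(H, neighbors, mlp, eps=0.0):
    """H: (N, F); neighbors[v]: neighbor indices of node v; mlp: callable (F,) -> (F',)."""
    out = []
    for v, nbrs in enumerate(neighbors):
        agg = H[nbrs].sum(axis=0) if nbrs else np.zeros(H.shape[1])
        out.append(mlp((1 + eps) * H[v] + agg))   # injective sum aggregation + MLP
    return np.stack(out)

H = np.eye(3)
print(gin_layer(H, [[1, 2], [0], [0]], mlp=lambda x: np.tanh(2 * x)).shape)  # (3, 3)
```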

ganler commented 3 years ago

Quick review of NCF (Neural Code Fusion)

  • We show that this leads to improved performance over similar methods that do not use execution and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone.
  • Instead, we argue that these tasks (control-flow and data-flow) jointly model the intermediate behavior of a program as it executes. During execution, there is a rich and informative set of features in intermediate memory states that models can learn to drive both prediction tasks. Additionally, since programs are highly structured objects, static program syntax can supplement dynamic information with additional context about the program’s execution.

Contributions:

  • An extensible graph neural network based representation of code that fuses static and dynamic information into one graph.
  • A binary representation for dynamic memory states that generalizes better than scalar or categorical representations.
  • The first unified representation for both control-flow and data-flow tasks.
  • State-of-the-art performance, improving branch prediction accuracy by 26% and address prefetching accuracy by 45%.
  • We show that NCF representations pre-trained on branch prediction are useful for transfer learning, achieving competitive performance on an algorithm classification task.

Nice sentences

As the performance of computer systems stagnates due to the end of Moore’s Law, there is a need for new models that can understand and optimize the execution of general purpose code. Recent work has started to frame many canonical tasks in computer architecture as analogous prediction problems, and have shown that deep learning has the potential to outperform traditional heuristics. Moreover, NCF is orthogonal to existing history-based methods, and could easily combine them with our learned representations to potentially boost accuracy further. To our knowledge, NCF is the first instance of a single model that can learn simultaneously on control-flow and data-flow tasks, setting the stage for teaching neural network models to better understand how programs execute.

ganler commented 3 years ago

Background

3 key components: 1) a gated GNN; 2) a static representation via assembly code; 3) binary memory snapshots that provide the dynamic information;

GGNN

https://arxiv.org/pdf/1511.05493.pdf

GRU

Big idea: like all RNN models, the hidden state is carried from one step to the next, combining the previous state with the current input to produce the next state/output;

A variant of RNN with the same motivation as LSTM; why GRU: it performs similarly to LSTM but is computationally cheaper;

Given input x and the last hidden state h, we compute z and r from concat(x, h); z is the "update" gate; r is the "reset" gate;

Candidate state: h' = tanh(W · concat(x, r ⊙ h)), where ⊙ is the Hadamard product (r resets parts of h); new state: h_new = (1 − z) ⊙ h + z ⊙ h' (one common convention; some formulations swap the roles of z and 1 − z)


The magic behind GRU is its ability to control what to "forget" and what to "update" in memory;

update: h_t = (1 − z) ⊙ h_{t−1} + z ⊙ h', where the candidate h' is controlled by h_{t−1}, x, and r; r is what enables "forgetting and selective memory"; see the sketch below.
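A minimal numpy sketch of one GRU step matching the equations above (shapes and weight names are illustrative; bias terms are omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, Wz, Wr, Wh):
    """x: (I,) input; h_prev: (H,) last hidden state; Wz, Wr, Wh: (H, I + H)."""
    xh = np.concatenate([x, h_prev])
    z = sigmoid(Wz @ xh)                                    # update gate
    r = sigmoid(Wr @ xh)                                    # reset gate
    h_cand = np.tanh(Wh @ np.concatenate([x, r * h_prev]))  # candidate state h'
    return (1 - z) * h_prev + z * h_cand                    # new hidden state h_t
```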

GRU-based GNN

Input: a graph; output: a sequence (previously, most GNNs only output a single value);

Typically GGNN runs the GRU-style propagation for T steps and can emit an output at each step, expanding the output sequence; training uses backpropagation through time;

So GGNN aggregates neighbor messages and updates node states with a GRU; each aggregation step can produce an output, and these outputs form the sequence; see the sketch below.
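A minimal sketch of GGNN-style propagation (my own illustration, simplified to a single edge type): at each step every node aggregates its neighbors' states into a message and updates its own state with a GRU cell such as the one above.

```python
import numpy as np

def ggnn_propagate(A, H, gru_step, num_steps=5):
    """A: (N, N) adjacency matrix; H: (N, F) initial node states;
    gru_step(message, h) -> new h, e.g. a GRU cell applied per node."""
    for _ in range(num_steps):
        M = A @ H                      # aggregate neighbor states into messages
        H = np.stack([gru_step(M[v], H[v]) for v in range(len(H))])
        # an output/readout can be taken here at every step to build a sequence
    return H

rng = np.random.default_rng(0)
A = (rng.random((4, 4)) < 0.5).astype(float)
H0 = rng.normal(size=(4, 8))
# stand-in for a real GRU update, just to show the shapes
print(ggnn_propagate(A, H0, gru_step=lambda m, h: np.tanh(m + h)).shape)  # (4, 8)
```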

Program representation

static: assembly code; dynamic: register values + recently accessed memory (since the full memory state is too large to use);

the base tasks are branch prediction (control flow) and address prefetching (data flow);

ganler commented 3 years ago

Representation

Convert the assembly code into a graph;


How to fuse dynamic information into the static code graph:

These values (binary patterns) are converted into initial node embeddings via a learned embedding layer.
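A minimal sketch of how such a binary representation could be embedded (an assumption for illustration; the paper's exact embedding layer may differ): each register/memory value is split into bits, each (position, bit) pair looks up a learned vector, and the vectors are summed to form the node's initial embedding.

```python
import numpy as np

def value_to_bits(value, num_bits=64):
    return np.array([(value >> i) & 1 for i in range(num_bits)])

def binary_node_embedding(value, bit_table):
    """bit_table: (num_bits, 2, D) learned table indexed by (bit position, bit value)."""
    bits = value_to_bits(value, bit_table.shape[0])
    return bit_table[np.arange(len(bits)), bits].sum(axis=0)  # (D,) initial node embedding

rng = np.random.default_rng(0)
table = rng.normal(size=(64, 2, 16))                # stand-in for learned parameters
print(binary_node_embedding(0xDEADBEEF, table).shape)  # (16,)
```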

Task

Node types: task node (masked with 1) or non-task node (masked with 0);

Challenges

Scaling to large programs. For some large-scale programs, it is unrealistic to utilize the static graph built on the entire assembly file. For example, the gcc benchmark studied in this paper has more than 500K instructions. To handle large graph sizes, we only use a part of the graph that is close to the task node in a snapshot. Given a snapshot, we discard instruction nodes and their child nodes if an instruction node is over 100 steps to the task node. As in Li et al. (2015), the message propagation mechanism runs for a fixed number of steps and only nodes whose distances to the task node are within the total number of propagation steps will affect the prediction.
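A minimal sketch of this pruning step (illustrative; the graph representation is an assumption): BFS from the task node and keep only nodes within the step cutoff, so a fixed number of propagation steps can actually reach everything that is retained.

```python
from collections import deque

def prune_to_neighborhood(adj, task_node, max_steps=100):
    """adj: dict mapping each node to its neighbor list (undirected view of the code graph).
    Returns the set of nodes within max_steps of the task node."""
    dist = {task_node: 0}
    queue = deque([task_node])
    while queue:
        v = queue.popleft()
        if dist[v] == max_steps:
            continue                  # beyond the cutoff, stop expanding
        for u in adj.get(v, []):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return set(dist)                   # nodes kept for this snapshot's graph
```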

Evaluation


Amazing...