Just a quick review of some GNN background. https://www.youtube.com/watch?v=eybCCtNKwzA
Given features (Hi) from all layers:
Aggregation: mean, max pool, or LSTM;
LSTM is inherently order-sensitive, so to approximate order invariance, we feed the sampled neighbors to it in a random order.
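A minimal sketch of these aggregators in plain NumPy (function names are mine; the learned transforms and the LSTM itself are omitted, only the shuffling trick is shown):

```python
import numpy as np

def aggregate(neighbor_feats, how="mean", rng=None):
    """Combine neighbor feature vectors into one (GraphSAGE-style).

    neighbor_feats: array of shape (num_neighbors, dim).
    """
    if how == "mean":
        return neighbor_feats.mean(axis=0)
    if how == "maxpool":
        # element-wise max (a learned per-neighbor transform would precede this)
        return neighbor_feats.max(axis=0)
    if how == "lstm":
        # LSTM is order-sensitive: shuffle neighbors to approximate
        # permutation invariance, then feed them to an LSTM (not shown)
        rng = rng or np.random.default_rng(0)
        return neighbor_feats[rng.permutation(len(neighbor_feats))]
    raise ValueError(how)

h = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
print(aggregate(h, "mean"))     # [3. 2.]
print(aggregate(h, "maxpool"))  # [5. 4.]
```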
Apply attention over neighbors;
Calculate an "energy" e_ij for each pair of neighboring nodes (a directed edge weight); after normalization, this energy is used as the weight when combining the features of the node itself and its neighbors (possibly from different layers).
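The energy/weighting step can be sketched GAT-style in NumPy (my own simplified single-head version; `W` and `a` stand in for learned parameters):

```python
import numpy as np

def energy(h_i, h_j, W, a):
    """GAT-style energy for directed edge i->j:
    e_ij = LeakyReLU(a^T [W h_i || W h_j])."""
    z = np.concatenate([W @ h_i, W @ h_j])
    e = a @ z
    return np.where(e > 0, e, 0.2 * e)  # LeakyReLU

def attend(h_self, h_neighbors, W, a):
    # energies over self + neighbors, softmax-normalized into weights
    nodes = [h_self] + list(h_neighbors)
    e = np.array([energy(h_self, h, W, a) for h in nodes])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # weighted combination of transformed features
    return sum(w * (W @ h) for w, h in zip(alpha, nodes))

rng = np.random.default_rng(0)
W, a = rng.standard_normal((2, 2)), rng.standard_normal(4)
out = attend(rng.standard_normal(2), rng.standard_normal((3, 2)), W, a)
print(out.shape)  # (2,)
```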
- We show that this leads to improved performance over similar methods that do not use execution and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone.
- Instead, we argue that these tasks (control-flow and data-flow) jointly model the intermediate behavior of a program as it executes. During execution, there is a rich and informative set of features in intermediate memory states that models can learn to drive both prediction tasks. Additionally, since programs are highly structured objects, static program syntax can supplement dynamic information with additional context about the program’s execution.
Contributions:
An extensible graph neural network based representation of code that fuses static and dynamic information into one graph.
A binary representation for dynamic memory states that generalizes better than scalar or categorical representations.
The first unified representation for both control-flow and data-flow tasks.
State-of-the-art performance, improving branch prediction accuracy by 26% and address prefetching accuracy by 45%.
We show that NCF representations pre-trained on branch prediction are useful for transfer learning, achieving competitive performance on an algorithm classification task.
As the performance of computer systems stagnates due to the end of Moore's Law, there is a need for new models that can understand and optimize the execution of general-purpose code. Recent work has started to frame many canonical tasks in computer architecture as analogous prediction problems, and has shown that deep learning has the potential to outperform traditional heuristics. Moreover, NCF is orthogonal to existing history-based methods, which could easily be combined with the learned representations to potentially boost accuracy further. To our knowledge, NCF is the first instance of a single model that can learn simultaneously on control-flow and data-flow tasks, setting the stage for teaching neural network models to better understand how programs execute.
3 key components: 1) Gated GNN; 2) static representation via assembly code; 3) binary memory snapshots to supply dynamic information;
https://arxiv.org/pdf/1511.05493.pdf
Big idea: Like all RNN models, we transfer one state/input to the next state/output;
A variant of RNN; same motivation as LSTM; why GRU: it performs similarly to LSTM but is computationally cheaper;
Given input x and last hidden state h, we compute z and r based on concat(x, h); z is for "update"; r is for "reset";
Candidate state: h~ = tanh(W · concat(x, r ⊙ h_{t-1})), where ⊙ is the Hadamard product; the reset gate r controls how much of the old state enters the candidate ("forgetting and selective memory");
Update: h_t = (1 − z) ⊙ h_{t-1} + z ⊙ h~; the update gate z interpolates between keeping the old state and adopting the candidate;
The magic behind GRU is this ability to control "forget" and "update" of memory;
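The steps above as a single GRU cell in NumPy (a minimal sketch; bias terms omitted, weight names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, Wz, Wr, Wh):
    """One GRU step. Each W* maps concat(x, h) to the hidden dim."""
    xh = np.concatenate([x, h_prev])
    z = sigmoid(Wz @ xh)                      # update gate
    r = sigmoid(Wr @ xh)                      # reset gate
    # candidate state: reset gate scales how much old state leaks in
    h_cand = np.tanh(Wh @ np.concatenate([x, r * h_prev]))
    # interpolate between old state and candidate
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
Wz, Wr, Wh = (rng.standard_normal((d_h, d_in + d_h)) for _ in range(3))
h = gru_cell(rng.standard_normal(d_in), np.zeros(d_h), Wz, Wr, Wh)
print(h.shape)  # (4,)
```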
Input: a graph; output: a sequence; (previously, most GNNs only output a single value)
Typically, GGNN runs T propagation steps, extending the GRU's output sequence at each step, and trains with backpropagation through time;
So GGNN does aggregation using a GRU update; each aggregation step yields an output, and these per-step outputs form the sequence;
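A toy sketch of this propagation loop (my own simplification of Li et al.'s scheme: a node's GRU input is just the sum of its neighbors' states, with edge-type-specific transforms omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru(a, h, Wz, Wr, Wh):
    ah = np.concatenate([a, h])
    z, r = sigmoid(Wz @ ah), sigmoid(Wr @ ah)
    h_cand = np.tanh(Wh @ np.concatenate([a, r * h]))
    return (1 - z) * h + z * h_cand

def ggnn(A, H, params, T):
    """Run T propagation steps; collect node states after each step.

    A: (N, N) adjacency matrix; H: (N, d) node states.
    """
    outputs = []
    for _ in range(T):
        M = A @ H  # message aggregation: sum of neighbor states
        H = np.stack([gru(M[v], H[v], *params) for v in range(len(H))])
        outputs.append(H)  # one output per step -> a sequence
    return outputs

rng = np.random.default_rng(0)
N, d = 4, 3
params = [rng.standard_normal((d, 2 * d)) for _ in range(3)]
outs = ggnn((rng.random((N, N)) > 0.5).astype(float),
            rng.standard_normal((N, d)), params, T=5)
print(len(outs), outs[0].shape)  # 5 (4, 3)
```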
static: assembly code; dynamic: registers + recent memory (because the whole memory state is too large to process)
the base task is about branch prediction (control-flow) and prefetching (data-flow);
These values (binary patterns) are converted into initial node embeddings via a learned embedding layer.
Node types: task nodes (masked with 1) vs. non-task nodes (masked with 0);
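A sketch of how a binary value representation could feed the embedding layer (my own simplified scheme: one learned embedding per (bit position, bit value) pair, averaged into the initial node embedding; the paper's exact fusion may differ):

```python
import numpy as np

def value_to_bits(value, n_bits=64):
    """Represent a register/memory value as a bit vector (LSB first);
    binary patterns generalize better than scalar/categorical encodings."""
    return np.array([(value >> i) & 1 for i in range(n_bits)], dtype=np.int64)

def embed_value(value, emb_table, n_bits=64):
    """Look up an embedding per (bit position, bit value) and average
    into an initial node embedding."""
    bits = value_to_bits(value, n_bits)
    return emb_table[np.arange(n_bits), bits].mean(axis=0)

rng = np.random.default_rng(0)
emb_table = rng.standard_normal((64, 2, 8))  # (position, bit value, dim)
node_emb = embed_value(0xDEADBEEF, emb_table)
print(node_emb.shape)  # (8,)
```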
Scaling to large programs. For some large-scale programs, it is unrealistic to utilize the static graph built on the entire assembly file. For example, the gcc benchmark studied in this paper has more than 500K instructions. To handle large graph sizes, we only use a part of the graph that is close to the task node in a snapshot. Given a snapshot, we discard instruction nodes and their child nodes if an instruction node is over 100 steps to the task node. As in Li et al. (2015), the message propagation mechanism runs for a fixed number of steps and only nodes whose distances to the task node are within the total number of propagation steps will affect the prediction.
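The truncation described above amounts to a bounded BFS from the task node; a minimal sketch (assuming a plain dict-of-neighbors graph, which is my own simplification of the instruction-node structure):

```python
from collections import deque

def truncate_graph(adj, task_node, max_steps=100):
    """Keep only nodes within max_steps hops of the task node (BFS),
    mirroring how distant instruction nodes are discarded.

    adj: dict mapping node -> iterable of neighbor nodes.
    """
    dist = {task_node: 0}
    q = deque([task_node])
    while q:
        u = q.popleft()
        if dist[u] == max_steps:
            continue  # nodes beyond the hop budget are never reached
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return set(dist)

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(truncate_graph(adj, 0, max_steps=2))  # {0, 1, 2}
```

This also matches the propagation argument: nodes farther than the number of propagation steps cannot influence the task node's final state, so dropping them changes nothing.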
Amazing...
https://arxiv.org/abs/1906.07181