NorbertZheng / read-papers

My paper reading notes.

Edward Ma | Random Walk in Node Embeddings (DeepWalk, node2vec, LINE, and GraphSAGE). #47

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Edward Ma. Random Walk in Node Embeddings (DeepWalk, node2vec, LINE, and GraphSAGE).

NorbertZheng commented 1 year ago

Overview

Instead of using traditional machine learning classifiers, we can consider using a graph neural network (GNN) to solve node classification problems.

The usefulness of graph properties rests on the assumption that individual nodes are correlated with other similar nodes.

A typical example is a social media network. Imagine how Facebook connects you to somebody else based on which posts you like, where you check in, etc. A graph can represent this kind of relationship, and we can leverage it to train a GNN. Detailed use cases of GNNs will be covered in later stories.

We went through knowledge graph embeddings in the last story, which covered a framework to train graph embeddings via TransE and ComplEx at large scale. In this story, we would like to talk about graph structure and random walk-based models for learning graph embeddings. The following sections cover DeepWalk (Perozzi et al., 2014), node2vec (Grover and Leskovec, 2016), LINE (Tang et al., 2015), and GraphSAGE (Hamilton et al., 2018).

NorbertZheng commented 1 year ago

Graph Structure

Before going through those random walk-based models, we need to understand some basic graph structures.

NorbertZheng commented 1 year ago

DeepWalk

DeepWalk (Perozzi et al., 2014) learns node embeddings via random walks and the word2vec (Mikolov et al., 2013) algorithm. In natural language processing (NLP), we can apply the skip-gram model and feed sentences (series of words) to train word embeddings. In short, the training objective is to use the center word to predict the surrounding words. You may check this story for details.

Going back to DeepWalk, it picks a node randomly and "walks" to a neighbor node randomly until it reaches a maximum length (or some random length). Given the following knowledge graph (upper part), we pick "A" and "F" as starting points respectively. Later on, we move from node "A" to node "C" and so on.

image

Above: Knowledge Graph. Below: Random Walk Sequence

In this case, we treat a node and a series of nodes as a "word" and a "sentence" respectively. Later on, we can reuse the skip-gram algorithm to train the node embeddings until they converge.
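
A minimal sketch of this idea, assuming `networkx` and `gensim` are available; the toy edge list below is illustrative, not taken from the paper:

```python
# Minimal DeepWalk-style sketch: uniform random walks + skip-gram over the walks.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(graph, start, length=10):
    """Uniformly walk from `start` for up to `length` steps, returning node names as strings."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(node) for node in walk]

# Toy graph (hypothetical, loosely matching the figure).
G = nx.Graph([("A", "C"), ("C", "B"), ("C", "D"), ("D", "F"), ("F", "E")])

# Treat each walk as a "sentence" and each node as a "word".
walks = [random_walk(G, node) for node in G.nodes() for _ in range(20)]

# Skip-gram (sg=1) over the walks yields the node embeddings.
model = Word2Vec(walks, vector_size=32, window=3, sg=1, min_count=0, epochs=10)
print(model.wv["A"])  # embedding for node "A"
```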

NorbertZheng commented 1 year ago

node2vec

node2vec (Grover and Leskovec, 2016) is an advanced version of DeepWalk (Perozzi et al., 2014). One limitation of DeepWalk (Perozzi et al., 2014) is that you cannot control the walk path. node2vec (Grover and Leskovec, 2016) also uses random behavior, but puts weights on it.

Instead of "walk" randomly, authors introduce Breadth-First-Sampling (BFS) and Depth-First-Sampling (DFS) to control the random behavior. BFS reaches immediate neighbors while DFS prefer node away from the source.

image

BFS: Breadth-First-Sampling. DFS: Depth-First-Sampling (Grover and Leskovec, 2016)

To provide this flexibility, the random walk probability is no longer unweighted; it is biased by the rule below. $d_{tx}$ is the shortest-path distance between the previous node $t$ and the candidate next node $x$, while $p$ and $q$ are parameters. A larger $p$ means there is less chance of resampling the node just visited, and $q$ controls whether we prefer BFS- or DFS-like behavior: a larger $q$ tends to keep the random walk close to the source.

$$ \alpha_{pq}(t,x)= \begin{cases} \frac{1}{p} & \text{if } d_{tx}=0, \\ 1 & \text{if } d_{tx}=1, \\ \frac{1}{q} & \text{if } d_{tx}=2. \end{cases} $$
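
A minimal sketch of the biased next-step sampling, assuming `networkx`; the helper name and parameter choices are my own, not from the paper's reference implementation:

```python
# node2vec-style biased sampling: weight each candidate x by alpha_pq(t, x) above.
import random
import networkx as nx

def next_node(graph, prev, curr, p=1.0, q=1.0):
    """Sample the next node x given the previous node t=`prev` and current node v=`curr`."""
    candidates, weights = [], []
    for x in graph.neighbors(curr):
        if x == prev:                      # d_tx = 0: returning to t
            alpha = 1.0 / p
        elif graph.has_edge(prev, x):      # d_tx = 1: x is also a neighbor of t
            alpha = 1.0
        else:                              # d_tx = 2: moving away from t
            alpha = 1.0 / q
        candidates.append(x)
        weights.append(alpha)
    return random.choices(candidates, weights=weights, k=1)[0]

G = nx.karate_club_graph()
walk = [0, 1]
for _ in range(8):
    walk.append(next_node(G, walk[-2], walk[-1], p=2.0, q=0.5))
print(walk)
```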

NorbertZheng commented 1 year ago

LINE

LINE (Tang et al., 2015), a.k.a. Large-scale Information Network Embedding, proposes to use first-order proximity and second-order proximity to learn node embeddings.

By minimizing the first-order proximity objective, connected nodes converge toward each other in the embedding space. Note that first-order proximity is only applicable to an undirected graph.

$$ O_{1}=-\sum_{(i,j) \in E}w_{ij}\log p_{1}(v_{i},v_{j}), $$

Similarly, we minimize the second-order proximity objective so that nodes which share similar neighbors converge toward each other.

$$ O_{2}=-\sum_{(i,j) \in E}w_{ij}\log p_{2}(v_{j}|v_{i}), $$
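
A small NumPy sketch of the two objectives, using the paper's probability forms ($p_{1}$ as a sigmoid of the embedding dot product, $p_{2}$ as a softmax over context embeddings); the random embeddings and the tiny weighted edge list are illustrative assumptions:

```python
# Toy evaluation of LINE's first-order (O1) and second-order (O2) objectives.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8
U = rng.normal(size=(n, d))        # node embeddings u_i
C = rng.normal(size=(n, d))        # context embeddings u'_j (second-order only)
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0)]  # (i, j, w_ij)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# First-order: p1(v_i, v_j) = sigmoid(u_i . u_j)
O1 = -sum(w * np.log(sigmoid(U[i] @ U[j])) for i, j, w in edges)

# Second-order: p2(v_j | v_i) = softmax of context scores u'_k . u_i
def p2(j, i):
    scores = C @ U[i]
    return np.exp(scores[j]) / np.exp(scores).sum()

O2 = -sum(w * np.log(p2(j, i)) for i, j, w in edges)
print(O1, O2)
```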

NorbertZheng commented 1 year ago

GraphSAGE

GraphSAGE (Hamilton et al., 2018), a.k.a. Graph SAmple and aggreGatE, is a model that generates node embeddings on the fly. Unlike other models, it does not train node-specific embeddings but instead trains an aggregator. Due to this nature, it can handle unseen nodes.

This approach includes two major steps:

1. Sample a fixed-size set of neighbor nodes for each node.
2. Aggregate feature information from the sampled neighbors to compute the node's embedding.

The following figure illustrates these steps:

image

As mentioned before, it only samples some neighbor nodes for training in every mini-batch. A random walk approach is used, and it is unweighted (the same as DeepWalk).

But how do we aggregate them? The authors proposed four aggregators: the mean aggregator, GCN aggregator, LSTM aggregator, and pooling aggregator.
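
A minimal NumPy sketch of one layer with the mean aggregator; the weight shapes, toy features, and adjacency lists are illustrative assumptions, not the paper's reference implementation:

```python
# One GraphSAGE-style layer: sample neighbors, mean-aggregate, concat with self, project.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 4, 6, 8
H = rng.normal(size=(n, d_in))                 # input node features h_v^{k-1}
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
W = rng.normal(size=(2 * d_in, d_out))         # weights for concat(self, neighbor mean)

def sage_mean_layer(H, neighbors, W, num_samples=2):
    H_next = np.zeros((len(H), W.shape[1]))
    for v, neigh in neighbors.items():
        # Sample a fixed-size neighborhood, then average the sampled neighbors' features.
        sampled = rng.choice(neigh, size=min(num_samples, len(neigh)), replace=False)
        h_neigh = H[sampled].mean(axis=0)
        # Concatenate self and aggregated features, project, apply ReLU.
        H_next[v] = np.maximum(np.concatenate([H[v], h_neigh]) @ W, 0)
    # Normalize each embedding to unit length, as in the paper's algorithm.
    return H_next / np.linalg.norm(H_next, axis=1, keepdims=True)

print(sage_mean_layer(H, neighbors, W).shape)  # (4, 8)
```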

NorbertZheng commented 1 year ago

Take Away

NorbertZheng commented 1 year ago

Extension Reading

NorbertZheng commented 1 year ago

Reference