Closed · familyld closed this 5 years ago
@BarclayII Do you have any ideas?
The current random walk with restart is entirely based on Pixie (and PinSage), where you can only tune either the maximum number of nodes visited or the threshold for frequently visited nodes. So, no: currently there is no way to directly control the number of traces found for each node.
You can probably try relaxing max_nodes_per_seed to a bigger number (greater than 20) and then take the top 20 most important nodes as neighbors.
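A rough sketch of that idea (assuming random_walk_sampler is the helper from the PinSage example that returns one list of traces per seed; the numbers are illustrative):

import torch

# Over-sample: allow each seed to visit many more nodes than we ultimately need,
# then keep only the 20 most frequently visited ones.
max_nodes_per_seed = 100          # much larger than the 20 neighbors we want
traces = random_walk_sampler(G, nodeset, restart_prob, max_nodes_per_seed)

top_neighbors = []
for per_seed_traces in traces:
    visited = torch.cat(per_seed_traces)
    counts = torch.bincount(visited, minlength=num_items).float()
    # Treat the most frequently visited nodes as the most important neighbors.
    top_neighbors.append(counts.topk(20).indices)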
@BarclayII I tried but it's not helpful. The number of traces and the length of each trace depend on the restart probability. Even if you set a large max_nodes_per_seed, you can't get that many nodes in a trace if the restart probability is not low enough. However, if the restart probability is low, the number of traces will decrease too. That's why I feel so helpless about solving this problem. It's like a paradox.
The number of traces also depends on max_nodes_per_seed, so you could try lowering the restart probability and raising max_nodes_per_seed.
Alternatively, you can set max_nodes_per_seed to a really big number, and additionally set max_frequent_visited_nodes = 20 and max_visit_counts to some number greater than 1.
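A hedged sketch of that parameter combination (the keyword names are the parameters discussed above; the import path is an assumption, since the function lived under a contrib module in older DGL releases, so check your installed version's docs):

# NOTE: the import path below is an assumption for contrib-era DGL releases.
from dgl.contrib.sampling import bipartite_single_sided_random_walk_with_restart

traces = bipartite_single_sided_random_walk_with_restart(
    G, nodeset,
    restart_prob=0.2,                 # low enough that individual traces can grow long
    max_nodes_per_seed=10000,         # effectively unbounded; let the other limits stop the walk
    max_visit_counts=2,               # a node counts as "frequent" only after more than one visit
    max_frequent_visited_nodes=20,    # stop once 20 such frequently visited nodes are found
)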
If you only care about selecting nodes with K-highest personalized PageRank (which is traditionally computed by random walk with restart), then you can instead pre-compute the personalized PageRank scores for all the nodes and construct a new graph, where an edge connects u to v if PPR_u(v) > 0.001 (or if (u, v) is among the K-highest PPR). I wrote an example gist for this purpose with numba so it should be quite efficient.
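For reference, here is a minimal sketch of that pre-computation (this is not the gist mentioned above; it uses plain NumPy/SciPy power iteration instead of numba, assumes the graph is given as a SciPy CSR adjacency matrix, and would need adapting for a bipartite graph):

import numpy as np
import scipy.sparse as sp

def personalized_pagerank(adj, seed, restart_prob=0.5, num_iters=50):
    """PPR vector of `seed` via power iteration: p = (1 - c) * P^T p + c * e_seed."""
    n = adj.shape[0]
    deg = np.asarray(adj.sum(axis=1)).ravel()
    deg[deg == 0] = 1                      # avoid division by zero for isolated nodes
    trans = sp.diags(1.0 / deg) @ adj      # row-normalized transition matrix P
    e = np.zeros(n)
    e[seed] = 1.0
    p = e.copy()
    for _ in range(num_iters):
        p = (1 - restart_prob) * (trans.T @ p) + restart_prob * e
    return p

def build_ppr_graph(adj, top_k=20, restart_prob=0.5):
    """Return (src, dst, weight) arrays connecting each node to its top-K PPR nodes."""
    src, dst, weight = [], [], []
    for u in range(adj.shape[0]):
        ppr = personalized_pagerank(adj, u, restart_prob)
        ppr[u] = 0.0                       # ignore the seed itself
        # Alternatively, threshold with ppr > 0.001 instead of taking the top K.
        nbrs = np.argsort(-ppr)[:top_k]
        nbrs = nbrs[ppr[nbrs] > 0]         # keep only genuinely reachable nodes
        src.extend([u] * len(nbrs))
        dst.extend(nbrs.tolist())
        weight.extend(ppr[nbrs].tolist())
    return np.array(src), np.array(dst), np.array(weight)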
@BarclayII Thanks a lot. I will try tuning these parameters.
Hi @BarclayII, is there an efficient way to choose variable-length neighbors for each node? Sorry to bother you, but I really don't know how to modify this code:
def random_walk_distribution(G, nodeset, num_items, restart_prob, max_nodes):
    n_nodes = nodeset.shape[0]
    n_available_nodes = num_items
    # One list of traces per seed node; each trace is a tensor of visited node IDs.
    traces = random_walk_sampler(G, nodeset, restart_prob, max_nodes)
    visited_counts = torch.zeros(n_nodes, n_available_nodes)
    for i in range(n_nodes):
        visited_nodes = torch.cat(traces[i])
        # Count how many times each node was visited by the walks from seed i.
        visited_counts[i].scatter_add_(0, visited_nodes, torch.ones_like(visited_nodes, dtype=torch.float32))
    return visited_counts
def random_walk_distribution_topt(G, nodeset, num_items, restart_prob, max_nodes, top_T):
    '''
    Returns the top T important neighbors of each node in nodeset, as well as
    the weights of the neighbors.
    '''
    visited_prob = random_walk_distribution(G, nodeset, num_items, restart_prob, max_nodes)
    # Keep the top_T most visited nodes per seed and normalize their visit counts into weights.
    weights, nodes = visited_prob.topk(top_T, 1)
    weights = weights / weights.sum(1, keepdim=True)
    return weights, nodes
I still want to preserve the weight of each neighbor and filter out those with small weights. But some nodes have only 1 or 2 neighbors in my case, and using topk will cause unexpected results, i.e., random nodes are chosen as neighbors just to satisfy the requirement of k neighbors. How can I avoid this issue?
Thank you!
But some nodes have only 1 or 2 neighbors in my case and using topk will cause unexpected results,
Why would a node only have 1 or 2 neighbors? Does that mean the node is in a connected component with only three nodes?
But some nodes have only 1 or 2 neighbors in my case and using topk will cause unexpected results,
Why would a node only have 1 or 2 neighbors? Does that mean the node is in a connected component with only three nodes?
Yes.
In this case, it's probably not a good idea to randomly pad the neighbors and gather their information, since the "padding nodes" would be totally irrelevant to the node being considered. I think it would be better to weight the neighbors with the number of visits of each neighbor (i.e. the weights in the pair returned by random_walk_distribution_topt). Essentially, the weights assigned to the "padding nodes" would become 0. This in turn means that the nodes we pick to pad the neighbors do not matter at all, meaning that topk() would work just fine.
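A minimal sketch of why zero weights neutralize the padded entries in a weighted aggregation (the shapes and feature tensor here are purely illustrative, not part of the example code):

import torch

# Illustrative shapes: 1 seed node, top_T = 4 slots, feature dimension 3.
# Suppose the seed truly has only 2 neighbors; topk fills the remaining slots with
# arbitrary nodes whose visit counts (hence weights) are 0.
weights = torch.tensor([[0.7, 0.3, 0.0, 0.0]])        # (num_seeds, top_T), rows sum to 1
nb_feats = torch.randn(1, 4, 3)                        # (num_seeds, top_T, feature_dim)

# Weighted sum over the neighbor axis: padded neighbors are multiplied by 0
# and drop out of the aggregation entirely.
agg = (weights.unsqueeze(-1) * nb_feats).sum(dim=1)    # (num_seeds, feature_dim)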
A 0 weight is fine. The real problem is that these irrelevant nodes are also added to nodeset, which means they will participate in the next round of random walk. In my case, these irrelevant nodes can be nodes that have no successors, which causes the no successors from vertex xxxx error.
def random_walk_nodeflow(G, nodeset, num_items, n_layers, restart_prob, max_nodes, top_T):
    '''
    Returns a list of triplets (
        "active" node IDs whose embeddings are computed at the i-th layer (num_nodes,),
        weight of each neighboring node of each "active" node on the i-th layer (num_nodes, top_T),
        neighboring node IDs for each "active" node on the i-th layer (num_nodes, top_T)
    )
    '''
    dev = nodeset.device
    nodeset = nodeset.cpu()
    nodeflow = []
    cur_nodeset = nodeset
    for i in reversed(range(n_layers)):
        # Find the top_T most important neighbors of the current frontier by random walk.
        nb_weights, nb_nodes = random_walk_distribution_topt(G, cur_nodeset, num_items, restart_prob, max_nodes, top_T)
        nodeflow.insert(0, (cur_nodeset.to(dev), nb_weights.to(dev), nb_nodes.to(dev)))
        # The sampled neighbors become the seeds for the next (shallower) layer.
        cur_nodeset = torch.cat([nb_nodes.view(-1), cur_nodeset]).unique()
    return nodeflow
I see. I guess you need to change this statement
cur_nodeset = torch.cat([nb_nodes.view(-1), cur_nodeset]).unique()
to
nb_nodes = nb_nodes.view(-1)
nb_weights = nb_weights.view(-1)
cur_nodeset = torch.cat([nb_nodes[nb_weights > 0], cur_nodeset]).unique()
Essentially, the nodes with 0 weights shouldn't take part in the next round anyway.
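Applied in context, the loop inside random_walk_nodeflow would then look roughly like this (a sketch of the modification only; the rest of the function stays unchanged):

for i in reversed(range(n_layers)):
    nb_weights, nb_nodes = random_walk_distribution_topt(G, cur_nodeset, num_items, restart_prob, max_nodes, top_T)
    nodeflow.insert(0, (cur_nodeset.to(dev), nb_weights.to(dev), nb_nodes.to(dev)))
    # Keep only neighbors that were actually visited (weight > 0) as seeds for the next
    # layer, so padded, never-visited nodes cannot trigger the "no successors from vertex" error.
    flat_nodes = nb_nodes.view(-1)
    flat_weights = nb_weights.view(-1)
    cur_nodeset = torch.cat([flat_nodes[flat_weights > 0], cur_nodeset]).unique()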
@BarclayII Thanks a lot!
Closed due to inactivity. Please feel free to reopen this if you still have any questions.
I am using DGL to implement a graph neural network working on a bipartite graph. To perform GraphConv, I need to find top_T=20 important neighbors of each node. However, in my case, the graph data is really sparse and it's hard to obtain that many neighbors. I want to increase the number of traces found for each node, but I don't know how to achieve this because there are only two parameters I can change, i.e., restart_prob and max_nodes_per_seed in the bipartite_single_sided_random_walk_with_restart() function. How can I control the number of traces and guarantee that the number of valid visited nodes is bigger than top_T (so nodes that haven't been visited won't be picked as neighbors)?