dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

Questions about random-walk on bipartite graph #530

Closed: familyld closed this issue 5 years ago

familyld commented 5 years ago

I am using DGL to implement a graph neural network that works on a bipartite graph. To perform GraphConv, I need to find the top_T=20 most important neighbors of each node. However, in my case the graph data is really sparse and it's hard to obtain that many neighbors. I want to increase the number of traces found for each node, but I don't know how to achieve this because there are only two parameters I can change, i.e., restart_prob and max_nodes_per_seed in the bipartite_single_sided_random_walk_with_restart() function. How can I control the number of traces and guarantee that the number of valid visited nodes is larger than top_T (so that nodes that haven't been visited won't be picked as neighbors)?

def bipartite_single_sided_random_walk_with_restart(
        g, seeds, restart_prob, max_nodes_per_seed,
        max_visit_counts=0, max_frequent_visited_nodes=0):
    """Batch-generate random walk traces on the given graph with restart probability.
    The graph must be a bipartite graph.
    A single random walk step involves two normal steps, so that the "visited"
    nodes always stay on the same side.
    Parameters
    ----------
    g : DGLGraph
        The graph.
    seeds : Tensor
        The node ID tensor from which the random walk traces start.
    restart_prob : float
        Probability of stopping the random walk after each step.
    max_nodes_per_seed : int
        Stop generating traces for a seed once the total number of nodes
        visited exceeds this number.
    max_visit_counts : int, optional
    max_frequent_visited_nodes : int, optional
        Alternatively, stop generating traces for a seed once at least
        ``max_frequent_visited_nodes`` nodes have been visited at least
        ``max_visit_counts`` times each.
    Returns
    -------
    traces : list[list[Tensor]]
        traces[i][j] is the j-th trace generated for the i-th seed.
    """

familyld commented 5 years ago

@BarclayII Do you have any ideas?

BarclayII commented 5 years ago

The current random walk with restart is entirely based on Pixie (and PinSage), where you can only tune either the maximum number of nodes visited or the threshold of frequently visited nodes. So no, there is currently no way to directly control the number of traces generated for each seed.

You can probably try relaxing max_nodes_per_seed to a bigger number (greater than 20) and then take the top 20 most important nodes as neighbors.

familyld commented 5 years ago

@BarclayII I tried, but it doesn't help. The number of traces and the length of each trace both depend on the restart probability. Even if you set a large max_nodes_per_seed, you can't get that many nodes in a trace if the restart probability is not low enough. However, if the restart probability is low, the number of traces decreases too. That's why this problem feels so hopeless; it's like a paradox.

BarclayII commented 5 years ago

The number of traces also depends on max_nodes_per_seed, so you could try lowering the restart probability and raising max_nodes_per_seed together. Roughly speaking, the length of a single trace is geometric with mean 1/restart_prob, so a visit budget of max_nodes_per_seed yields on the order of max_nodes_per_seed * restart_prob traces: lowering one while raising the other keeps the trace count up.
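As a quick sanity check on that trade-off, here is a toy simulation; it ignores graph structure entirely and only models the stopping rule described in the docstring:

import random

def simulate_num_traces(restart_prob, max_nodes_per_seed):
    # Toy model of the stopping rule only: every step visits one node,
    # and the walk restarts with probability restart_prob after each step.
    visited, traces = 0, 0
    while visited < max_nodes_per_seed:
        traces += 1
        while visited < max_nodes_per_seed:
            visited += 1
            if random.random() < restart_prob:
                break
    return traces

# Trace length is roughly geometric with mean 1/restart_prob, so both
# settings below yield about max_nodes_per_seed * restart_prob = 20 traces.
for p, n in [(0.5, 40), (0.1, 200)]:
    runs = [simulate_num_traces(p, n) for _ in range(1000)]
    print(p, n, sum(runs) / len(runs))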

Alternatively, you can set max_nodes_per_seed to a really big number, and additionally set max_frequent_visited_nodes = 20 and max_visit_counts to some number greater than 1.
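For example, a sketch of that setting; the numbers here are illustrative, and I'm assuming the function is exposed under dgl.contrib.sampling like the other random walk helpers:

import dgl.contrib.sampling as sampling

traces = sampling.bipartite_single_sided_random_walk_with_restart(
    g, seeds,
    restart_prob=0.5,
    max_nodes_per_seed=100000,       # effectively unbounded
    max_visit_counts=2,              # a node is "frequent" after 2 visits
    max_frequent_visited_nodes=20)   # stop once 20 nodes are frequent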

If you only care about selecting the nodes with the K-highest personalized PageRank (which is traditionally computed by random walk with restart), then you can instead pre-compute the personalized PageRank scores for all the nodes and construct a new graph, where an edge connects u to v if PPR_u(v) > 0.001 (or if (u, v) is among the K-highest PPR). I wrote an example gist for this purpose with numba, so it should be quite efficient.
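Not the gist itself, but a minimal dense sketch of the idea (plain power iteration; for the bipartite single-sided case you would substitute the two-step transition matrix, and the numba version scales much better):

import numpy as np
import scipy.sparse as sp

def topk_ppr_edges(adj, restart_prob=0.5, n_iter=50, k=20):
    # adj: (n, n) scipy.sparse adjacency matrix.
    n = adj.shape[0]
    deg = np.asarray(adj.sum(axis=0)).ravel()
    P = adj @ sp.diags(1.0 / np.maximum(deg, 1))   # column-stochastic transition
    R = np.eye(n)                                  # column j: PPR vector of seed j
    for _ in range(n_iter):
        R = restart_prob * np.eye(n) + (1 - restart_prob) * (P @ R)
    src, dst = [], []
    for j in range(n):
        for i in np.argsort(R[:, j])[-k:]:         # K-highest PPR scores of seed j
            src.append(j)
            dst.append(int(i))
    return src, dst   # connect u=src to v=dst in the new graph

Thresholding the scores (e.g. R[:, j] > 0.001) instead of taking the top-k works the same way.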

familyld commented 5 years ago

@BarclayII Thanks a lot. I will try tuning these parameters.

familyld commented 5 years ago

Hi, @BarclayII. Is there an efficient way to choose a variable number of neighbors for each node? Sorry to bother you, but I really don't know how to modify this code:

import torch

def random_walk_distribution(G, nodeset, num_items, restart_prob, max_nodes):
    n_nodes = nodeset.shape[0]
    n_available_nodes = num_items
    # random_walk_sampler is defined elsewhere; it returns the traces per seed.
    traces = random_walk_sampler(G, nodeset, restart_prob, max_nodes)
    # Count how many times each node was visited by the traces of each seed.
    visited_counts = torch.zeros(n_nodes, n_available_nodes)
    for i in range(n_nodes):
        visited_nodes = torch.cat(traces[i])
        visited_counts[i].scatter_add_(0, visited_nodes, torch.ones_like(visited_nodes, dtype=torch.float32))
    return visited_counts

def random_walk_distribution_topt(G, nodeset, num_items, restart_prob, max_nodes, top_T):
    '''
    Returns the top_T most important neighbors of each node in nodeset,
    as well as the weights of those neighbors.
    '''
    visited_prob = random_walk_distribution(G, nodeset, num_items, restart_prob, max_nodes)
    weights, nodes = visited_prob.topk(top_T, 1)
    weights = weights / weights.sum(1, keepdim=True)  # normalize visit counts into weights
    return weights, nodes

I still want to preserve the weight of each neighbor and filter out those with small weights. But in my case some nodes have only 1 or 2 neighbors, and using topk causes unexpected results, i.e., it picks arbitrary nodes as neighbors just to fill the k slots. How can I avoid this issue?

Thank you!

BarclayII commented 5 years ago

> But in my case some nodes have only 1 or 2 neighbors, and using topk causes unexpected results,

Why would a node only have 1 or 2 neighbors? Does that mean the node is in a connected component with only three nodes?

familyld commented 5 years ago

> But in my case some nodes have only 1 or 2 neighbors, and using topk causes unexpected results,
>
> Why would a node only have 1 or 2 neighbors? Does that mean the node is in a connected component with only three nodes?

Yes.

BarclayII commented 5 years ago

> But in my case some nodes have only 1 or 2 neighbors, and using topk causes unexpected results,
>
> Why would a node only have 1 or 2 neighbors? Does that mean the node is in a connected component with only three nodes?
>
> Yes.

In this case, it's probably not a good idea to randomly pad the neighbors and gather their information, since the "padding nodes" would be totally irrelevant to the node being considered. I think it would be better to weight the neighbors by their number of visits (i.e. the weights in the pair returned by random_walk_distribution_topt). Essentially, the weights assigned to the "padding nodes" would become 0. This in turn means that which nodes we pick to pad the neighbors does not matter at all, so topk() would work just fine.
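A toy illustration of that point (made-up visit counts; the padding entries end up with weight 0 after normalization):

import torch

# A node with only two real neighbors: all other visit counts are 0.
visit_counts = torch.tensor([5., 3., 0., 0., 0., 0.])
weights, nodes = visit_counts.topk(4)   # ask for 4 "neighbors"
weights = weights / weights.sum()
print(nodes)     # two real neighbors plus two arbitrary padding indices
print(weights)   # tensor([0.6250, 0.3750, 0.0000, 0.0000])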

familyld commented 5 years ago

> In this case, it's probably not a good idea to randomly pad the neighbors and gather their information, since the "padding nodes" would be totally irrelevant to the node being considered. I think it would be better to weight the neighbors by their number of visits (i.e. the weights in the pair returned by random_walk_distribution_topt). Essentially, the weights assigned to the "padding nodes" would become 0. This in turn means that which nodes we pick to pad the neighbors does not matter at all, so topk() would work just fine.

A weight of 0 is fine. The real problem is that these irrelevant nodes are also added to nodeset, which means they participate in the next round of random walk. In my case, these irrelevant nodes can be nodes that have no successors, which causes the no successors from vertex xxxx error.

def random_walk_nodeflow(G, nodeset, num_items, n_layers, restart_prob, max_nodes, top_T):
    '''
    Returns a list of triplets (
        "active" node IDs whose embeddings are computed at the i-th layer (num_nodes,),
        weight of each neighboring node of each "active" node on the i-th layer (num_nodes, top_T),
        neighboring node IDs for each "active" node on the i-th layer (num_nodes, top_T)
    )
    '''
    dev = nodeset.device
    nodeset = nodeset.cpu()
    nodeflow = []
    cur_nodeset = nodeset
    for _ in range(n_layers):
        nb_weights, nb_nodes = random_walk_distribution_topt(G, cur_nodeset, num_items, restart_prob, max_nodes, top_T)
        # Layers are built from the output layer backwards, hence insert(0, ...).
        nodeflow.insert(0, (cur_nodeset.to(dev), nb_weights.to(dev), nb_nodes.to(dev)))
        # Expand the frontier with the sampled neighbors for the next layer.
        cur_nodeset = torch.cat([nb_nodes.view(-1), cur_nodeset]).unique()

    return nodeflow

BarclayII commented 5 years ago

I see. I guess you need to change this statement

cur_nodeset = torch.cat([nb_nodes.view(-1), cur_nodeset]).unique()

to

nb_nodes = nb_nodes.view(-1)
nb_weights = nb_weights.view(-1)
cur_nodeset = torch.cat([nb_nodes[nb_weights > 0], cur_nodeset]).unique()

Essentially, the nodes with 0 weights shouldn't take part in the next round anyway.
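Inside random_walk_nodeflow above, the patched loop body would then look like this:

for _ in range(n_layers):
    nb_weights, nb_nodes = random_walk_distribution_topt(
        G, cur_nodeset, num_items, restart_prob, max_nodes, top_T)
    nodeflow.insert(0, (cur_nodeset.to(dev), nb_weights.to(dev), nb_nodes.to(dev)))
    # Drop zero-weight (padding) neighbors before the next round of random walk.
    flat_nodes, flat_weights = nb_nodes.view(-1), nb_weights.view(-1)
    cur_nodeset = torch.cat([flat_nodes[flat_weights > 0], cur_nodeset]).unique()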

familyld commented 5 years ago

@BarclayII Thanks a lot!

VoVAllen commented 5 years ago

Closing due to inactivity. Please feel free to reopen this if you still have any questions.