cynricfu / MAGNN

Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding

Refactoring the code #9

Closed xiaoyuxin1002 closed 3 years ago

xiaoyuxin1002 commented 3 years ago

Hi, I wish to refactor the MAGNN code for my research project. However, I found that the link prediction and node classification tasks use different models (MAGNN_lp, MAGNN_nc, MAGNN_nc_mb) in your implementation. If I wish to produce a generic MAGNN model in an unsupervised setting that supports multi-layer, mini-batch training over more than one type of edge (instead of just user-artist in LastFM), how could I do that based on your implementation? Could you please give me some directions? More specifically, why did you set use_minibatch=False in MAGNN_nc? What will happen if I simply change it to True? Thanks!

cynricfu commented 3 years ago

My code is not very well designed for the general-purpose setting. You may notice that it requires pre-processing the heterogeneous graph to generate (and sample) metapath instances beforehand. If you want to make better use of the dgl.heterograph API and generate the metapath instances online, you will need an efficient algorithm for that.
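
For illustration, a minimal sketch of what online metapath-instance generation could look like, assuming the heterogeneous graph is stored as plain adjacency dicts keyed by (source type, destination type); gen_metapath_instances and the adj layout are hypothetical, not part of this repo:

```python
# Hypothetical sketch of generating metapath instances online; none of
# these names come from the MAGNN repo.
def gen_metapath_instances(start_nodes, metapath, adj):
    """Enumerate instances of `metapath` (e.g. ['A', 'P', 'A']) starting
    from `start_nodes`. `adj[(src_type, dst_type)][u]` lists the
    dst-type neighbors of node u."""
    instances = [[v] for v in start_nodes]
    for src_type, dst_type in zip(metapath[:-1], metapath[1:]):
        expanded = []
        for inst in instances:
            for nbr in adj[(src_type, dst_type)].get(inst[-1], []):
                expanded.append(inst + [nbr])
        instances = expanded
    return instances

# Example: author-paper-author instances starting from author 0.
adj = {
    ('A', 'P'): {0: [10, 11], 1: [10]},
    ('P', 'A'): {10: [0, 1], 11: [0]},
}
print(gen_metapath_instances([0], ['A', 'P', 'A'], adj))
# [[0, 10, 0], [0, 10, 1], [0, 11, 0]]
```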

MAGNN_nc is for whole-graph training: each training step is an epoch corresponding to one pass over the entire graph. That is why I set use_minibatch=False in MAGNN_nc.

Basically, MAGNN_ctr_ntype_specific with use_minibatch=False is a module that computes on the input graphs and returns the embeddings of all nodes. With use_minibatch=True it performs the same computation; the only difference is that it returns only the embeddings of the nodes indexed by target_idx.
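
In other words, the flag only changes which rows of the computed embedding matrix are returned. A toy sketch of the idea (ToyLayer is hypothetical, not the actual module):

```python
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    def __init__(self, in_dim, out_dim, use_minibatch=False):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.use_minibatch = use_minibatch

    def forward(self, feats, target_idx=None):
        h = self.fc(feats)          # compute embeddings for ALL input nodes
        if self.use_minibatch:
            return h[target_idx]    # mini-batch mode: return only the targets
        return h                    # whole-graph mode: return everything
```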

The term use_minibatch may be a bit confusing here.

xiaoyuxin1002 commented 3 years ago

Thank you for the quick reply! Does that mean that if I change use_minibatch to True in MAGNN_nc and pass in the corresponding target_idx, it will support multi-layer mini-batch training?

cynricfu commented 3 years ago

You can do that, but you still need to construct the GNN training graph on your own. For example, given a chain graph 1-2-3-4, suppose we want to obtain the embedding of node 1 by applying a 2-layer GNN. You first feed the subgraph 1-2-3 with target_idx=[1, 2] to the first layer to obtain the embeddings of nodes 1 and 2, and then feed the subgraph 1-2 with target_idx=[1] to the second layer to obtain the final embedding of node 1.
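
A runnable toy version of this two-step scheme, where gnn_layer is a stand-in mean-aggregator rather than anything from this repo:

```python
import torch

# Toy stand-in for a GNN layer: mean-aggregate the features of in-neighbors.
# gnn_layer, edges1, and feats are all illustrative, not repo code.
def gnn_layer(edges, feats, target_idx):
    out = []
    for t in target_idx:
        nbrs = [u for u, v in edges if v == t]   # in-neighbors of target t
        out.append(feats[nbrs].mean(dim=0))
    return torch.stack(out)

feats = torch.randn(5, 8)                        # features for nodes 0..4

# layer 1: subgraph 1-2-3 with target_idx=[1, 2]
edges1 = [(2, 1), (1, 2), (3, 2)]                # edges pointing TO the targets
h1 = gnn_layer(edges1, feats, [1, 2])

# layer 2: subgraph 1-2 with target_idx=[1], fed with the layer-1 outputs
feats2 = feats.clone()
feats2[1], feats2[2] = h1[0], h1[1]
h_final = gnn_layer([(2, 1)], feats2, [1])       # final embedding of node 1
```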

xiaoyuxin1002 commented 3 years ago

So essentially, each layer requires its own training graph and target_idx?

cynricfu commented 3 years ago

Yes. This is purely for computational efficiency. Theoretically, you can always pass in the entire graph and only specify a few nodes as your targets, but that would require a very large amount of GPU memory and waste a lot of computation. The critical question is whether you have an efficient algorithm to construct the training graphs online for the given target nodes and number of layers.
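
One way to do this, sketched under the assumption of a homogeneous edge list (build_training_graphs is hypothetical, not from the repo), is a reverse breadth-first expansion from the targets, one hop per layer:

```python
# Expand one hop per layer starting from the targets, keeping only the
# edges each layer actually needs.
def build_training_graphs(edges, targets, num_layers):
    in_nbrs = {}
    for u, v in edges:
        in_nbrs.setdefault(v, set()).add(u)

    # frontier[k] = nodes whose embeddings layer k must output
    frontier = [set(targets)]
    for _ in range(num_layers - 1):
        nxt = set(frontier[-1])
        for v in frontier[-1]:
            nxt |= in_nbrs.get(v, set())
        frontier.append(nxt)

    graphs = []
    for tgts in reversed(frontier):              # deepest (input) layer first
        sub = [(u, v) for v in tgts for u in in_nbrs.get(v, set())]
        graphs.append((sub, sorted(tgts)))
    return graphs

# Chain 1-2-3-4 (both directions), 2-layer GNN, target node 1:
edges = [(1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3)]
for sub, tgts in build_training_graphs(edges, [1], num_layers=2):
    print(tgts, sub)
# [1, 2] [(2, 1), (1, 2), (3, 2)]   <- layer 1: subgraph 1-2-3 (order may vary)
# [1]    [(2, 1)]                   <- layer 2: subgraph 1-2
```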

xiaoyuxin1002 commented 3 years ago

Sorry, as I am refactoring your code, could you also briefly explain how parse_minibatch and parse_adjlist work in tools.py? Thanks!

cynricfu commented 3 years ago

parse_adjlist

https://github.com/cynricfu/MAGNN/blob/b8557f58ae04a7fe3e7a9fde64ea87c81b331efe/utils/tools.py#L68

The elements of adjlist and the elements of edge_metapath_indices correspond one-to-one. Each line of adjlist looks like "0 0 1 1 2 3", indicating 5 (directed) node pairs pointing from node 0 to nodes 0, 1, 1, 2, and 3 (duplicates are allowed because two nodes can be connected by different metapath instances). Each element of edge_metapath_indices is the list of metapath instances starting from a specific node, corresponding to one line of adjlist. For example, it could be: [[0, 555, 0], [0, 556, 1], [0, 700, 1], [0, 600, 2], [0, 650, 3]]

Note that the node indices in adjlist and edge_metapath_indices are not consistent: indices in adjlist are local to the specific node type, while indices in edge_metapath_indices are global across all nodes. So parse_adjlist basically just parses the adjlist strings, associates each node pair with its corresponding metapath instances, and does some sampling (if applicable).
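
To make the format concrete, here is a simplified, illustrative version of that parsing logic; parse_adjlist_line is hypothetical, and the real parse_adjlist in utils/tools.py does more than this sketch:

```python
import numpy as np

def parse_adjlist_line(line, metapath_indices, samples=None):
    """line: e.g. "0 0 1 1 2 3" (source node followed by its neighbors).
    metapath_indices: metapath instances aligned with the neighbors.
    Returns (edges, indices), where edges are (src, dst) pairs."""
    nodes = [int(x) for x in line.split()]
    src, nbrs = nodes[0], nodes[1:]
    indices = np.asarray(metapath_indices)
    if samples is not None and len(nbrs) > samples:
        picked = np.random.choice(len(nbrs), samples, replace=False)
        nbrs = [nbrs[i] for i in picked]
        indices = indices[picked]
    edges = [(src, dst) for dst in nbrs]
    return edges, indices

edges, idx = parse_adjlist_line(
    "0 0 1 1 2 3",
    [[0, 555, 0], [0, 556, 1], [0, 700, 1], [0, 600, 2], [0, 650, 3]])
print(edges)  # [(0, 0), (0, 1), (0, 1), (0, 2), (0, 3)]
```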

parse_minibatch

https://github.com/cynricfu/MAGNN/blob/b8557f58ae04a7fe3e7a9fde64ea87c81b331efe/utils/tools.py#L104

adjlists and edge_metapath_indices_list are similar to the above, except that here they hold the entire data of a specific node type (e.g., author). idx_batch is the set of indices of the nodes whose embeddings we want to obtain; in this function's implementation, they must all be of the same node type. What parse_minibatch does is construct the GNN subgraphs for this set of nodes.

https://github.com/cynricfu/MAGNN/blob/b8557f58ae04a7fe3e7a9fde64ea87c81b331efe/utils/tools.py#L108

Each iteration here handles one metapath (e.g., author-paper-author) starting and ending with that node type.

https://github.com/cynricfu/MAGNN/blob/b8557f58ae04a7fe3e7a9fde64ea87c81b331efe/utils/tools.py#L116

Here I reverse the order of each node pair because DGL's default behavior on directed edges is to aggregate from source nodes to target nodes. Remember that the node pairs obtained from adjlist point "from some specific nodes"; we want them reversed so that they point "to some specific nodes".
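
A minimal sketch of the reversal point, assuming DGL is installed; the pairs and sizes are made up, and this is not the repo's exact code:

```python
import dgl
import torch

# pairs as parsed from the adjlist: (target_node, neighbor)
pairs = [(0, 0), (0, 1), (0, 2), (0, 3)]

src = torch.tensor([nbr for tgt, nbr in pairs])   # messages flow from here...
dst = torch.tensor([tgt for tgt, nbr in pairs])   # ...to the target nodes
g = dgl.graph((src, dst), num_nodes=4)

# now message passing on g (e.g. g.update_all) aggregates the neighbors'
# features into node 0, as intended
```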

ZZy979 commented 3 years ago

I'm also refactoring the MAGNN code and have just finished a first version. You can take a look if you need it @xiaoyuxin1002 https://github.com/ZZy979/pytorch-tutorial/tree/master/gnn/magnn