acbull / pyHGT

Code for "Heterogeneous Graph Transformer" (WWW'20), which is based on pytorch_geometric
MIT License

Question: Why Use metapath2vec for Input Features? #17

Closed AlexMRuch closed 3 years ago

AlexMRuch commented 3 years ago

While reading through your fascinating paper, I noticed that you all do a huge amount of work initializing the input features. For example, you noted that "For the field, venue, and institute nodes, we use the metapath2vec model [3] to train their node embeddings by reflecting the heterogeneous network structures."

Having worked with metapath2vec and knowledge graphs quite a bit myself, I know this must have taken a good deal of time and quite a bit of RAM. It confused me to see this kind of processing in the paper, given that you said the HGT model should learn metapaths itself. I was expecting the HGT model to learn these feature representations without requiring all that work up front.

So my question is this: why bother with these steps? Was it simply to speed up training? (The same question applies abstractly to why you used XLNet for papers.)

Many thanks in advance!

Best, Alex

acbull commented 3 years ago

Hi Alex:

Thanks for your interest in our work.

Regarding the input features for training a GNN, or more broadly any NN, there is still no single best way that handles every situation. But I can share some simple intuitions for choosing input features for a GNN:

(1) For node types that may emerge in the testing data, like new papers or new authors, we should choose "inductive" features, for example text, degree, or other attributes. The reason we use XLNet as a feature extractor is that people have shown that a powerful pre-trained contextualized embedding model like BERT or XLNet can already capture the linguistic and semantic meaning of the text, so our GNN model can focus on the interactions between nodes instead of learning all the linguistic knowledge from scratch. (Previously, people leveraged shallow word embeddings to solve this task.) A more ideal way would be to also fine-tune the XLNet model in an end-to-end manner, which is very common in NLP tasks, but due to limited computational resources we just treat XLNet as a fixed feature extractor. (A rough sketch of this step is included after point (2).)

(2) For node types that are always in the graph, like conferences or topics, we can instead assign a learnable embedding and let it be trained end-to-end. I've done such an experiment, and on some small graphs it gives better results than a fixed input vector. But for the Microsoft Academic Graph, the numbers of topics and conferences are still very large, so, again due to computational resources, we made a trade-off and assigned them an initial vector learned by a shallow embedding technique (i.e., metapath2vec). I'd still highly recommend that other people try learning an embedding from scratch for these node types. (Both options are sketched below.)
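Here is a minimal sketch of point (1): treating a pretrained XLNet as a frozen feature extractor. It assumes the HuggingFace transformers package; the mean-pooling over tokens is just one reasonable way to turn token states into a single node feature, not necessarily the exact setup from the paper.

```python
import torch
from transformers import XLNetModel, XLNetTokenizer

# Pretrained XLNet used as a frozen feature extractor (no fine-tuning).
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
encoder = XLNetModel.from_pretrained("xlnet-base-cased").eval()

@torch.no_grad()
def paper_features(titles):
    """One fixed-size vector per paper title: mean-pool the token states."""
    batch = tokenizer(titles, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # [batch, seq_len, 768]
    mask = batch["attention_mask"].unsqueeze(-1).float()   # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

x_paper = paper_features([
    "Heterogeneous Graph Transformer",
    "metapath2vec: Scalable Representation Learning for Heterogeneous Networks",
])
print(x_paper.shape)  # torch.Size([2, 768])
```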
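And a minimal sketch of point (2): an input layer for always-present node types that can either be learned end-to-end or initialized from fixed metapath2vec vectors. The class name and dimensions below are illustrative, not code from this repo.

```python
import torch
import torch.nn as nn

class NodeTypeInput(nn.Module):
    """Input features for node types that always stay in the graph (e.g. venues,
    fields): either learn the embedding end-to-end, or start from fixed
    metapath2vec vectors (the trade-off described above)."""

    def __init__(self, num_nodes, dim, pretrained=None, freeze=True):
        super().__init__()
        if pretrained is None:
            # Learn from scratch, jointly with the GNN.
            self.emb = nn.Embedding(num_nodes, dim)
        else:
            # Initialize from shallow (metapath2vec) vectors; freeze to treat
            # them as fixed inputs, or set freeze=False to fine-tune them.
            self.emb = nn.Embedding.from_pretrained(pretrained, freeze=freeze)

    def forward(self, node_ids):
        return self.emb(node_ids)

# End-to-end learnable embeddings for 500 venues with dimension 400:
venue_input = NodeTypeInput(num_nodes=500, dim=400)
x_venue = venue_input(torch.arange(500))
print(x_venue.shape)  # torch.Size([500, 400])
```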

AlexMRuch commented 3 years ago

Yeah, I can see that point. In my own work using graph and language models jointly to predict events (https://iopscience.iop.org/article/10.1088/2632-072X/aba83d) I used metapath2vec and doc2vec for input features for authors, submissions, and subreddits; however, I can see your point that if you can use the metapath2vec embeddings to initialize the node features for the HGT model, it will help speed up training. (I also got excited about the possibility of avoiding having to code metapaths, in part because in networks like Reddit they can be quite complicated depending on what one wants to emphasize in the network.)

I think my major question/concern was just that running the sampling for metapath2vec can take forever, so I wasn't sure whether the time it takes to do the sampling and then run the embedding helps the HGT model train enough faster to offset those costs. I saw in another post you linked to the original C++ code, and I know elsewhere people have improved the sampling to be 4-16 times faster, but even on my network of 35.5 million nodes and 190 million edges it took a while (and the sample file for the online algorithm was ~50GB+). Again, I can certainly see it helping; at the very least it avoids a cold-start problem, which is nice.
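For anyone facing the same sampling cost: here is a rough sketch of keeping the walk sampling and embedding entirely in memory with torch_geometric's MetaPath2Vec, so no giant walk file ever gets written to disk. The toy graph, metapath, and hyperparameters are purely illustrative and not the setup from the paper.

```python
import torch
from torch_geometric.nn import MetaPath2Vec

# Toy heterogeneous graph (3 authors, 4 papers, 2 venues); in practice this
# comes from your own data as {(src_type, relation, dst_type): edge_index}.
edge_index_dict = {
    ("author", "writes", "paper"):
        torch.tensor([[0, 0, 1, 2, 2], [0, 1, 1, 2, 3]]),
    ("paper", "written_by", "author"):
        torch.tensor([[0, 1, 1, 2, 3], [0, 0, 1, 2, 2]]),
    ("paper", "published_in", "venue"):
        torch.tensor([[0, 1, 2, 3], [0, 0, 1, 1]]),
    ("venue", "publishes", "paper"):
        torch.tensor([[0, 0, 1, 1], [0, 1, 2, 3]]),
}

# Illustrative author-paper-venue-paper-author metapath (a cycle, so walks can
# be longer than the metapath itself).
metapath = [
    ("author", "writes", "paper"),
    ("paper", "published_in", "venue"),
    ("venue", "publishes", "paper"),
    ("paper", "written_by", "author"),
]

model = MetaPath2Vec(
    edge_index_dict, embedding_dim=128, metapath=metapath,
    walk_length=20, context_size=5, walks_per_node=10,
    num_negative_samples=5, sparse=True,
)
loader = model.loader(batch_size=32, shuffle=True)  # walks are sampled per batch
optimizer = torch.optim.SparseAdam(list(model.parameters()), lr=0.01)

model.train()
for epoch in range(5):
    for pos_rw, neg_rw in loader:
        optimizer.zero_grad()
        loss = model.loss(pos_rw, neg_rw)
        loss.backward()
        optimizer.step()

venue_features = model("venue").detach()  # fixed input vectors, e.g. for HGT
print(venue_features.shape)               # torch.Size([2, 128])
```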

On a somewhat unrelated note, do you all have any ideas on how the model could be updated to make inductive predictions on previously unseen nodes (like GraphSAGE/PinSAGE)? I saw you had inductive modeling for time, but didn't see anything in the aggregation that implied inductive learning for new nodes. If that were possible, this would be extremely awesome. I'm presently working on a project and am trying to decide between HGT and PinSAGE because I want it to include inductive learning for time (HGT) as well as new nodes (PinSAGE), but I don't know if someone has figured out how to do both yet.

Thanks again for your time! So nice chatting with someone working on networks and language!

acbull commented 3 years ago

Our method does support inductive prediction. For any new node, you can just run the sampling to get its neighborhood and calculate its embedding with the trained HGT. (Actually, every GNN model that uses fixed input attributes should support inductive prediction.)
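For illustration, here is a rough sketch of that idea using torch_geometric's HGTConv (a re-implementation of the same layer, not the exact classes in this repo). The toy subgraph below stands in for whatever your neighborhood sampler returns around the new node, and the untrained layer stands in for a trained model.

```python
import torch
from torch_geometric.data import HeteroData
from torch_geometric.nn import HGTConv

# The layer's parameters depend only on node/edge *types*, and the inputs are
# "inductive" features (e.g. text embeddings), so the same trained weights can
# be applied to nodes never seen during training.
metadata = (
    ["paper", "venue"],
    [("paper", "published_in", "venue"), ("venue", "publishes", "paper")],
)
conv = HGTConv(in_channels=64, out_channels=32, metadata=metadata, heads=2)
# (In practice, load the trained weights here, e.g. conv.load_state_dict(...).)

# Toy sampled neighborhood around a brand-new paper (node 0 of type "paper").
sub = HeteroData()
sub["paper"].x = torch.randn(3, 64)   # the new paper plus two sampled neighbors
sub["venue"].x = torch.randn(1, 64)
sub["paper", "published_in", "venue"].edge_index = torch.tensor([[0, 1, 2],
                                                                 [0, 0, 0]])
sub["venue", "publishes", "paper"].edge_index = torch.tensor([[0, 0, 0],
                                                              [0, 1, 2]])

conv.eval()
with torch.no_grad():
    out = conv(sub.x_dict, sub.edge_index_dict)

new_paper_embedding = out["paper"][0]  # embedding of the previously unseen paper
print(new_paper_embedding.shape)       # torch.Size([32])
```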
