facebookresearch / PyTorch-BigGraph

Generate embeddings from large-scale graph-structured data.
https://torchbiggraph.readthedocs.io/
Other
3.38k stars 449 forks source link

How does PBG evaluate unseen nodes and edges in the test set? #226

Open Yuan-Meng opened 3 years ago

Yuan-Meng commented 3 years ago

This may be a naive question but I wonder how PBG can perform evaluation on the test set when certain nodes might have never appeared in the training set. Does PGB skip such nodes and edges formed by them, or does it have a mechanism for inferring embeddings of unseen nodes? If so, what's the basis for this kind of inductive inference? And if that's the case, can we use a trained PGB model to predict new nodes in future data?

I read the PyTorch-BigGraph paper but don't remember seeing any details on this. Any answers or pointers would be highly appreciated!

lw commented 3 years ago

It's quite naive: all embeddings, for all nodes, are initialized randomly at the beginning (using a multidimensional normal distribution), and if they are never "touched" during training (because no edge uses them), then these nodes will keep having that random embedding forever.

Note that however if your graph is so sparse that this scenario is a legitimate concern, perhaps you should reconsider how you use PBG. I don't remember exactly now (it's been a while) but I think our rule of thumb was that any node with fewer than 50 incident edges risks being trained poorly. You may be better off in fact by removing these nodes before training (by implementing a custom preprocessing step) and then try to infer an embedding for them after training (e.g., the average of their few neighbors).

Yuan-Meng commented 3 years ago

Thank you for your answer! It's really helpful! I have a follow-up question: What's the default behavior when PBG performs evaluation on the test set? Does it infer embeddings of unseen nodes like you described (e.g., using average of neighbors) or use randomly initialized embeddings? In the latter case, I wonder if there's an existing code example for inferring embeddings efficiently (e.g., searching through the edge list again to identity neighbors seems super slow for a large graph?). Thanks!