Open BarclayII opened 2 years ago
Will we still need to make negative edges part of the graph?

cc @rudongyu @Ereboas If you have not checked this issue, it would be great if you could take a look and give some feedback.

> Will we still need to make negative edges part of the graph?

In this case no.
🚀 Feature
Support iterating over node pairs in DataLoader.
Motivation
Currently there are two ways to perform link prediction evaluation with given positive and negative samples:
One is to make the positive and negative examples part of the graph, iterate over the edges with `as_edge_prediction_sampler`, exclude them during sampling, and treat the evaluation as binary edge classification. This is quite complicated, and it becomes even more so when evaluating on heterogeneous graphs. It is also neither efficient nor scalable, since the edges have to be excluded for every sampling operation, and the validation edge IDs are given as a tensor during edge exclusion (which does not work for distributed training).
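To make the per-sample cost concrete, here is a toy sketch in plain Python (not real DGL calls; the graph layout and `sample_in_edges` helper are hypothetical) of a neighbor sampler that must filter out the held-out edges on every sampling call:

```python
# Illustrative sketch only: a toy neighbor sampler that filters out the
# held-out (validation) edges on EVERY sampling call. Names and data
# layout are hypothetical, not DGL's actual API.

# Toy graph as an edge list: edge ID -> (src, dst).
edges = {0: (0, 1), 1: (1, 2), 2: (2, 3), 3: (3, 0), 4: (0, 2)}

# Validation positives/negatives stored as edges of the same graph, so
# they must be excluded from message passing during sampling.
excluded_eids = {3, 4}

def sample_in_edges(dst_node, exclude):
    """Return the edge IDs incident to ``dst_node`` after exclusion.

    The exclusion check runs for every candidate edge on every call,
    which is the repeated overhead described above.
    """
    return [eid for eid, (u, v) in edges.items()
            if v == dst_node and eid not in exclude]

# Every minibatch repeats the exclusion work from scratch:
batch_seeds = [1, 2, 0]
sampled = {v: sample_in_edges(v, excluded_eids) for v in batch_seeds}
```

Passing the exclusion set as an in-memory collection on every call is also what makes the tensor-of-edge-IDs approach awkward for distributed training.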
Another is to evaluate link prediction by computing the node representations and then computing the scores from the incident node representations. Unfortunately this is not possible for subgraph-representation-based link prediction methods such as SEAL.
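The second workflow can be sketched as follows (a plain-Python dot-product scorer; the embeddings and node pairs are made up for illustration):

```python
# Illustrative sketch: score node pairs from precomputed node
# representations. This works for dot-product-style decoders, but not
# for subgraph-based methods like SEAL, which need the enclosing
# subgraph of each pair rather than two independent node vectors.

# Hypothetical 2-d node embeddings computed once over the full graph.
emb = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

def score(u, v):
    """Dot product of the two incident node representations."""
    return sum(a * b for a, b in zip(emb[u], emb[v]))

pos_pairs = [(0, 2), (1, 2)]
neg_pairs = [(0, 1)]
pos_scores = [score(u, v) for u, v in pos_pairs]
neg_scores = [score(u, v) for u, v in neg_pairs]
```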
Alternatives
If #4441 is implemented, one can create a single graph with all the training edges, positive validation edges, and negative validation edges. Then during training and validation, one creates a sampler that only samples on the training edges. One treats validation as evaluating binary edge classification.
This is, however, inconvenient for validating and testing on new edges during deployment, where the node pairs to predict may vary from time to time, because one needs to create a new graph every time the node pairs to predict change.
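Assuming #4441-style per-edge masks, the alternative could look roughly like this plain-Python sketch (the mask layout and `sample_train_in_edges` helper are assumptions, not a real DGL API):

```python
# Hypothetical sketch of the alternative: one graph holds the training
# edges plus the positive/negative validation edges, and a per-edge
# mask restricts sampling to training edges only.

# Edge ID -> (src, dst, is_train).
edges = {
    0: (0, 1, True),   # training edge
    1: (1, 2, True),   # training edge
    2: (2, 0, False),  # positive validation edge
    3: (0, 2, False),  # negative validation edge
}

def sample_train_in_edges(dst_node):
    """Sample only edges whose mask marks them as training edges."""
    return [eid for eid, (u, v, is_train) in edges.items()
            if v == dst_node and is_train]

# Changing the evaluation pairs means rebuilding ``edges`` -- the
# inconvenience noted above when the pairs to predict keep changing.
train_neighbors = {v: sample_train_in_edges(v) for v in [1, 2, 0]}
```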
Pitch
The user experience will look like the following:
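A mocked sketch of that user experience, in plain Python (no real DGL objects; `as_node_pair_prediction_sampler` does not exist yet, and this stub only mirrors the intended iteration pattern over node-pair minibatches):

```python
# Hypothetical UX sketch. The stub below stands in for the proposed
# wrapper; a real implementation would build message-passing blocks and
# a pair graph instead of echoing the minibatch.

def as_node_pair_prediction_sampler(base_sampler):
    """Wrap a base sampler so it iterates over (src, dst) node pairs."""
    def sample(node_pairs, indices):
        # Echo the minibatch so the loop below is concrete.
        return {"pairs": node_pairs, "seed_ids": indices}
    return sample

sampler = as_node_pair_prediction_sampler(base_sampler=None)

# Validation node pairs (an Nx2 tensor in the real proposal) and their
# indices into the full pair set, iterated in minibatches of 2.
val_pairs = [(0, 5), (1, 6), (2, 7), (3, 8)]
for start in range(0, len(val_pairs), 2):
    batch = val_pairs[start:start + 2]
    idx = list(range(start, min(start + 2, len(val_pairs))))
    minibatch = sampler(batch, idx)
    # model(minibatch) would go here.
```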
We need a sampler wrapper function `as_node_pair_prediction_sampler`, similar to `as_edge_prediction_sampler`. The signature will go as follows. I'm not sure if we need edge exclusion here; at least, edge exclusion seems unnecessary for link prediction validation, because the graph will not contain the validation edges anyway.
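One possible signature, modeled on `as_edge_prediction_sampler` (the parameter names here are assumptions, not a settled API):

```python
# Hypothetical signature sketch for ``as_node_pair_prediction_sampler``.
# Parameter names are assumptions, not a settled API.

def as_node_pair_prediction_sampler(sampler, prefetch_labels=None):
    """Wrap ``sampler`` to iterate over node pairs instead of edge IDs.

    No ``exclude`` argument for now: for validation the graph does not
    contain the validation edges, so edge exclusion may be unnecessary.
    """
    # Placeholder body; a real implementation would return a
    # NodePairPredictionSampler wrapping ``sampler``.
    return ("NodePairPredictionSampler", sampler, prefetch_labels)
```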
The function `as_node_pair_prediction_sampler` will create a `NodePairPredictionSampler` object, whose `sample` method will take in the graph as well as a pair of (1) an Nx2 tensor and (2) the indices into the entire node pair set (or a dict of edge types and Nx2 tensors). The indices will be used for prefetching labels and will be assigned to the pair graph via `pair_graph.edata[dgl.EID]`. The return value will be the same as that of `as_edge_prediction_sampler`.

The only issues I have with this UX are:

* `as_node_pair_prediction_sampler` will expect both a slice of the indices tensor and the slice indices themselves (for prefetching labels), which is different from node and edge sampling, where only a single indices tensor is expected.
* `pair_graph.edata[dgl.EID]` does not refer to the edge IDs in the original graph; it rather points to the indices of the given node pairs. This may cause confusion.

I'm not sure if we have a better UX for this.
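A plain-Python mock of the proposed `sample` behavior, including the EID caveat (the `PairGraph` class and `"_ID"` key are illustrative stand-ins; in DGL the indices would land in `pair_graph.edata[dgl.EID]`):

```python
# Mock of the proposed NodePairPredictionSampler.sample behavior.
# ``PairGraph`` and the field names are illustrative only.

class PairGraph:
    """Toy stand-in for the pair graph returned by sample()."""
    def __init__(self, pairs, indices):
        self.src = [u for u, _ in pairs]
        self.dst = [v for _, v in pairs]
        # NOTE: these are indices into the node-pair set, NOT edge IDs
        # of the original graph -- the confusion flagged above.
        self.edata = {"_ID": list(indices)}

class NodePairPredictionSampler:
    def sample(self, graph, node_pairs, indices):
        """Take the graph, an Nx2 pair slice, and the slice indices."""
        pair_graph = PairGraph(node_pairs, indices)
        # A real sampler would also build the sampled input subgraph
        # (message-passing blocks), mirroring as_edge_prediction_sampler.
        return graph, pair_graph

sampler = NodePairPredictionSampler()
_, pg = sampler.sample("toy_graph", [(0, 3), (1, 4)], [10, 11])
```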
Additional context
This requirement surfaces from the implementation of SEAL on OGB datasets by @rudongyu.
EDIT (8/23): changed the issue and clarified the behavior of `as_node_pair_prediction_sampler`.