Simsso closed this issue 5 years ago
An example of an application where graph networks predict a binary variable from a graph: learning a SAT solver from single-bit supervision (NeuroSAT). They solve the "set to scalar" problem as follows:
After T iterations, we compute L(T)∗ ← Lvote(L(T)) ∈ R^2n, which contains a single scalar for each literal (the literal’s vote), and then we compute the average of the literal votes y(T) ← mean(L(T)∗) ∈ R. We train the network to minimize the sigmoid cross-entropy loss between the logit y(T) and the true label φ(P).
In essence, this translates to converting the graph's nodes to scalars and computing their mean. In DeepMind's graph-network paper, this would correspond to the global state of the graph.
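A minimal sketch of that readout, assuming the per-node scalar votes have already been computed (function names are hypothetical, not from the NeuroSAT code):

```python
import numpy as np

def graph_logit(node_votes):
    # Aggregate per-node scalar votes into a single graph-level logit
    # by averaging -- the "set to scalar" readout described above.
    return float(np.mean(node_votes))

def sigmoid_cross_entropy(logit, label):
    # Numerically stable sigmoid cross-entropy between a logit and a
    # binary label (same formulation TensorFlow uses internally).
    return max(logit, 0.0) - logit * label + float(np.log1p(np.exp(-abs(logit))))

votes = [2.0, -1.0, 0.5]              # hypothetical per-literal votes
y = graph_logit(votes)                # graph-level logit
loss = sigmoid_cross_entropy(y, 1.0)  # loss against true label 1
```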
The network is supposed to rank websites relative to each other. This problem is well known and there is plenty of related work; see the Wikipedia article on "Learning to rank" and the survey Learning to Rank for Information Retrieval by Tie-Yan Liu (2009).
Distillation of the "Learning to Rank" paper and other sources (referenced below):
**Pointwise approach.** Given a single data sample, predict its rank.
In our case, that could be (for example) a scalar in [0, 1], which is then projected to [1, 100,000] to obtain the actual predicted rank. The projection should be logarithmic to account for the fact that a rank difference matters far more among high ranks (e.g., 1 vs. 10) than among low ranks (e.g., 90,000 vs. 90,010).
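A sketch of such a logarithmic projection, assuming a maximum rank of 100,000 (names and constants are illustrative):

```python
import math

MAX_RANK = 100_000  # assumed rank range [1, 100000]

def scalar_to_rank(s):
    # Logarithmic projection of a network output s in [0, 1] onto
    # [1, MAX_RANK]: equal steps in s correspond to equal rank *ratios*,
    # so the mapping is much finer-grained near rank 1 than near 100,000.
    return round(MAX_RANK ** s)

def rank_to_scalar(r):
    # Inverse mapping, useful for constructing regression targets.
    return math.log(r) / math.log(MAX_RANK)
```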
Alternatively, the ranks can be grouped into bins, for instance (a) 1-100, (b) 101-1,000, (c) 1,001-10,000, (d) 10,001-100,000. That way, the pointwise approach becomes a classification problem (softmax output with cross-entropy loss against a one-hot encoded target).
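The binning could look like this (bin edges taken from the example above; helper names are hypothetical):

```python
import numpy as np

# Log-scale bin upper bounds (inclusive) from the example above.
BIN_EDGES = [100, 1_000, 10_000, 100_000]

def rank_to_bin(rank):
    # Map a rank to its bin index: (a)=0 for 1-100, (b)=1 for 101-1000, ...
    for i, upper in enumerate(BIN_EDGES):
        if rank <= upper:
            return i
    raise ValueError("rank out of range")

def one_hot(index, n=len(BIN_EDGES)):
    # One-hot target vector for the softmax cross-entropy loss.
    v = np.zeros(n)
    v[index] = 1.0
    return v
```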
Since the bins are ordinal (they have a natural order), solutions to ordinal classification problems should be taken into consideration. In this Kaggle challenge, people applied "A Neural Network Approach to Ordinal Regression" (paper). Instead of a softmax, the output units would be logistic (sigmoid).
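The core of that ordinal-regression approach is a cumulative target encoding for the logistic output units; a sketch assuming the four bins from above (names hypothetical):

```python
import numpy as np

NUM_BINS = 4

def ordinal_target(bin_index):
    # Cumulative encoding: bin k becomes k leading ones over
    # NUM_BINS - 1 logistic output units, e.g. bin 2 -> [1, 1, 0].
    t = np.zeros(NUM_BINS - 1)
    t[:bin_index] = 1.0
    return t

def decode_ordinal(sigmoid_outputs, threshold=0.5):
    # Predicted bin = number of units firing above the threshold.
    return int(np.sum(np.asarray(sigmoid_outputs) > threshold))
```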
For both bin approaches, it is important to compensate for the variation in sample count per class, e.g., by weighting the loss.
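One simple way to do that is inverse-frequency class weighting (a sketch, not the repo's implementation):

```python
from collections import Counter

def class_weights(labels):
    # Inverse-frequency weights, normalized so their weighted average
    # is 1: rare bins contribute proportionally more to the loss.
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}
```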
**Pairwise approach.** Given two data samples, predict their relative ranking.
The pairwise approach does not focus on accurately predicting the relevance degree of each document; instead, it cares about the relative order between two documents. In this sense, it is closer to the concept of "ranking" than the pointwise approach.
In mathematical terms, the model is a function f(x1, x2) that takes two samples and maps them to [-1, 1], where an output of -1 indicates that x1 has the higher rank and +1 that x2 does.
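One common way to obtain such an f is to share a scoring function between both inputs, which makes the comparison antisymmetric by construction; a minimal sketch (the linear scorer is a stand-in for the actual network):

```python
import numpy as np

def score(x, w):
    # Hypothetical shared scoring network (here just a linear model).
    return float(np.dot(w, x))

def pairwise_compare(x1, x2, w):
    # f(x1, x2) in [-1, 1]: negative means x1 ranks higher, positive
    # means x2 does. Sharing `score` guarantees f(a, b) == -f(b, a).
    return float(np.tanh(score(x2, w) - score(x1, w)))
```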
**Listwise approach.** Takes a list of samples and outputs either a list of ranks or a permutation vector.
The listwise approach can be divided into two sub-categories. For the first sub-category, the output space contains the relevance degrees of all the documents associated with a query (i.e., y), and the loss function is defined based on the approximation or bound of widely used IR evaluation measures. For the second sub-category, the output space contains the permutation of the documents associated with the same query (i.e., πy), and the loss function measures the difference between the permutation given by the hypothesis and the ground truth permutation.
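A concrete instance of a listwise loss is the ListNet "top-one" cross-entropy, which compares the probability distributions induced by predicted and true scores; a sketch (ListNet itself is not mentioned in the sources above):

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax.
    e = np.exp(v - np.max(v))
    return e / e.sum()

def listnet_loss(predicted_scores, true_scores):
    # ListNet-style listwise loss: cross-entropy between the "top-one"
    # distributions induced by the predicted and the true scores.
    p_true = softmax(np.asarray(true_scores, dtype=float))
    p_pred = softmax(np.asarray(predicted_scores, dtype=float))
    return float(-np.sum(p_true * np.log(p_pred)))
```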
Google has recently released the TF-Ranking library.
The focal loss automatically down-weights easy examples and focuses training on hard ones.
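A sketch of the binary focal loss (Lin et al.), where `gamma` controls how strongly easy examples are suppressed:

```python
import math

def focal_loss(p, y, gamma=2.0):
    # Focal loss: cross-entropy scaled by (1 - p_t)^gamma, where p_t is
    # the predicted probability of the true class. Confident correct
    # predictions (large p_t) are down-weighted; hard examples dominate.
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```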
Very important (less in terms of evaluation, more in terms of training) is the paper Learning to Rank using Gradient Descent (RankNet).
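The core idea of that paper can be sketched as a probabilistic pairwise loss over score differences (a minimal version, not the paper's full training procedure):

```python
import math

def ranknet_loss(s_i, s_j, p_target):
    # RankNet: the model scores items individually; the probability that
    # item i should rank above item j is sigmoid(s_i - s_j), trained with
    # cross-entropy against the target probability p_target in [0, 1].
    p = 1.0 / (1.0 + math.exp(-(s_i - s_j)))
    eps = 1e-12  # guard against log(0)
    return -p_target * math.log(p + eps) - (1 - p_target) * math.log(1 - p + eps)
```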
Extending the weighting with a base parameter:
Added in this commit: https://github.com/Simsso/Pagerank-Estimation-Deep-Graph-Networks/commit/858b892774ed14f64ff1254fb9f738bff53321f5
Design the estimator output (bins vs. relative comparison).
Design the loss function, depending on the chosen output.