Stochastic Gradient / Mirror Descent: Minimax Optimality and Implicit Regularization
Data-Driven Mirror Descent in Input-Convex Neural Networks
The above tells us that mirror descent on data $\{(x_i, y_i)\}$ with potential $\phi$ converges to $\operatorname*{argmin}_{w \in W^*} D_{\phi}(w \,\|\, w_0)$, where $W^*$ is the set of weights that interpolate the data and $w_0$ is the initialization.
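A minimal numerical sketch of this claim (my construction, not from the notes): for the simplest potential $\phi(w) = \tfrac{1}{2}\|w\|^2$, mirror descent reduces to plain gradient descent, and the Bregman projection of $w_0 = 0$ onto $W^*$ is the minimum-norm interpolator, which we can compare against in closed form. The function names (`mirror_descent`, `grad_phi_star`) are mine.

```python
import jax
import jax.numpy as jnp

def mirror_descent(grad_phi, grad_phi_star, loss_grad, w0, lr, steps):
    """Dual-space updates: z <- z - lr * dL/dw, with w = (grad phi)^{-1}(z)."""
    z = grad_phi(w0)
    for _ in range(steps):
        w = grad_phi_star(z)
        z = z - lr * loss_grad(w)
    return grad_phi_star(z)

# Overparameterized least squares: n = 10 samples, d = 50 weights, so many
# interpolators exist and W* is an affine subspace.
X = jax.random.normal(jax.random.PRNGKey(0), (10, 50))
y = jax.random.normal(jax.random.PRNGKey(1), (10,))
loss_grad = jax.jit(jax.grad(lambda w: 0.5 * jnp.sum((X @ w - y) ** 2)))

# phi(w) = ||w||^2 / 2: the mirror map and its inverse are the identity,
# so mirror descent is plain gradient descent.
w_md = mirror_descent(lambda w: w, lambda z: z, loss_grad, jnp.zeros(50),
                      lr=0.01, steps=5000)

# For this phi, argmin over W* of D_phi(w || 0) is the min-norm interpolator.
w_star = jnp.linalg.pinv(X) @ y
print(jnp.max(jnp.abs(w_md - w_star)))  # ~0 up to float32 tolerance
```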
Can we learn a useful data-dependent $\phi$ that improves generalization?
Linear model: $y_i = w_*^\top x_i + \varepsilon_i$, quadratic potential $\phi(w) = w^\top Q w$
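For this quadratic potential the Bregman divergence is the Mahalanobis distance $D_\phi(w \,\|\, w_0) = (w - w_0)^\top Q (w - w_0)$, the mirror map is $\nabla\phi(w) = 2Qw$, and mirror descent becomes gradient descent preconditioned by $(2Q)^{-1}$; from $w_0 = 0$ the predicted limit has the closed form $Q^{-1} X^\top (X Q^{-1} X^\top)^{-1} y$. A hedged sketch checking this numerically (the choice of $Q$ and all names are my assumptions):

```python
import jax
import jax.numpy as jnp

X = jax.random.normal(jax.random.PRNGKey(0), (10, 50))
y = jax.random.normal(jax.random.PRNGKey(1), (10,))
loss_grad = jax.jit(jax.grad(lambda w: 0.5 * jnp.sum((X @ w - y) ** 2)))

A = jax.random.normal(jax.random.PRNGKey(2), (50, 50))
Q = A @ A.T / 50 + jnp.eye(50)  # a random symmetric positive-definite Q
Q_inv = jnp.linalg.inv(Q)

# Mirror map grad phi(w) = 2 Q w, so the dual update is equivalent to
# gradient descent preconditioned by (2Q)^{-1}.
w = jnp.zeros(50)
for _ in range(5000):
    w = w - 0.02 * (0.5 * Q_inv @ loss_grad(w))

# Predicted limit: argmin_{Xw = y} w^T Q w, via Lagrange multipliers.
w_closed = Q_inv @ X.T @ jnp.linalg.solve(X @ Q_inv @ X.T, y)
print(jnp.max(jnp.abs(w - w_closed)))  # ~0 up to float32 tolerance
```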
Linear model, more general choices of $\phi$
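One concrete non-quadratic choice (my example, not necessarily the one intended here): the negative-entropy potential $\phi(w) = \sum_i w_i \log w_i$ on the positive orthant, whose Bregman divergence is a generalized KL divergence and whose mirror descent is the classic exponentiated-gradient update. The limit should then be the interpolator closest to $w_0$ in that KL sense.

```python
import jax
import jax.numpy as jnp

X = jax.random.normal(jax.random.PRNGKey(0), (10, 50))
y = jax.random.normal(jax.random.PRNGKey(1), (10,))
loss_grad = jax.jit(jax.grad(lambda w: 0.5 * jnp.sum((X @ w - y) ** 2)))

# grad phi(w) = 1 + log(w), so the dual update z <- z - lr * dL/dw becomes
# the multiplicative (exponentiated-gradient) update below; iterates stay
# in the positive orthant automatically.
w = jnp.ones(50)  # positive initialization is required
for _ in range(20000):
    w = w * jnp.exp(-0.005 * loss_grad(w))

print(0.5 * jnp.sum((X @ w - y) ** 2))  # should approach 0 for small enough lr
```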
More realistic data (e.g. MNIST, then scaling up), general $\phi$ (parameterized as an input-convex neural network, ICNN?)
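A sketch of how a general $\phi$ could be parameterized as an ICNN, following the construction of Amos et al. (2017): convexity in the input is enforced by nonnegative hidden-to-hidden weights and convex, nondecreasing activations. Learning the parameters, and handling the inverse mirror map $\nabla\phi^*$ (which a learned $\phi$ does not provide in closed form), is the open part; the shapes and names below are assumptions.

```python
import jax
import jax.numpy as jnp

def init_icnn(key, d, hidden=64):
    k1, k2, k3 = jax.random.split(key, 3)
    return {
        "W0": jax.random.normal(k1, (hidden, d)) / jnp.sqrt(d),
        "Wz": jax.random.normal(k2, (hidden, hidden)) / jnp.sqrt(hidden),
        "Wx": jax.random.normal(k3, (hidden, d)) / jnp.sqrt(d),
        "a":  jnp.ones(hidden) / hidden,
    }

def icnn_phi(params, w):
    """Scalar phi(w), convex in w: softplus is convex and nondecreasing,
    and abs() keeps the hidden-to-hidden weights nonnegative."""
    z = jax.nn.softplus(params["W0"] @ w)
    z = jax.nn.softplus(jnp.abs(params["Wz"]) @ z + params["Wx"] @ w)
    # A small quadratic term keeps phi strongly convex, so the mirror map
    # grad phi is invertible.
    return jnp.dot(jnp.abs(params["a"]), z) + 0.5 * jnp.dot(w, w)

params = init_icnn(jax.random.PRNGKey(0), d=50)
grad_phi = jax.jit(jax.grad(icnn_phi, argnums=1))  # mirror map by autodiff
print(grad_phi(params, jnp.zeros(50)).shape)  # (50,)
```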