NorbertZheng / read-papers

My paper reading notes.

ICML '18 | Accelerating natural gradient with higher-order invariance. #31

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Song Y, Song J, Ermon S. Accelerating natural gradient with higher-order invariance. ICML 2018.

NorbertZheng commented 1 year ago

Related Reference

NorbertZheng commented 1 year ago

Overview

An overview of Yang Song's ICML 2018 paper, Accelerating Natural Gradient with Higher-Order Invariance.

In this paper, we study the invariance of natural gradient from the perspective of Riemannian geometry and propose several new update rules to improve its invariance. Empirical results show that better invariance can result in faster convergence in several supervised learning, unsupervised learning and reinforcement learning applications.

NorbertZheng commented 1 year ago

Probabilistic models are important for inference and prediction in many areas of machine learning. In this post, we focus our discussion on parametric probabilistic models, i.e. models that are fully determined by specifying their parameters. Such a model may have several different parameterizations. For illustrative purposes, let's consider a simple linear regression model that captures the joint distribution of input $x$ and label $y$:

$$ p_{\theta}(x,y)=p(x)\mathcal{N}(y|{\theta}x+b,\sigma^{2}),\theta \in \mathbb{R}, $$

where $\theta$ is the parameter, and $b$, $\sigma$ are two fixed constants. By substituting $\theta$ with $2\mu$, a different parameterization can be created

$$ p_{\mu}(x,y)=p(x)\mathcal{N}(y|2{\mu}x+b,\sigma^{2}),\mu \in \mathbb{R}. $$

Note that even if $p_{\theta}(x,y)$ and $p_{\mu}(x,y)$ have different analytical forms, they actually represent the same model family. In particular, $p_{\theta=a}(x,y)=p_{\mu=\frac{a}{2}}(x,y)$ for any $a \in \mathbb{R}$.

Given a loss function $L(p(x,y))$, the goal of an optimization algorithm is to find the model $p^{*}(x,y)$ such that $L(p^{*}(x,y))$ is as small as possible, in the form of finding the optimal parameter $\theta^{*}$ or $\mu^{*}$ under some parameterization.

In the following section, we use the linear regression model as a running example to elaborate on why optimization can depend on the choice of parameterization.

NorbertZheng commented 1 year ago

Why optimization can depend on parameterization

Returning to the example of linear regression, a typical loss function is the expected negative log-likelihood, which can be formally written as

$$ L(p(x,y))=-\mathbb{E}_{x,y}[\log p(y \vert x)]. $$

For $p_{\theta}(x,y)$,

$$ L(p_{\theta}(x,y))=\mathbb{E}_{x,y}\left[\frac{1}{2\sigma^{2}}(\theta x+b-y)^{2}+\log\sqrt{2\pi}\sigma\right], $$

and for $p_{\mu}(x,y)$,

$$ L(p_{\mu}(x,y))=\mathbb{E}_{x,y}\left[\frac{1}{2\sigma^{2}}(2\mu x+b-y)^{2}+\log\sqrt{2\pi}\sigma\right]. $$

Consider using gradient descent to minimize the loss function. The gradients under the two parameterizations are

$$ \nabla_{\theta_{t}}L(p_{\theta_{t}}(x,y))=\mathbb{E}_{x,y}\left[\frac{x}{\sigma^{2}}(\theta_{t}x+b-y)\right], $$

$$ \nabla_{\mu_{t}}L(p_{\mu_{t}}(x,y))=\mathbb{E}_{x,y}\left[\frac{2x}{\sigma^{2}}(2\mu_{t}x+b-y)\right]. $$

Let's assume $\theta_{t}=2\mu_{t}=a$, i.e. at the $t$-th optimization step, $p_{\theta_{t}=a}(x,y)$ represents the same probabilistic model as $p_{\mu_{t}=\frac{a}{2}}(x,y)$. Even so, the gradients have different scales:

$$ \nabla_{\theta_{t}}L(p_{\theta_{t}}(x,y))\Big\vert_{\theta_{t}=a}=\frac{1}{2}\nabla_{\mu_{t}}L(p_{\mu_{t}}(x,y))\Big\vert_{\mu_{t}=\frac{a}{2}}, $$

and therefore, with learning rate $\alpha$, the gradient descent updates satisfy

$$ \theta_{t+1}=\theta_{t}-\alpha\nabla_{\theta_{t}}L(p_{\theta_{t}}(x,y))=2\mu_{t}-\frac{\alpha}{2}\nabla_{\mu_{t}}L(p_{\mu_{t}}(x,y)), $$

whereas

$$ 2\mu_{t+1}=2\mu_{t}-2\alpha\nabla_{\mu_{t}}L(p_{\mu_{t}}(x,y)), $$

thus we have

$$ \theta_{t+1}\neq 2\mu_{t+1}. $$

Hence the $(t+1)$-th optimization step results in different probabilistic models:

$$ p_{\theta_{t+1}}(x,y)\neq p_{\mu_{t+1}}(x,y). $$

This dependence on model parameterization can be undesirable, as it requires carefully choosing the parameterization to avoid hindering optimization.
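To see this numerically, here is a minimal Python sketch (not from the paper; the synthetic data, constants, and function names are ours for illustration) that runs plain gradient descent on the two parameterizations above and checks that the resulting models drift apart:

```python
import numpy as np

rng = np.random.default_rng(0)
b, sigma = 0.5, 1.0
x = rng.normal(size=1000)
y = 3.0 * x + b + sigma * rng.normal(size=1000)  # synthetic data

def grad_theta(theta):
    # d/d(theta) of the empirical E[(theta*x + b - y)^2 / (2 sigma^2)]
    return np.mean(x * (theta * x + b - y)) / sigma**2

def grad_mu(mu):
    # Same loss under the reparameterization theta = 2*mu
    return np.mean(2 * x * (2 * mu * x + b - y)) / sigma**2

lr, theta, mu = 0.1, 0.0, 0.0  # theta = 2*mu: identical initial models
for _ in range(10):
    theta -= lr * grad_theta(theta)
    mu -= lr * grad_mu(mu)

print(theta, 2 * mu)  # differ already after the first step: GD is not invariant
```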

We can view normalization methods of network activations, such as Batch Normalization, Layer Normalization, and Weight Normalization, as special parameterizations of the network to facilitate first-order gradient descent methods.

Luckily, this dependence on model parameterization is not universally true for every optimization method. For example, the Newton-Raphson method is invariant to affine transformations of model parameters. Therefore, for the linear regression case discussed above, the Newton-Raphson method will ensure that

$$ p_{\theta_{t+1}}(x,y)=p_{\mu_{t+1}}(x,y) $$

as long as $p_{\theta_{t}}(x,y)=p_{\mu_{t}}(x,y)$, because the new parameter $\mu=\frac{1}{2}\theta$ is an affine transformation of $\theta$.

Following this thread, we next discuss the Natural Gradient Method - an optimization method that uses a similar preconditioning matrix to that of the Newton-Raphson method, but is said to be invariant to model reparameterizations when the learning rate is small enough.

NorbertZheng commented 1 year ago

Natural Gradient Method

From the example of applying gradient descent to linear regression, we can easily see that switching between model parameterizations can cause the gradients and parameters to change in different ways. So how can we ensure that when the parameterization changes, the gradient will also change in a clever way to compensate for its effect on these optimization algorithms?

In 1998, inspired by research on information geometry, Shun-ichi Amari proposed the natural gradient to solve this problem. A full understanding of the natural gradient method requires some basic concepts from Riemannian Geometry. Although it is not required for understanding the rest of the post, we encourage interested readers to review the following introduction to Riemannian Geometry.

NorbertZheng commented 1 year ago

An introduction to Riemannian Geometry

The purpose of this text is to give readers the minimal background needed to understand the motivations behind natural gradient, invariance, and the proposed improvements in the later part of this post. Good theory starts with beautiful notation. We will first introduce the Einstein Summation Convention, a great way to eschew boilerplate in differential geometry formulas.

Einstein Summation Convention

This convention states that when an index variable appears exactly twice in a single term, once as a superscript and once as a subscript, summation over all possible values of that index is implied. For example,

$$ a^{\mu}b_{\mu}:=\sum_{\mu=1}^{n}a^{\mu}b_{\mu} $$

when the index variable $\mu \in [n]$. Note that here the superscripts are not exponents but indices of coordinates, coefficients, or basis vectors. We will write the summation symbol $\sum$ explicitly if we do not assume Einstein's Summation Convention on some index, for instance,

$$ \sum_{\nu=1}^{m}a^{\nu}b_{\nu}c^{\mu}d_{\mu}:=\sum_{\mu=1}^{n}\sum_{\nu=1}^{m}a^{\nu}b_{\nu}c^{\mu}d_{\mu}. $$
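For readers who prefer code, `np.einsum` implements exactly this convention. The following small sketch (our illustration, with made-up arrays) checks the contraction above and previews the index raising used later in this post:

```python
import numpy as np

n = 4
a, b = np.arange(n, dtype=float), np.ones(n)

# a^mu b_mu := sum_mu a[mu] * b[mu]  (the repeated index mu is contracted)
assert np.isclose(np.einsum('m,m->', a, b), np.sum(a * b))

# Raising an index with an (inverse) metric: g^{mu nu} d_nu
g_inv = np.eye(n)                        # stand-in for an inverse metric
d = np.linspace(0.0, 1.0, n)             # stand-in for a covector
raised = np.einsum('mn,n->m', g_inv, d)  # contracts nu, leaves mu free
assert np.allclose(raised, g_inv @ d)
```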

NorbertZheng commented 1 year ago

Riemannian Manifold

Riemannian Geometry is used to study the intrinsic properties of smooth manifolds equipped with a Riemannian metric.

In machine learning, we often describe the space of some probabilistic models as a manifold. Informally, a manifold $\mathcal{M}$ of dimension $n$ is a smooth space whose local regions resemble $\mathbb{R}^{n}$. Smoothness allows us to define a one-to-one local smooth mapping $\phi:\mathcal{U} \subset \mathcal{M} \to \mathbb{R}^{n}$, called a chart or coordinate system; for any $p \in \mathcal{U}$, $\phi(p)$ represents the coordinates of $p$. For instance, if $p$ is a parameterized distribution, $\phi(p)$ will refer to its parameters. Generally, a manifold may need a group of charts for a complete description.

NorbertZheng commented 1 year ago

Tangent and cotangent spaces

Local resemblance to $\mathbb{R}^{n}$ provides the foundation for defining a linear space attached to each point $p \in \mathcal{M}$, called the tangent space $\mathcal{T}_{p}\mathcal{M}$. Formally,

$$ \mathcal{T}_{p}\mathcal{M}:=\{\gamma'(0) \mid \gamma \text{ is a smooth curve } \gamma: \mathbb{R} \to \mathcal{M},\ \gamma(0)=p\} $$

is the collection of all vectors obtained by differentiating smooth curves through $p$, and consists of all tangent vectors to the manifold at point $p$. Each tangent vector $v \in \mathcal{T}_{p}\mathcal{M}$ can be written in a chart as $v=v^{\mu}\partial_{\mu}$, where the coefficients $v^{\mu}$ transform contravariantly under a change of coordinates. The cotangent space $\mathcal{T}_{p}^{*}\mathcal{M}$ is the dual space of linear functionals on $\mathcal{T}_{p}\mathcal{M}$; its elements (covectors, such as the differential $\partial_{\mu}L$ of a function $L$) transform covariantly.

NorbertZheng commented 1 year ago

Geodesics

Geodesics generalize straight lines to curved manifolds: they are curves $\gamma(t)$ that locally minimize distance and satisfy the geodesic equation $\ddot{\gamma}^{\mu}+\Gamma^{\mu}_{\alpha\beta}\dot{\gamma}^{\alpha}\dot{\gamma}^{\beta}=0$, where $\Gamma^{\mu}_{\alpha\beta}$ are the Christoffel symbols of the Levi-Civita connection.

NorbertZheng commented 1 year ago

Summary

As a summary, we provide a graphical illustration of relevant concepts in Riemannian Geometry.

(Figure: tangent_plane_plot - the tangent plane at a point of a manifold, illustrating charts and tangent vectors.)

From the above formulation, we observe that every concept and relation is defined via coordinates or coefficients. However, we need to discern the important ontological difference between an object and its coordinates.

Riemannian geometry studies intrinsic properties of manifolds via the lens of coordinates. As long as the coordinates are transformed consistently, i.e. covariant and contravariant quantities each transform with their respective rules, the objects and their relations will not change. This is where invariance emerges.

NorbertZheng commented 1 year ago

Relation to natural gradient method

Let

$$ r_{\theta}(z)=p_{\theta}(t \vert x)q(x) $$

be a probabilistic model parameterized by $\theta \in \Theta$, and

$$ L(r_{\theta})=-\mathbb{E}_{x \sim q}[\log p_{\theta}(t \vert x)] $$

be the expected negative log-likelihood. Here, $x$, $t$ are random variables, $z=(x,t)$ is their joint, and $q(x)$ is the marginal distribution of $x$, which is independent of $\theta$.

In the Riemannian Geometry framework, the set of all possible models $r_{\theta}$ constitutes a manifold $\mathcal{M}$, and the parameter $\theta$ provides a chart. Our goal is to find a model $r_{\theta^{*}}$ that minimizes the (empirical) loss $L(r_{\theta})$, which is a function defined on the manifold.

The well-known update rule of gradient descent,

$$ \theta_{k+1}^{\mu}=\theta_{k}^{\mu}-h\lambda\,\partial_{\mu}L(r_{\theta_{k}}), $$

can be viewed as approximately solving the ordinary differential equation (ODE)

$$ \dot{\theta}^{\mu}=-\lambda\,\partial_{\mu}L(r_{\theta}) $$

with the forward Euler method. Here $\lambda$ is a time scale constant, $h$ is the step size, and their product $h\lambda$ is the learning rate.

However, the gradient descent ODE is not invariant to reparameterizations. For example, if we rescale $\theta^{\mu}$ to $2\theta^{\mu}$, $\partial_{\mu}L(r_{\theta})$ will be downscaled to $\frac{1}{2}\partial_{\mu}L(r_{\theta})$.

This is more evident from a differential geometric point of view. As can be verified by the chain rule, $\dot{\theta}^{\mu}$ transforms contravariantly and is therefore a vector in $\mathcal{T}_{p}\mathcal{M}$, whereas $\partial_{\mu}L(r_{\theta})$ transforms covariantly and is a covector in $\mathcal{T}_{p}^{*}\mathcal{M}$. Accordingly, the l.h.s. and r.h.s. of the ODE transform with different rules and do not type check. As a result, optimizing $L(r_{\theta})$ with gradient descent is using a parameterization-dependent method to solve a parameterization-independent problem, which is aesthetically unsatisfying.

Natural gradient alleviates this issue by approximately solving an invariant ODE. Recall that we can raise or lower an index given a metric tensor $g_{\mu\nu}$; here we choose the Fisher information metric, given by

$$ g_{\mu\nu}=\mathbb{E}_{x \sim q}\,\mathbb{E}_{p_{\theta}(t \vert x)}[\partial_{\mu}\log p_{\theta}(t \vert x)\,\partial_{\nu}\log p_{\theta}(t \vert x)]. $$

By raising the index of $\partial_{\mu}L(r_{\theta})$, the r.h.s. of the gradient descent ODE becomes a vector in $\mathcal{T}_{p}\mathcal{M}$ and both sides transform contravariantly. The new invariant ODE is now

$$ \dot{\theta}^{\mu}=-\lambda g^{\mu\nu}\,\partial_{\nu}L(r_{\theta}), $$

and the forward Euler approximation becomes

$$ \theta_{k+1}^{\mu}=\theta_{k}^{\mu}-h\lambda g^{\mu\nu}\,\partial_{\nu}L(r_{\theta_{k}}), $$

which is the traditional natural gradient update.
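In matrix form, one such forward Euler step simply preconditions the gradient by the inverse metric. Here is a minimal sketch, assuming callables that return the gradient and the Fisher matrix at a given parameter (all names are ours, not an API from the paper):

```python
import numpy as np

def natural_gradient_step(theta, grad_fn, fisher_fn, h=0.1, lam=1.0, eps=1e-4):
    """One forward Euler step of the natural gradient ODE."""
    g = grad_fn(theta)                # partial_mu L, a covector
    F = fisher_fn(theta)              # the metric g_{mu nu}
    F = F + eps * np.eye(len(theta))  # damping, standard in practice
    v = np.linalg.solve(F, g)         # raise the index: g^{mu nu} partial_nu L
    return theta - h * lam * v
```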

NorbertZheng commented 1 year ago

Intuitively, natural gradient is not the gradient with respect to the model parameters; rather, it is the gradient with respect to the probabilistic model itself.

As opposed to the traditional gradient, which considers how perturbations of the model parameters affect the loss function, natural gradient considers how perturbations of the model distribution itself, measured locally by KL divergence, affect the loss function.

Because the model family does not change with respect to parameterizations, the natural gradient will also remain the same. This means that the form of the natural gradient will transform appropriately between parameterizations, automatically ensuring that the natural gradient under two different parameterizations corresponds to the same "gradient entity" on the manifold.

NorbertZheng commented 1 year ago

Mathematically, suppose the loss function is $L(p_{\theta}(x,y))$; we can write the natural gradient as

$$ \widetilde{\nabla_{\theta}}L(p_{\theta}(x,y)):=F_{\theta}^{-1}\nabla_{\theta}L(p_{\theta}(x,y)), $$

where $F_{\theta}$ is the Fisher information matrix for $p_{\theta}(x,y)$, i.e.,

$$ [F_{\theta}]_{i,j}:=\mathbb{E}_{p_{\theta}(x,y)}\left[\left(\frac{\partial}{\partial \theta_{i}}\log p_{\theta}(x,y)\right)\left(\frac{\partial}{\partial \theta_{j}}\log p_{\theta}(x,y)\right)\right], $$

and we use a tilde ($\sim$) to differentiate the notation of the natural gradient from that of the ordinary gradient.

As a sanity check, let's see how natural gradient eliminates the non-invariance of gradient descent for our linear regression example. For different parameterizations of Gaussian conditional distributions, the derivatives of the log densities are respectively

$$ \begin{aligned} \frac{\partial}{\partial \theta} \log p_{\theta}(x,y) &= \frac{(y-\theta x-b)x}{\sigma^{2}}, \\ \frac{\partial}{\partial \mu} \log p_{\mu}(x,y) &= \frac{2(y-2\mu x-b)x}{\sigma^{2}}. \end{aligned} $$

Therefore, supposing $\theta_{t}=2\mu_{t}=a$, we have

$$ \frac{\partial}{\partial \mu} \log p_{\mu}(x,y) \Big\vert_{\mu=\mu_{t}} = 2\, \frac{\partial}{\partial \theta} \log p_{\theta}(x,y) \Big\vert_{\theta=\theta_{t}} $$

and therefore

$$ \mathbf{F}_{\mu=\mu_{t}} = 4\,\mathbf{F}_{\theta=\theta_{t}}. $$

From our previous analysis of gradient descent, we already know that

$$ \nabla_{\mu} L(p_{\mu}(x, y))\Big\vert_{\mu=\mu_{t}} = 2\, \nabla_{\theta} L(p_{\theta}(x, y))\Big\vert_{\theta=\theta_{t}}. $$

As a result, the natural gradient scales as

$$ \widetilde{\nabla_{\mu}} L(p_{\mu}(x,y))\Big\vert_{\mu=\mu_{t}} = \frac{1}{2}\,\widetilde{\nabla_{\theta}} L(p_{\theta}(x,y))\Big\vert_{\theta=\theta_{t}}, $$

which indicates invariance of the optimization step:

$$ \begin{aligned} \theta_{t+1} &= \theta_{t} - \alpha\, \widetilde{\nabla_{\theta_{t}}} L(p_{\theta_{t}}(x,y)) \\ &= 2\mu_{t} - 2\alpha\, \widetilde{\nabla_{\mu_{t}}} L(p_{\mu_{t}}(x,y)) \\ &= 2\mu_{t+1}. \end{aligned} $$
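This scaling argument is easy to check numerically. In the sketch below (ours, with synthetic data), we use the fact that for this Gaussian model the Fisher informations are $F_{\theta}=\mathbb{E}[x^{2}]/\sigma^{2}$ and $F_{\mu}=4\,\mathbb{E}[x^{2}]/\sigma^{2}$, and verify that one natural gradient step lands on the same model under both parameterizations:

```python
import numpy as np

rng = np.random.default_rng(1)
b, sigma, lr = 0.5, 1.0, 0.5
x = rng.normal(size=100_000)
y = 3.0 * x + b + sigma * rng.normal(size=x.size)

Ex2 = np.mean(x**2)
F_theta = Ex2 / sigma**2   # Fisher information under theta
F_mu = 4 * Ex2 / sigma**2  # under mu = theta/2: 4x larger

theta = 1.0
mu = theta / 2             # identical initial models
g_theta = np.mean(x * (theta * x + b - y)) / sigma**2
g_mu = np.mean(2 * x * (2 * mu * x + b - y)) / sigma**2  # = 2 * g_theta

theta_next = theta - lr * g_theta / F_theta
mu_next = mu - lr * g_mu / F_mu

print(theta_next, 2 * mu_next)  # equal up to floating point: invariance holds
```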

NorbertZheng commented 1 year ago

More generally, consider an ordinary differential equation (ODE) that describes how the parameter $\theta$ should change as a function of $t$, following the natural gradient smoothly:

$$ \frac{d \theta}{d t} = - \lambda\, \widetilde{\nabla_{\theta}} L(p_{\theta}(x,y)). $$

Here $\lambda$ is just a prespecified scaling constant. If we change the parameterization from $\theta$ to $\mu$, we get a similar ODE:

$$ \frac{d \mu}{d t} = - \lambda\, \widetilde{\nabla_{\mu}} L(p_{\mu}(x,y)). $$

The following observation on invariance can be derived with knowledge of information theory and Riemannian Geometry:

Suppose $\mu$ is a smooth, invertible transformation of $\theta$. If

$$ p_{\theta(t=0)}(x,y) \equiv p_{\mu(t=0)}(x,y), $$

and $\theta(t)$, $\mu(t)$ are solutions of the corresponding natural gradient ODEs, then we have

$$ p_{\theta(t=a)}(x,y) \equiv p_{\mu(t=a)}(x,y) $$

for any $a \geq 0$.

NorbertZheng commented 1 year ago

However, it is usually intractable to solve such a complicated ODE exactly for practical optimization tasks, and various approximation schemes must be used. From the perspective of numerical methods for ODEs, the naive natural gradient descent method

$$ \theta_{t+1} = \theta_{t} - h \lambda\, \widetilde{\nabla_{\theta_{t}}} L(p_{\theta_{t}}(x,y)) $$

can be viewed as approximately solving the ODE

$$ \frac{d \theta}{d t} = - \lambda\, \widetilde{\nabla_{\theta}} L(p_{\theta}(x,y)) $$

using the forward Euler method with step size $h$.

NorbertZheng commented 1 year ago

The following picture illustrates this point more clearly.

(Figures: curved_path, straight_path_1, straight_path_2 - optimization trajectories on the manifold (left), parameter space $\Theta$ (middle), and another parameter space $\Xi$ (right). The red dotted arrow represents a truly invariant trajectory, while the light and dark blue solid arrows represent trajectories of vanilla natural gradient updates under different parameterizations.)

Suppose we are doing optimization on a manifold of probabilistic models $\mathcal{M}$, and consider two different parameterizations with parameter spaces denoted as $\Theta$ and $\Xi$ respectively. The red arrow shows the invariant optimization trajectory obtained by accurately solving the natural gradient ODE. The light and dark blue arrows depict the optimization trajectory of the vanilla natural gradient descent under two different parameterizations. For the red arrows, we show the trajectory on manifold $\mathcal{M}$, as well as trajectories in parameter space $\Theta$ and $\Xi$. For the blue arrows, we only show trajectories on $\mathcal{M}$ and the parameter space.

Because vanilla natural gradient descent is a linear update, the blue trajectories in parameter space $\Theta$ and $\Xi$ are straight lines. However, in order to obtain the invariant trajectory (red arrow), the update has to be curved in parameter space. The update of the vanilla natural gradient descent is straight in the parameter space, and therefore its trajectory will not be truly invariant on $\mathcal{M}$. For more invariant solutions, we need more complicated update rules that can perform curved updates in parameter space as well.

Because the natural gradient ODE solution is invariant, a straightforward approach to improving the invariance of the optimization is to exploit more accurate solvers. In numerical analysis, the accuracy of a solver is measured by its order of convergence to the exact solution as the step size $h \to 0$. We say that a solver is $d$-th order accurate if the error of its computed solution decreases as $O(h^{d})$ when $h \to 0$.

The forward Euler solver used in the traditional natural gradient update is a simple first-order solver.

In a similar way, we can also measure the "invariance order" of a numerical ODE solver. We say that a solver is $d$-th order invariant if, starting from the same model, its updates under any two different parameterizations produce models that differ only at order $O(h^{d+1})$.

Because the exact solution of the natural gradient ODE is invariant, any $d$-th order accurate solver is also $d$-th order invariant. We therefore have two ways to improve the invariance of the natural gradient update: using more accurate solvers, and using solvers that are intrinsically more invariant.

NorbertZheng commented 1 year ago

Numerical ODE solvers that achieve better invariance

We discuss two approaches to improve the invariance of the natural gradient update.

More Accurate Solvers

The natural gradient ODE has an invariant solution; we just need to improve the accuracy of our numerical ODE solver so that the optimization trajectory is closer to the invariant one. In the paper, we discussed using a second-order solver called the Midpoint Method as a replacement for the first-order forward Euler method.

The Midpoint method approximately solves the natural gradient ODE in three steps. First, take a half-step from $\theta_{t}$:

$$ \theta_{t+1/2} = \theta_{t} - \frac{1}{2} h \lambda\, \widetilde{\nabla_{\theta_{t}}} L(p_{\theta_{t}}(x,y)). $$

Then, evaluate the natural gradient at the midpoint:

$$ \widetilde{\nabla_{\theta_{t+1/2}}} L(p_{\theta_{t+1/2}}(x, y)). $$

Finally, take the full step from $\theta_{t}$ using the midpoint natural gradient:

$$ \theta_{t+1} = \theta_{t} - h \lambda\, \widetilde{\nabla_{\theta_{t+1/2}}} L(p_{\theta_{t+1/2}}(x,y)). $$
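The three steps translate directly into code. A minimal sketch, where `nat_grad` is an assumed callable returning $\widetilde{\nabla_{\theta}}L$ at a given $\theta$ (the name is ours):

```python
def midpoint_update(theta, nat_grad, h=0.1, lam=1.0):
    # Step 1: half-step using the natural gradient at theta_t.
    theta_half = theta - 0.5 * h * lam * nat_grad(theta)
    # Step 2: re-evaluate the natural gradient at the midpoint.
    g_half = nat_grad(theta_half)
    # Step 3: full step from theta_t using the midpoint natural gradient.
    return theta - h * lam * g_half
```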

NorbertZheng commented 1 year ago

More invariant solvers

Some solvers can be more invariant by making their update rules more agnostic to reparameterizations, even though they are not necessarily more accurate than the forward Euler method. In the paper, we investigated a first-order solver called the Riemannian Euler Method, which is exactly invariant to reparameterizations when approximately solving the natural gradient ODE. In other words, when starting from the same initial probabilistic model, the Riemannian Euler method will return the same final probabilistic model after a fixed number of iterations, as long as the learning rate is fixed.

When applied to solving the natural gradient ODE, the Riemannian Euler method uses the following update rule:

$$ \theta_{t+1} = \operatorname{Exp}\left(\theta_{t},\ -h \lambda\, \widetilde{\nabla_{\theta_{t}}} L(p_{\theta_{t}}(x, y))\right), $$

where $\operatorname{Exp}(\cdot, \cdot)$ is the Exponential Map, a concept related to geodesics in Riemannian geometry. Unfortunately, evaluating $\operatorname{Exp}(\cdot, \cdot)$ requires calculating geodesics on the manifold of probabilistic models, which is usually intractable.

The good news is that we do not have to evaluate the exponential map exactly. Leveraging the geodesic equation - an ODE governing geodesic lines on a manifold - we can obtain a second-order approximation to the exponential map. Equipped with this approximation, the update rule now becomes

$$ \theta_{t+1} = \theta_{t} - h \lambda\, \widetilde{\nabla_{\theta_{t}}} L(p_{\theta_{t}}(x,y)) - \frac{1}{2}(h\lambda)^{2}\, \Gamma_{\theta_{t}}\left(\widetilde{\nabla_{\theta_{t}}} L,\ \widetilde{\nabla_{\theta_{t}}} L\right). $$

Here $\Gamma_{\theta_{t}}(\cdot, \cdot)$ represents the result of applying the Levi-Civita connection to the input vectors. We name this update rule natural gradient update with geodesic correction.

An interesting observation is that the naive natural gradient update amounts to using the Riemannian Euler method with a first-order approximation to $\operatorname{Exp}(\cdot, \cdot)$. This observation confirms that geodesic correction can provide more invariance (asymptotically).
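In code, the geodesic-corrected step adds one curvature term to the naive update. A sketch under the same assumptions as before, where `christoffel(theta, v)` is an assumed callable returning $\Gamma_{\theta}(v, v)$ (both names are ours):

```python
def geodesic_corrected_update(theta, nat_grad, christoffel, h=0.1, lam=1.0):
    v = nat_grad(theta)
    # Second-order Taylor approximation of Exp(theta, -h*lam*v):
    # the straight Euler step plus a curvature correction term.
    return theta - h * lam * v - 0.5 * (h * lam)**2 * christoffel(theta, v)
```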

NorbertZheng commented 1 year ago

The new numerical ODE solvers we have explored so far are all computationally more expensive than the naive natural gradient update.

Both methods require about twice the computational cost of the naive natural gradient update. Luckily, we discovered that there exists a faster method for doing geodesic correction which still ensures a second-order approximation to the exponential map, preserving invariance. This faster geodesic correction is roughly as computationally expensive as the naive natural gradient update. Please refer to our paper for more details.

NorbertZheng commented 1 year ago

Overview of experimental results

Finally, we give an overview of the experimental results. We first examine a toy problem to demonstrate that our methods achieve better invariance, then show that they lead to improved optimization for several tasks in various areas of machine learning, including supervised/unsupervised learning and reinforcement learning.

Testing invariance on a toy problem

We experiment with different natural gradient updates to fit a univariate Gamma distribution under different parameterizations. The canonical parameterization of a Gamma distribution is

$$ p(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}, $$

where $\Gamma(\cdot)$ denotes the Gamma function.

Aside from this canonical parameterization, we test three other parameterizations of the same distribution family.

Next, we test all the natural gradient updates to see how well they can fit the parameters of a univariate Gamma distribution under different parameterizations. In the following figure, we show how the negative log-likelihood (our training loss) decreases with respect to training iterations. We use $ng$ to denote the naive natural gradient update, $mid$ for the midpoint integrator, $geo$ for geodesic correction, $geo_f$ for the faster version of geodesic correction, $ng(exact)$ for solving the natural gradient ODE exactly, and $geo(exact)$ for the exact Riemannian Euler method (without approximating the exponential map).

(Figure: training curves for fitting a Gamma distribution under four different parameterizations.)

As the theory predicts, the curves of $ng(exact)$ and $geo(exact)$ are the same for all four different parameterizations. All of our proposed methods - $mid$, $geo$, and $geo_f$ - are closer to the exact invariant optimization curves. In stark contrast, the naive natural gradient update leads to different curves under different parameterizations and yields slower convergence.
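To give a feeling for what the $ng$ baseline in this experiment computes, here is a self-contained sketch (ours, not the paper's code) that fits a Gamma distribution by naive natural gradient descent in the canonical parameterization, using the closed-form Fisher matrix of the Gamma family:

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(2)
data = rng.gamma(shape=2.0, scale=1.0 / 3.0, size=10_000)  # alpha=2, beta=3
mean_x, mean_logx = data.mean(), np.log(data).mean()

alpha, beta, lr = 1.0, 1.0, 0.5
for _ in range(100):
    # Gradient of the average negative log-likelihood of Gamma(alpha, beta).
    g = np.array([digamma(alpha) - np.log(beta) - mean_logx,
                  mean_x - alpha / beta])
    # Closed-form Fisher information matrix of Gamma(alpha, beta).
    F = np.array([[polygamma(1, alpha), -1.0 / beta],
                  [-1.0 / beta, alpha / beta**2]])
    alpha, beta = np.array([alpha, beta]) - lr * np.linalg.solve(F, g)

print(alpha, beta)  # should approach roughly (2, 3)
```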

NorbertZheng commented 1 year ago

Fitting deep auto-encoders and deep classifiers

We then show the results of training deep autoencoders and classifiers on the MNIST dataset. The following figure illustrates how the different optimization algorithms fare against each other. The bottom x-axis corresponds to the number of iterations during training, while the top x-axis corresponds to wall-clock time. The y-axis shows the training error. Each optimization algorithm has two curves of the same color: one solid and one dashed. We can observe that the improved natural gradient updates converge faster in terms of the number of iterations, and the faster geodesic correction also achieves lower loss in shorter wall-clock time, compared to the naive update.

(Figure: training curves for deep autoencoders and classifiers on MNIST.)

NorbertZheng commented 1 year ago

Model-free Reinforcement learning for continuous control

Finally, we show that our techniques also improve the performance of natural gradient-based optimization algorithms in reinforcement learning. In the experiments, we apply our techniques to ACKTR, a natural gradient algorithm for policy optimization in reinforcement learning. The results on the HalfCheetah environment are shown in the following figure, where we obtain higher rewards faster after combining our techniques with ACKTR.

(Figure: reward curves on HalfCheetah for ACKTR with and without our improvements.)

NorbertZheng commented 1 year ago

Conclusion

In this work, we analyze the invariance of optimization algorithms. Based on our analysis, we propose several methods to improve the invariance of the natural gradient update. We empirically show that better invariance leads to faster optimization in many machine learning tasks.