Automatic Differentiation

chufanchen commented 7 months ago

Leibniz's notation: $\frac{d^nf}{dx^n}$, $df=f'(x) \cdot dx$
Lagrange's notation: $f^{(n)}(x)$
D-notation: $(Df)(x)=\frac{df}{dx}(x)$
Newton's notation: $\dot{y}$

Confusing Terminology

Automatic differentiation(autodiff)
Backpropagation
Autograd

Backpropagation is an instance of reverse mode automatic differentiation.

\frac{\partial f(\mathbf{x})}{\partial x_i} \approx \frac{f\left(\mathbf{x}+h \mathbf{e}_i\right)-f(\mathbf{x})}{h}

Autodiff is not finite differences(only use it for testing). Numerical approximations of derivatives are inherently ill-conditioned and unstable("thou shalt not add small numbers to big numbers", and "thou shalt not subtract numbers which are approximately equal"). This is due to the introduction of truncation and round-off error.

Truncation error is the error of approximation, or inaccuracy, one gets from $h$ not actually being zero. It is proportional to a power of $h$.

Round-off error is the inaccuracy one gets from valuable low-order bits of the final answer having to compete for machine-word space with high-order bits of $f(\mathbf{x} + h \mathbf{e}_i)$ and $f(\mathbf{x})$, which the computer has to store just until they cancel in the subtraction at the end. Round-off error is inversely proportional to a power of $h$.

Autodiff is not symbolic differentiation.

An autodiff system will convert the program into a sequence of primitive operations(ops) which have specified routines for computing derivatives. In this representation, backprop can be done in a completely mechanical way. This is “just” a clever and efficient use of the Chain Rule for derivatives.

Example: Jax(autograd), Pytorch, MxNet

chufanchen commented 7 months ago

We adopt the three-part notation used by [Griewank, 2008] where a function $f:\mathbb{R}^n \rightarrow \mathbb{R}^m$ is constructed using intermediate variables $v_i$ such taht:

variables $v_{i-n}=x_i$, $i=1,\ldots,n$ are input variables,
variables $v_i$ $i=1, \ldots, l$ are the working (intermediate) variables, and
variables $y{m-i}=v{l-i}$, $i=m-1, \ldots, 0$ are the output variables.

Automatic differentiation proceeds in two stages. First, function definition $f$ in the sublanguage are converted to a format called a Wengert list[Wengert, 1964](also called an evaluation trace). The second stage is to evaluate the function and its gradients using the Wengert list.

Example: $y=f(x_1,x_2)=ln(x_1)+x_1x_2-sin(x_2)$, compute $\frac{\partial y}{\partial x_1}$ using forward mode AD

The left hand side of the table is the Wengert list(or evaluation trace of elementary operations).

Forward Mode

AD in forward accumulation mode(also called tangent linear mode) is the conceptually most simple type. We want to compute the Jacobian of a function $f:\mathbb{R}^n \rightarrow \mathbb{R}^m$ with $n$ independent(input) variables $x_i$ and $m$ dependent (output) variables $y_j$.

In this case, each forward pass of AD is initialized by setting only one of the variables $\dot{x}_i = 1$ and setting the rest to zero (in other words, setting $\dot{\mathbf{x}} = \mathbf{e}_i$, where $\mathbf{e}_i$ is the $i$-th unit vector). $n$ runs of the code with specific input values $\mathbf{x} = \mathbf{a}$ then computes the Jacobian matrix

\mathbf{J}_f=\left.\left[\begin{array}{ccc}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{array}\right]\right|_{\mathbf{x}=\mathbf{a}}

Furthermore, forward mode AD provides a very efficient and matrix-free way of computing Jacobian–vector products $J_f \mathbf{r}$ simply by initializing with $\dot{\mathbf{x}}=r$. As a special case, when $f:\mathbb{R}^n \rightarrow \mathbb{R}$, we can obtain the directional derivative along a given vector $\mathbf{r}$ as a linear combination of the partial derivatives $\nabla f \cdot \mathbf{r}$

Forward mode $\mathrm{AD}$ is efficient and straightforward for functions $f: \mathbb{R} \rightarrow \mathbb{R}^m$, as all the derivatives $\frac{d y_i}{d x}$ can be computed with just one forward pass. Conversely, in the other extreme of $f: \mathbb{R}^n \rightarrow \mathbb{R}$, forward mode $\mathrm{AD}$ requires $n$ evaluations to compute the gradient

\nabla f=\left(\frac{\partial y}{\partial x_1}, \ldots, \frac{\partial y}{\partial x_n}\right),

which also corresponds to a $1 \times n$ Jacobian matrix that is built one column at a time with the forward mode in $n$ evaluations.

Reverse mode

AD in the reverse accumulation mode(also called adjoint or cotangent linear mode) corresponds to a generalized backpropagation algorithm, in that it propagates derivatives backward from a given output.

This is done by complementing each intermediate variable $v_i$ with an adjoint

\bar{v_i}=\frac{\partial y_j}{\partial v_i}

which represents the sensitivity of a considered output $y_j$ with respect to changes in $v_i$.

In reverse mode AD, derivatives are computed in the second phase of a two-phase process. In the first phase, the original function code is run forward, populating intermediate variables $v_i$ and recording the dependencies in the computational graph through a book-keeping procedure. In the second phase, derivatives are calculated by propagating adjoints $\bar{v}_i$ in reverse, from the outputs to the inputs.

An important advantage of the reverse mode is that it is significantly less costly to evaluate (in terms of operation count) than the forward mode for functions with a large number of inputs.

Because machine learning practice principally involves the gradient of a scalar-valued objective with respect to a large number of parameters, this establishes the reverse mode, as opposed to the forward mode, as the mainstay technique in the form of the backpropagation algorithm.

In general, for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$, if we denote the operation count to evaluate the original function by $ops(f)$, the time it takes to calculate the $m \times n$ Jacobian by the forward mode is $nc \text{ops}(f)$, whereas the same computation can be done via reverse mode in $mc \text{ops}(f)$, where $c$ is a constant guaranteed to be $c<6$ and typically $c \sim[2,3]$ ([Griewank, 2008]). That is to say, reverse mode AD performs better when $m \ll n$.

Similar to the matrix-free computation of Jacobian-vector products with forward mode, reverse mode can be used for computing the transposed Jacobian-vector product

\mathbf{J}_f^{\top} \mathbf{r}=\left[\begin{array}{ccc}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{array}\right]\left[\begin{array}{c}
r_1 \\
\vdots \\
r_m
\end{array}\right],

by initializing the reverse phase with $\overline{\mathbf{y}}=\mathbf{r}$.

The advantages of reverse mode $\mathrm{AD}$, however, come with the cost of increased storage requirements growing (in the worst case) in proportion to the number of operations in the evaluated function. It is an active area of research to improve storage requirements in implementations by using advanced methods such as checkpointing strategies and data-flow analysis ([Dauvergne and Hascoët, 2006]; [Siskind and Pearlmutter, 2017]).

Building the Computation Graph

Node: A Node is an abstract class that represents an operation taking zero or more input Variables and producing zero or more output Variables.

Autograd's Node class (defined in tracer.py ) represents a node of the computation graph. It has attributes:

value, the actual value computed on a particular set of inputs
fun, the primitive operation defining the node
args and kwargs, the arguments the op was called with
parents, the parent Nodes

[Griewank, 2008]: Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation

[Wengert, 1964]: A simple automatic derivative evaluation program

[Dauvergne and Hascoët, 2006]: The data-flow equations of checkpointing in reverse automatic differentiation

[Siskind and Pearlmutter, 2017]: Divide-and-conquer checkpointing for arbitrary programs with no user annotation

chufanchen commented 7 months ago

The backprop equation (single child node) can be written as a vector-Jacobian product (VJP):

\overline{x_j}=\sum_i \overline{y_i} \frac{\partial y_i}{\partial x_j} \quad \overline{\mathbf{x}}=\overline{\mathbf{y}}^{\top} \mathbf{J}

That gives a row vector. We can treat it as a column vector by taking

\overline{\mathbf{x}}=\mathbf{J}^{\top} \overline{\mathbf{y}}

we never explicitly construct the Jacobian. It’s usually simpler and more efficient to compute the VJP directly.

Consider a naïve backprop implementation where the $\mathbf{z}$ module needs to compute $\overline{\mathbf{z}}$ using the formula:

\overline{\mathbf{z}}=\frac{\partial \mathbf{r}}{\partial \mathbf{z}} \overline{\mathbf{r}}+\frac{\partial \mathbf{s}}{\partial \mathbf{z}} \overline{\mathbf{s}}+\frac{\partial \mathbf{t}}{\partial \mathbf{z}} \overline{\mathbf{t}}

This breaks modularity, since $\mathbf{z}$ needs to know how it's used in the network in order to compute partial derivatives of $\mathbf{r}, \mathbf{s}$, and $\mathbf{t}$.

Backprop as message passing:

Each node receives a bunch of messages from its children, which it aggregates to get its error signal. It then passes messages to its parents.
Each of these messages is a VJP.
This formulation provides modularity: each node needs to know how to compute its outgoing messages, i.e. the VJPs corresponding to each of its parents (arguments to the function).
The implementation of z doesn’t need to know where z came from.

For each primitive operation, we must specify VJPs for each of its arguments.

defvjp (defined in core.py) is a convenience routine for registering VJPs.

https://github.com/mattjj/autodidact/blob/c32947fbd7f130d791ab1e277a0feed05cb3da5f/autograd/numpy/numpy_vjps.py#L19

The backward pass is defined in core.py. The argument g is the error signal for the end node.

chufanchen commented 7 months ago

mxnet's AD

SimpleBind:

arg: input arguments array
req: requirement type of gradient saving
aux: op's internal state array

chufanchen commented 7 months ago

Jax's AD

$grad::(\mathbb{R}^n \rightarrow \mathbb{R}) \rightarrow (\mathbb{R}^n \rightarrow \mathbb{R}^n)$

chufanchen / read-paper-and-code