Think about matrices holistically, not just as tables of numbers.
Differentials as linearization
Consider $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$. Generalizing from the limit definition of a derivative, we could write the linear approximation form
\Delta \mathbf{f}=\left(\frac{d \mathbf{f}}{d \mathbf{x}}\right) \cdot \Delta \mathbf{x}
As $\Delta \mathbf{f}$ is an $m \times 1$ vector and $\Delta \mathbf{x}$ is an $n \times 1$ vector, $\frac{d \mathbf{f}}{d \mathbf{x}}$ in this expression must be an $m \times n$ matrix.
However, we might instead want $\Delta \mathbf{f}$ to come from a dot-product-style multiplication of $\frac{d \mathbf{f}}{d \mathbf{x}}$ with $\Delta \mathbf{x}$, i.e. with the derivative transposed, since a dot product is typically how vectors are multiplied:
\Delta \mathbf{f}=\left(\frac{d \mathbf{f}}{d \mathbf{x}}\right)^T \cdot \Delta \mathbf{x}
As $\left(\frac{d \mathbf{f}}{d \mathbf{x}}\right)^T$ has dimension $m \times n,\left(\frac{d \mathbf{f}}{d \mathbf{x}}\right)$ has dimension $n \times m$.
The fact that two different equations can represent the same concept is the basis for the two layout conventions in multivariable differentiation.
Numerator layout:
\frac{d \mathbf{f}}{d \mathbf{x}}=\left[\begin{array}{llll}
\frac{\partial \mathbf{f}}{\partial x_1} & \frac{\partial \mathbf{f}}{\partial x_2} & \cdots & \frac{\partial \mathbf{f}}{\partial x_n}
\end{array}\right] .
\frac{d \mathbf{f}}{d \mathbf{x}}=\left[\begin{array}{cccc}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{array}\right]=\left[\frac{\partial f_i}{\partial x_j}\right]_{1 \leq i \leq m, 1 \leq j \leq n} .
Denominator layout:
\frac{d \mathbf{f}}{d \mathbf{x}}=\left[\begin{array}{llll}
\frac{d f_1}{d \mathbf{x}} & \frac{d f_2}{d \mathbf{x}} & \cdots & \frac{d f_m}{d \mathbf{x}}
\end{array}\right]
\frac{\partial \mathbf{f}}{\partial \mathbf{x}}=\left[\begin{array}{cccc}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_2}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_1} \\
\frac{\partial f_1}{\partial x_2} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_1}{\partial x_n} & \frac{\partial f_2}{\partial x_n} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{array}\right]
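As a sanity check, here is a minimal numerical sketch (the map `f`, the step size `h`, and the test point are illustrative assumptions, not from the original notes): approximate the numerator-layout Jacobian by finite differences, note that the denominator-layout derivative is simply its transpose, and verify the linearization $\Delta \mathbf{f} \approx \frac{d \mathbf{f}}{d \mathbf{x}} \Delta \mathbf{x}$.

```python
import numpy as np

# Illustrative map f: R^3 -> R^2 (an assumed example)
def f(x):
    return np.array([x[0] * x[1], np.sin(x[2]) + x[0] ** 2])

def numerator_layout_jacobian(f, x, h=1e-6):
    """Approximate the m x n matrix [df_i/dx_j] by forward differences."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = h
        J[:, j] = (f(x + e) - fx) / h
    return J

x = np.array([1.0, 2.0, 0.5])
J = numerator_layout_jacobian(f, x)   # m x n, numerator layout
J_denom = J.T                         # n x m, denominator layout (the transpose)

# Linearization check: f(x + dx) - f(x) ~ J @ dx for a small dx
dx = 1e-4 * np.array([0.3, -0.2, 0.1])
print(np.allclose(f(x + dx) - f(x), J @ dx, atol=1e-6))  # True
```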
There is no single correct layout, since internally consistent versions of the usual differentiation rules (the constant product, addition, product, and chain rules) can be derived for either one.
Differential Product Rule
Let $A, B$ be two matrices. Then, we have the differential product rule for $A B$ :
d(A B)=(d A) B+A(d B) .
We think of the differential $dA$ as a small (unconstrained) change in the matrix $A$. The rule can be seen by expanding the change in the product and dropping the second-order term:
d(A B)=(A+d A)(B+d B)-A B=(d A) B+A(d B)+(d A)(d B) \approx(d A) B+A(d B) .
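A quick numerical sanity check of this rule (a minimal sketch; the matrix sizes, random seed, and perturbation scale `eps` are illustrative choices): for small $dA$ and $dB$, the exact change in $AB$ agrees with $(dA)B + A(dB)$ up to the second-order term.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

eps = 1e-6
dA = eps * rng.standard_normal(A.shape)  # small unconstrained change in A
dB = eps * rng.standard_normal(B.shape)  # small unconstrained change in B

exact_change = (A + dA) @ (B + dB) - A @ B
linear_part = dA @ B + A @ dB            # differential product rule

# The difference is the dropped second-order term (dA)(dB), of size O(eps^2)
print(np.max(np.abs(exact_change - linear_part)))
```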
From the perspective of linear algebra, given a function $f$, we consider the differential of $f$ to be the linear operator such that
d f=f(x+d x)-f(x)=f^{\prime}(x)[d x] .
where $dx$ represents an arbitrarily small change in $x$.
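As a one-dimensional illustration (not from the original): if $f(x)=x^2$, then $f(x+dx)-f(x)=2x\,dx+(dx)^2$; the part linear in $dx$ is $2x\,dx$, so $f^{\prime}(x)[dx]=2x\,dx$ and we recover the familiar $f^{\prime}(x)=2x$.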
Recall that a linear operator is a map $L$ that takes a vector $v$ in a vector space $V$ to a vector $L[v]$ (sometimes denoted simply $L v$) in some other vector space. Specifically, $L$ is linear if
L\left[v_1+v_2\right]=L v_1+L v_2 \text { and } L[\alpha v]=\alpha L[v]
for scalars $\alpha \in \mathbb{R}$.
If $f: \mathbb{R}^n \rightarrow \mathbb{R}$, then
d f=f(\mathbf{x}+d \mathbf{x})-f(\mathbf{x})=f^{\prime}(\mathbf{x}) d \mathbf{x}=\text { scalar. }
The linear operator $f^{\prime}(\mathbf{x})$ that produces a scalar $d f$ must be a row vector (a "1-row matrix", or more formally something called a covector or "dual" vector or "linear form")! We call this row vector the transpose of the gradient $(\nabla f)^T$, so that $d f$ is the dot product of $d x$ with the gradient. So we have that
d f=\nabla f \cdot d \mathbf{x}=\underbrace{(\nabla f)^T}_{f^{\prime}(\mathbf{x})} d \mathbf{x} .
\nabla f=\left[\begin{array}{c}
\frac{\partial f}{\partial x_1} \\
\frac{\partial f}{\partial x_2} \\
\vdots \\
\frac{\partial f}{\partial x_n}
\end{array}\right]
or, equivalently,
d f=f(\mathbf{x}+d \mathbf{x})-f(\mathbf{x})=\nabla f \cdot d \mathbf{x}=\frac{\partial f}{\partial x_1} d x_1+\frac{\partial f}{\partial x_2} d x_2+\cdots+\frac{\partial f}{\partial x_n} d x_n .
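A minimal numerical sketch of this identity (the scalar function `g`, the step size, and the test point are illustrative assumptions): the change in $f$ is well approximated by the dot product of the gradient with $d\mathbf{x}$.

```python
import numpy as np

# Illustrative scalar function g: R^3 -> R (an assumed example)
def g(x):
    return x[0] ** 2 + np.sin(x[1]) * x[2]

def gradient(g, x, h=1e-6):
    """Approximate the vector of partial derivatives dg/dx_i by forward differences."""
    gx = g(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (g(x + e) - gx) / h
    return grad

x = np.array([1.0, 0.5, -2.0])
dx = 1e-4 * np.array([0.2, -0.1, 0.3])

df_exact = g(x + dx) - g(x)
df_linear = gradient(g, x) @ dx  # the dot product (∇g)·dx
print(df_exact, df_linear)       # nearly equal; the gap is O(|dx|^2)
```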
If $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$, then
\underbrace{d f}_{m \text { components }}=\underbrace{f^{\prime}(\mathbf{x})}_{m \times n} \underbrace{d \mathbf{x}}_{n \text { components }},
so $f^{\prime}(\mathbf{x})$ must be an $m \times n$ matrix, called the Jacobian of $f$.
The Jacobian matrix $J$ represents the linear operator that takes $d\mathbf{x}$ to $df$:
d f=J \, d \mathbf{x} .
Vector and matrix
vector: $\mathbf{x} \in \mathbb{R}^d$
matrix: $A \in \mathbb{R}^{m\times n}$
Function
affine (linear) function $f(\mathbf{x})=A \mathbf{x}+\mathbf{b}$, with $\mathbf{x} \in \mathbb{R}^d$, $A \in \mathbb{R}^{n \times d}$, $\mathbf{b} \in \mathbb{R}^n$.
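For this affine map the differential can be computed directly (a standard fact, shown here for illustration):
d \mathbf{f}=f(\mathbf{x}+d \mathbf{x})-f(\mathbf{x})=A(\mathbf{x}+d \mathbf{x})+\mathbf{b}-(A \mathbf{x}+\mathbf{b})=A \, d \mathbf{x},
so $f^{\prime}(\mathbf{x})=A$ at every point: the Jacobian of an affine map is the constant matrix $A$.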
The derivative of a function of one variable is itself a function of one variable; roughly speaking, it is defined as the linearization of the function at each point.
$dx$ and $dy$ are called infinitesimals.
In the multivariable case, what $h \rightarrow 0$ means is less clear, as there are many directions in which one could approach a point in $\mathbb{R}^n$.
Given a vector $\mathbf{d}$ with the same dimension as $\mathbf{x}$, we could consider the limit
\lim _{h \rightarrow 0} \frac{f(\mathbf{x}+h \mathbf{d})-f(\mathbf{x})}{h},
which may be thought of as a function of both $\mathbf{x}$ and $\mathbf{d}$. If we want a definition for the multidimensional derivative $\frac{d f}{d \mathbf{x}}$ at a given point $\mathbf{x}$, it should not depend on $\mathbf{d}$.
It turns out that, assuming the function $f$ is differentiable, there exists a vector $\nabla f(\mathbf{x})$ such that the limit above equals $\nabla f(\mathbf{x}) \cdot \mathbf{d}$ for all $\mathbf{d} \in \mathbb{R}^n$, allowing us to separate the direction $\mathbf{d}$ from the actual multidimensional derivative.
In particular, the expression for this $\nabla f(\mathbf{x})$ that satisfies the above property is
\nabla f(\mathbf{x})=\left[\begin{array}{llll}
\frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n}
\end{array}\right] .
This gets the name "gradient" as it represents the set of slopes around a point as one moves one unit in each dimension parallel to the $n$ axes.
The gradient vector represents the derivative for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. If the function is differentiable, the gradient is equal to the $1 \times n$ vector where the $i$ th entry is $\left[\frac{\partial f}{\partial \mathbf{x}}\right]_i=\frac{\partial f}{\partial x_i}$.
Next, we turn to functions $\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m$ where both the input and output are vectors. We treat the gradient vectors for each entry separately.
As we have defined the gradient of a scalar-valued function as a row vector, for a function with vector output we can stack these $m$ row vectors on top of one another to get an $m \times n$ matrix.
This matrix is called the Jacobian.
This definition allows us to extend the limit definition of a multivariable derivative to the Jacobian, as it only involves stacking gradients:
\frac{\partial \mathbf{f}}{\partial \mathbf{x}}=\left[\begin{array}{c}
\nabla f_1(\mathbf{x}) \\
\nabla f_2(\mathbf{x}) \\
\vdots \\
\nabla f_m(\mathbf{x})
\end{array}\right] .
Example:
We can decompose a function with multidimensional output into its scalar components:
\mathbf{f}(\mathbf{x})=\left[\begin{array}{c}
f_1(\mathbf{x}) \\
f_2(\mathbf{x}) \\
\vdots \\
f_m(\mathbf{x})
\end{array}\right],
where each $f_i: \mathbb{R}^n \rightarrow \mathbb{R}$ contributes one row (its gradient) to the Jacobian.
The Jacobian matrix represents the derivative for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$. It is defined as the $m \times n$ matrix whose entry in the $i$th row and $j$th column is $\left[\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right]_{ij}=\frac{\partial f_i}{\partial x_j}$.
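For concreteness, here is a small worked example (an illustrative choice, not from the original): take $\mathbf{f}: \mathbb{R}^2 \rightarrow \mathbb{R}^2$ with $f_1(\mathbf{x})=x_1 x_2$ and $f_2(\mathbf{x})=x_1+\sin x_2$. Then
\frac{\partial \mathbf{f}}{\partial \mathbf{x}}=\left[\begin{array}{cc}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2}
\end{array}\right]=\left[\begin{array}{cc}
x_2 & x_1 \\
1 & \cos x_2
\end{array}\right] .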
If we consider the gradient to be a function itself, $\nabla f: \mathbb{R}^n \rightarrow \mathbb{R}^n$, we can transpose it into a column vector and take the Jacobian of that transpose; transposing the result again gives the Hessian $H_f$:
H_f=\left[\frac{\partial^2 f}{\partial x_i \partial x_j}\right]_{1 \leq i, j \leq n} .
The Hessian matrix, denoted $H_f$, represents the second derivative for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. It is defined as the $n \times n$ matrix whose entry in the $i$th row and $j$th column is $\left[\frac{\partial^2 f}{\partial \mathbf{x} \partial \mathbf{x}^{\top}}\right]_{i j}=\frac{\partial^2 f}{\partial x_i \partial x_j}$.
The Hessian matrix is symmetric because $\frac{\partial^2 f}{\partial x_i \partial x_j}=\frac{\partial^2 f}{\partial x_j \partial x_i}$ whenever the second partial derivatives are continuous (Schwarz's theorem), a condition satisfied by most functions used in statistics.
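A short numerical sketch of these properties (the test function `g`, the step size `h`, and the test point are illustrative assumptions): approximating the Hessian by central differences recovers a (numerically) symmetric matrix of second partials.

```python
import numpy as np

# Illustrative scalar function g: R^2 -> R (an assumed example)
def g(x):
    return x[0] ** 2 * x[1] + np.cos(x[1])

def hessian(g, x, h=1e-4):
    """Approximate the n x n matrix of second partials d^2 g / (dx_i dx_j) by central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            H[i, j] = (g(x + ei + ej) - g(x + ei - ej)
                       - g(x - ei + ej) + g(x - ei - ej)) / (4 * h * h)
    return H

x = np.array([1.0, 0.5])
H = hessian(g, x)
print(H)                               # approx [[2*x2, 2*x1], [2*x1, -cos(x2)]]
print(np.allclose(H, H.T, atol=1e-5))  # symmetric: H[i, j] == H[j, i]
```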