ChufanSuki / read-paper-and-code


Functions and derivative #141

Closed ChufanSuki closed 1 month ago

ChufanSuki commented 1 month ago

Vector and matrix

vector: $\mathbf{x} \in \mathbb{R}^d$

matrix: $A \in \mathbb{R}^{m\times n}$

\langle \mathbf{x},\mathbf{y} \rangle = \mathbf{x}^T \mathbf{y} = \mathbf{y}^T \mathbf{x}
A, B \in \mathbb{R}^{m\times n}, \quad C=A \cdot B^T \in \mathbb{R}^{m\times m}
\langle A, B \rangle = \sum_{i,j} A_{ij}B_{ij} = \sum_{i=1}^{m} C_{ii}=tr(C)=tr(A \cdot B^T) = tr(B^T \cdot A)
\langle \mathbf{x},\mathbf{y} \rangle=tr(\mathbf{x}^T \mathbf{y})= \mathbf{x}^T \mathbf{y} = tr(\mathbf{y} \mathbf{x}^T)
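As a quick sanity check of these identities, here is a minimal NumPy sketch (the shapes and random values are illustrative, not from the text):

```python
import numpy as np

# Check <A, B> = sum_ij A_ij B_ij = tr(A B^T) = tr(B^T A)
# and <x, y> = x^T y = tr(y x^T).  Shapes are arbitrary illustrative choices.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
x, y = rng.normal(size=5), rng.normal(size=5)

inner = np.sum(A * B)                          # elementwise product, then sum
assert np.isclose(inner, np.trace(A @ B.T))
assert np.isclose(inner, np.trace(B.T @ A))
assert np.isclose(x @ y, np.trace(np.outer(y, x)))
```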

Function

linear (affine) function $f(\mathbf{x})=A\mathbf{x}+\mathbf{b}$, $\mathbf{x} \in \mathbb{R}^d$, $A \in \mathbb{R}^{n \times d}$, $\mathbf{b} \in \mathbb{R}^{n}$.

The derivative of a function of one variable is itself a function of one variable; roughly, it is defined as the linearization of the function.

f: \mathbb{R} \rightarrow \mathbb{R}, \quad f'(x)=\lim_{h\rightarrow 0} \frac{f(x+h)-f(x)}{h}
f(x+h)=f(x)+f'(x)h+o(\vert h\vert)
df=f'(x)dx

$dx$ and $df$ are called infinitesimals.
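A minimal numerical sketch of the linearization claim, assuming NumPy and an illustrative choice $f(x)=\sin x$ at $x=1$: the remainder $f(x+h)-f(x)-f'(x)h$ shrinks faster than $h$.

```python
import numpy as np

# The remainder of the linear approximation is o(|h|): the ratio
# (f(x + h) - f(x) - f'(x) h) / h goes to 0 as h -> 0.
f, fprime, x = np.sin, np.cos, 1.0
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    remainder = f(x + h) - f(x) - fprime(x) * h
    print(h, remainder / h)   # ratio shrinks roughly linearly in h
```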

In the multivariable case, what $h \rightarrow 0$ means is less clear, as there are many directions in which one could approach a point in $\mathbb{R}^n$.

Given a vector $\mathbf{d}$ with the same dimension as $\mathbf{x}$, we could consider the limit

\nabla f(\mathbf{x})[\mathbf{d}]:=\lim _{t \rightarrow 0} \frac{f(\mathbf{x}+t \mathbf{d})-f(\mathbf{x})}{t},

which may be thought of as a function of both $\mathbf{x}$ and $\mathbf{d}$. If we want a definition for the multidimensional derivative $\frac{d f}{d \mathbf{x}}$ at a given point $\mathbf{x}$, it should not depend on $\mathbf{d}$.

It turns out, assuming that the function $f$ is differentiable, that there exists a vector $\nabla f$ such that $\nabla f(\mathbf{x})[\mathbf{d}]=\nabla f(\mathbf{x}) \cdot \mathbf{d}$ for all $\mathbf{d} \in \mathbb{R}^n$, allowing us to separate the direction $\mathbf{d}$ from the actual multidimensional derivative.

In particular, the expression for this $\nabla f(\mathbf{x})$ that satisfies the above property is

\nabla f(\mathbf{x})=\left[\begin{array}{llll}
\frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n}
\end{array}\right].

This gets the name "gradient" as it represents the set of slopes around a point as one moves one unit in each dimension parallel to the $n$ axes.


The gradient vector represents the derivative for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. If the function is differentiable, the gradient is equal to the $1 \times n$ vector whose $i$th entry is $\left[\frac{\partial f}{\partial \mathbf{x}}\right]_i=\frac{\partial f}{\partial x_i}$.
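A small NumPy check that the directional-derivative limit equals $\nabla f(\mathbf{x}) \cdot \mathbf{d}$; the function $f(\mathbf{x})=x_1^2+3x_1x_2$ and the vectors below are illustrative choices.

```python
import numpy as np

# f(x) = x_1^2 + 3 x_1 x_2 has gradient [2 x_1 + 3 x_2, 3 x_1].
def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1]

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, 2.0])
d = np.array([0.5, -1.0])
t = 1e-6
directional = (f(x + t * d) - f(x)) / t   # limit definition, small t
assert np.isclose(directional, grad_f(x) @ d, atol=1e-4)
```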

Next, we turn to functions $\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m$ where both the input and output are vectors. We treat the gradient vectors for each entry separately.

As we have defined the gradient of a scalar-valued function as a row vector, for a function with vector output we can stack these $m$ row vectors on top of one another to get an $m \times n$ matrix.

This matrix is called the Jacobian.

\left[\begin{array}{c}
\frac{d f_1}{d \mathbf{x}} \\
\frac{d f_2}{d \mathbf{x}} \\
\vdots \\
\frac{d f_m}{d \mathbf{x}}
\end{array}\right]=\left[\begin{array}{c}
\nabla f_1(\mathbf{x}) \\
\nabla f_2(\mathbf{x}) \\
\vdots \\
\nabla f_m(\mathbf{x})
\end{array}\right]=\left[\begin{array}{cccc}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{array}\right] .

This definition allows us to extend the limit definition of a multivariable derivative to the Jacobian, as it only involves stacking gradients:

\lim _{t \rightarrow 0} \frac{\mathbf{f}(\mathbf{x}+t \mathbf{d})-\mathbf{f}(\mathbf{x})}{t}=\nabla \mathbf{f}(\mathbf{x})[\mathbf{d}]=\left[\begin{array}{c}
\nabla f_1(\mathbf{x}) \\
\nabla f_2(\mathbf{x}) \\
\vdots \\
\nabla f_m(\mathbf{x})
\end{array}\right] \cdot \mathbf{d}=J_f \cdot \mathbf{d}

Example:

We can decompose a function with multidimensional output into its scalar component functions. For the affine map $\mathbf{f}(\mathbf{x})=A\mathbf{x}+\mathbf{b}$:

f_i(\mathbf{x})=\sum_{j=1}^{d} A_{ij}x_j + b_i
\frac{\partial f_i}{\partial x_j}=\lim_{h \rightarrow 0} \frac{f_i(\mathbf{x}+h\mathbf{e}_j)-f_i(\mathbf{x})}{h}=\lim_{h \rightarrow 0} \frac{A_{ij}h}{h}=A_{ij}
\frac{\partial \mathbf{f}}{\partial \mathbf{x}}=\left[\frac{\partial f_i}{\partial x_j}\right]_{ij}=A

The Jacobian matrix represents the derivative for a function $\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m$. It is defined as the $m \times n$ matrix where the term at the $i$th row and $j$th column is $\left[\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right]_{ij}=\frac{\partial f_i}{\partial x_j}$.
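A finite-difference check of the example above: for $f(\mathbf{x})=A\mathbf{x}+\mathbf{b}$ the Jacobian is $A$. This is a sketch assuming NumPy; the dimensions are illustrative.

```python
import numpy as np

# Finite-difference check that the Jacobian of f(x) = A x + b is A.
rng = np.random.default_rng(1)
A, b = rng.normal(size=(3, 4)), rng.normal(size=3)
f = lambda x: A @ x + b

x0, h = rng.normal(size=4), 1e-6
# j-th column approximates (f(x + h e_j) - f(x)) / h.
J = np.column_stack([(f(x0 + h * e) - f(x0)) / h for e in np.eye(4)])
assert np.allclose(J, A, atol=1e-4)
```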


If we consider the gradient to be a function itself, $\nabla f: \mathbb{R}^n \rightarrow \mathbb{R}^n$, we can transpose it into a column vector, take the Jacobian of that transpose, and then transpose again to obtain the Hessian $H_f$:

\mathbf{H}_f(\mathbf{x})=\mathbf{J}_f\left((\nabla f(\mathbf{x}))^T\right)^T
\mathbf{H}_f(\mathbf{x})=\left[\begin{array}{cccc}
\frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \ldots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \ldots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial x_2} & \ldots & \frac{\partial^2 f(x)}{\partial x_n^2}
\end{array}\right]=\left[\begin{array}{c}
\frac{\partial}{\partial \mathbf{x}}[\nabla f(\mathbf{x})]_1 \\
\frac{\partial}{\partial \mathbf{x}}[\nabla f(\mathbf{x})]_2 \\
\vdots \\
\frac{\partial}{\partial \mathbf{x}}[\nabla f(\mathbf{x})]_n
\end{array}\right]^T=\mathbf{J}_f\left((\nabla f(\mathbf{x}))^T\right)^T

The Hessian matrix, denoted $\mathbf{H}_f$, represents the second derivative for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. It is defined as the $n \times n$ matrix where the term at the $i$th row and $j$th column is $\left[\frac{\partial^2 f}{\partial \mathbf{x} \partial \mathbf{x}^{\top}}\right]_{ij}=\frac{\partial^2 f}{\partial x_i \partial x_j}$.


The Hessian matrix is symmetric because $\frac{\partial^2 f}{\partial x_i \partial x_j}=\frac{\partial^2 f}{\partial x_j \partial x_i}$, subject to certain analytic conditions (continuity of the second partial derivatives) that are satisfied by most functions used in statistics.
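A quick check of the symmetry claim, assuming NumPy and an illustrative smooth function $f(\mathbf{x})=x_1^2x_2+e^{x_1x_2}$: the Hessian built by finite-differencing the analytic gradient comes out symmetric.

```python
import numpy as np

# f(x) = x_1^2 x_2 + exp(x_1 x_2); its analytic gradient is coded below,
# and the Hessian is built by finite-differencing that gradient.
def grad(x):
    x1, x2 = x
    return np.array([2 * x1 * x2 + x2 * np.exp(x1 * x2),
                     x1 ** 2 + x1 * np.exp(x1 * x2)])

x0, h = np.array([0.3, -0.7]), 1e-6
H = np.column_stack([(grad(x0 + h * e) - grad(x0)) / h for e in np.eye(2)])
assert np.allclose(H, H.T, atol=1e-4)   # symmetric up to finite-difference error
```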

ChufanSuki commented 1 month ago

think about matrices holistically—not just as a table of numbers

differentials as linearization

Derivative

Consider $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$. Generalizing from the limit definition of a derivative, we could write the linear approximation form

\Delta \mathbf{f}=\left(\frac{d \mathbf{f}}{d \mathbf{x}}\right) \cdot \Delta \mathbf{x}

As $\Delta \mathbf{f}$ is an $m \times 1$ vector and $\Delta \mathbf{x}$ is an $n \times 1$ vector, a single expression for $\frac{d \mathbf{f}}{d \mathbf{x}}$ will be an $m \times n$ matrix.

However, we might instead want $\Delta \mathbf{f}$ to be computed as the dot product of $\frac{d \mathbf{f}}{d \mathbf{x}}$ with $\Delta \mathbf{x}$, as this is typically how vectors are multiplied.

\Delta \mathbf{f}=\left(\frac{d \mathbf{f}}{d \mathbf{x}}\right)^T \cdot \Delta \mathbf{x}

As $\left(\frac{d \mathbf{f}}{d \mathbf{x}}\right)^T$ has dimension $m \times n$, $\frac{d \mathbf{f}}{d \mathbf{x}}$ has dimension $n \times m$.

These two possible equations for the same concept are the basis for the two multivariable differentiation layouts.

Numerator layout:

\frac{d \mathbf{f}}{d \mathbf{x}}=\left[\begin{array}{llll}
\frac{\partial \mathbf{f}}{\partial x_1} & \frac{\partial \mathbf{f}}{\partial x_2} & \cdots & \frac{\partial \mathbf{f}}{\partial x_n}
\end{array}\right] .
\frac{d \mathbf{f}}{d \mathbf{x}}=\left[\begin{array}{cccc}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{array}\right]=\left[\frac{\partial f_i}{\partial x_j}\right]_{1 \leq i \leq m, 1 \leq j \leq n} .

Denominator layout:

\frac{d \mathbf{f}}{d \mathbf{x}}=\left[\begin{array}{llll}
\frac{d f_1}{d \mathbf{x}} & \frac{d f_2}{d \mathbf{x}} & \cdots & \frac{d f_m}{d \mathbf{x}}
\end{array}\right]
\frac{\partial \mathbf{f}}{\partial \mathbf{x}}=\left[\begin{array}{cccc}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_2}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_1} \\
\frac{\partial f_1}{\partial x_2} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_1}{\partial x_n} & \frac{\partial f_2}{\partial x_n} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{array}\right]

There is no one correct layout as it is possible to derive internally consistent composition rules for both layouts.
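To make the transpose relationship concrete, here is a small sketch (NumPy, with an illustrative map $\mathbf{f}: \mathbb{R}^2 \rightarrow \mathbb{R}^3$): the numerator-layout derivative is the $3 \times 2$ Jacobian, and the denominator-layout derivative is its $2 \times 3$ transpose.

```python
import numpy as np

# Illustrative map f(x) = [x_1 x_2, x_1 + sin(x_2), x_2^2] from R^2 to R^3.
def f(x):
    return np.array([x[0] * x[1], x[0] + np.sin(x[1]), x[1] ** 2])

x0, h = np.array([0.5, 1.5]), 1e-6
# Numerator layout: rows indexed by outputs, columns by inputs (the Jacobian).
numerator = np.column_stack([(f(x0 + h * e) - f(x0)) / h for e in np.eye(2)])
# Denominator layout: the transpose of the numerator-layout matrix.
denominator = numerator.T
print(numerator.shape, denominator.shape)   # (3, 2) (2, 3)
```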

Constant Product

  1. $\frac{d(a u)}{d \mathbf{X}}=a \frac{d u}{d \mathbf{X}}$ if $a$ is constant with respect to $\mathbf{X}$.
  2. $\frac{d \mathbf{A} \mathbf{u}}{d \mathbf{x}}=\mathbf{A} \frac{d \mathbf{u}}{d \mathbf{x}}$ if $\mathbf{A}$ is a constant w.r.t $\mathbf{x}$.
  3. $\frac{d \mathbf{A U B}}{d x}=\mathbf{A} \frac{d \mathbf{U}}{d x} \mathbf{B}$ if $\mathbf{A}, \mathbf{B}$ are constants with respect to $x$.

Addition Rule

  1. $\frac{d(u+v)}{d \mathbf{X}}=\frac{d u}{d \mathbf{X}}+\frac{d v}{d \mathbf{X}}$
  2. $\frac{d(\mathbf{u}+\mathbf{v})}{d \mathbf{x}}=\frac{d \mathbf{u}}{d \mathbf{x}}+\frac{d \mathbf{v}}{d \mathbf{x}}$
  3. $\frac{d(\mathbf{U}+\mathbf{V})}{d x}=\frac{d \mathbf{U}}{d x}+\frac{d \mathbf{V}}{d x}$

Product Rule

  1. $\frac{d u v}{d \mathbf{X}}=u \frac{d v}{d \mathbf{X}}+v \frac{d u}{d \mathbf{X}}$
  2. $\frac{d \mathbf{u}^T \mathbf{v}}{d \mathbf{x}}=\mathbf{u}^T \frac{d \mathbf{v}}{d \mathbf{x}}+\mathbf{v}^T \frac{d \mathbf{u}}{d \mathbf{x}}$ (see the numerical check after this list)
  3. $\frac{d \mathbf{U V}}{d x}=\mathbf{U} \frac{d \mathbf{V}}{d x}+\frac{d \mathbf{U}}{d x} \mathbf{V}$
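A numerical check of product rule 2, assuming NumPy; the maps $\mathbf{u}(\mathbf{x})=A\mathbf{x}$ and $\mathbf{v}(\mathbf{x})=\mathbf{x}\odot\mathbf{x}$ are illustrative choices with Jacobians $A$ and $\operatorname{diag}(2\mathbf{x})$.

```python
import numpy as np

# Product rule 2: d(u^T v)/dx = u^T dv/dx + v^T du/dx (numerator layout).
rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
u = lambda x: A @ x                  # du/dx = A
v = lambda x: x * x                  # dv/dx = diag(2x)
s = lambda x: u(x) @ v(x)            # scalar-valued u^T v

x0, h = rng.normal(size=3), 1e-6
numeric = np.array([(s(x0 + h * e) - s(x0)) / h for e in np.eye(3)])
analytic = u(x0) @ np.diag(2 * x0) + v(x0) @ A
assert np.allclose(numeric, analytic, atol=1e-4)
```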

Chain Rule

Numerator layout:

  1. $\frac{d \mathbf{y}}{d \mathbf{X}}=\frac{d \mathbf{y}}{d \mathbf{u}} \cdot \frac{d \mathbf{u}}{d \mathbf{X}}$
  2. $\frac{d \mathbf{y}}{d \mathbf{x}}=\frac{d \mathbf{y}}{d \mathbf{u}} \cdot \frac{d \mathbf{u}}{d \mathbf{x}}$ (a numerical check follows this list)
  3. $\frac{d \mathbf{Y}}{d \mathbf{x}}=\frac{d \mathbf{Y}}{d \mathbf{u}} \cdot \frac{d \mathbf{u}}{d \mathbf{x}}$
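A numerical check of the vector chain rule (item 2), assuming NumPy. The maps $\mathbf{h}: \mathbb{R}^2 \rightarrow \mathbb{R}^2$ and $\mathbf{g}: \mathbb{R}^2 \rightarrow \mathbb{R}^3$ below are illustrative; the Jacobian of the composition should equal $J_g(\mathbf{h}(\mathbf{x}))\,J_h(\mathbf{x})$.

```python
import numpy as np

# h(x) = [x_1 + x_2, x_1 x_2],  g(u) = [sin(u_1), u_1 u_2, u_2^2].
def h(x):
    return np.array([x[0] + x[1], x[0] * x[1]])

def Jh(x):
    return np.array([[1.0, 1.0],
                     [x[1], x[0]]])

def g(u):
    return np.array([np.sin(u[0]), u[0] * u[1], u[1] ** 2])

def Jg(u):
    return np.array([[np.cos(u[0]), 0.0],
                     [u[1], u[0]],
                     [0.0, 2 * u[1]]])

x0, eps = np.array([0.4, -0.9]), 1e-6
# Finite-difference Jacobian of the composition g(h(x)).
J_composed = np.column_stack(
    [(g(h(x0 + eps * e)) - g(h(x0))) / eps for e in np.eye(2)])
assert np.allclose(J_composed, Jg(h(x0)) @ Jh(x0), atol=1e-4)
```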

Differential Product Rule

Let $A, B$ be two matrices. Then, we have the differential product rule for $A B$ :

d(A B)=(d A) B+A(d B) .

By the differential of the matrix $A$, denoted $dA$, we mean a small (unconstrained) change in the matrix $A$.


By the product rule, we have

  1. $d\left(u^T v\right)=(d u)^T v+u^T(d v)=v^T d u+u^T d v$ since dot products commute (see the numerical check after this list).
  2. $d\left(u v^T\right)=(d u) v^T+u(d v)^T$.
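A numerical check of the first identity, assuming NumPy: for small perturbations $du, dv$, the change in $u^T v$ agrees with $(du)^T v + u^T dv$ up to second-order terms. All sizes and values are illustrative.

```python
import numpy as np

# d(u^T v) ≈ (du)^T v + u^T dv, up to the second-order term (du)^T dv.
rng = np.random.default_rng(3)
u, v = rng.normal(size=4), rng.normal(size=4)
du, dv = 1e-6 * rng.normal(size=4), 1e-6 * rng.normal(size=4)

exact = (u + du) @ (v + dv) - u @ v
linear = du @ v + u @ dv
assert np.isclose(exact, linear, atol=1e-10)
```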

Derivatives as Linear Operators

From the perspective of linear algebra, given a function $f$, we consider the differential of $f$ to be the linear operator such that

d f=f(x+d x)-f(x)=f^{\prime}(x)[d x] .

Here $dx$ represents an arbitrarily small change in $x$.

Recall that a linear operator is a map $L$ from a vector $v$ in vector space $V$ to a vector $L[v]$ (sometimes denoted simply $L v$ ) in some other vector space. Specifically, $L$ is linear if

L\left[v_1+v_2\right]=L v_1+L v_2 \text { and } L[\alpha v]=\alpha L[v]

for scalars $\alpha \in \mathbb{R}$.

If $f: \mathbb{R}^n \rightarrow \mathbb{R}$, then

d f=f(\mathbf{x}+d \mathbf{x})-f(\mathbf{x})=f^{\prime}(\mathbf{x}) d \mathbf{x}=\text { scalar. }

The linear operator $f^{\prime}(\mathbf{x})$ that produces a scalar $d f$ must be a row vector (a "1-row matrix", or more formally a covector, "dual" vector, or linear form)! We call this row vector the transpose of the gradient, $(\nabla f)^T$, so that $d f$ is the dot product of $d \mathbf{x}$ with the gradient. So we have that

d f=\nabla f \cdot d \mathbf{x}=\underbrace{(\nabla f)^T}_{f^{\prime}(\mathbf{x})} d \mathbf{x} .
\nabla f=\left[\begin{array}{c}
\frac{\partial f}{\partial x_1} \\
\frac{\partial f}{\partial x_2} \\
\vdots \\
\frac{\partial f}{\partial x_n}
\end{array}\right]

or, equivalently,

d f=f(\mathbf{x}+d \mathbf{x})-f(\mathbf{x})=\nabla f \cdot d \mathbf{x}=\frac{\partial f}{\partial x_1} d x_1+\frac{\partial f}{\partial x_2} d x_2+\cdots+\frac{\partial f}{\partial x_n} d x_n .

If $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$, then

\underbrace{d f}_{m \text { components }}=\underbrace{f^{\prime}(\mathbf{x})}_{m \times n} \underbrace{d \mathbf{x}}_{n \text { components }},

so $f^{\prime}(\mathbf{x})$ must be an $m \times n$ matrix, called the Jacobian of $f$.

The Jacobian matrix $J$ represents the linear operator that takes $d\mathbf{x}$ to $d\mathbf{f}$:

d\mathbf{f}=J \, d\mathbf{x}
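A closing sanity check, assuming NumPy and an illustrative map $f(\mathbf{x})=[x_1x_2,\ \sin x_1]$ with Jacobian $\begin{bmatrix} x_2 & x_1 \\ \cos x_1 & 0\end{bmatrix}$: for a small $d\mathbf{x}$, $f(\mathbf{x}+d\mathbf{x})-f(\mathbf{x})$ matches $J\,d\mathbf{x}$ to first order.

```python
import numpy as np

# f(x) = [x_1 x_2, sin(x_1)] with Jacobian [[x_2, x_1], [cos(x_1), 0]].
def f(x):
    return np.array([x[0] * x[1], np.sin(x[0])])

def J(x):
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0]])

x = np.array([0.8, -1.2])
dx = 1e-6 * np.array([1.0, 2.0])
# df computed directly vs. df = J dx; they agree up to second-order terms.
assert np.allclose(f(x + dx) - f(x), J(x) @ dx, atol=1e-10)
```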