Think about matrices holistically, not just as tables of numbers.
Differentials as linearization
Consider $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$. Generalizing from the limit definition of a derivative, we could write the linear approximation form
\Delta \mathbf{f}=\left(\frac{d \mathbf{f}}{d \mathbf{x}}\right) \cdot \Delta \mathbf{x}
As $\Delta \mathbf{f}$ is an $m \times 1$ vector and $\Delta \mathbf{x}$ is an $n \times 1$ vector, $\frac{d \mathbf{f}}{d \mathbf{x}}$ in this expression must be an $m \times n$ matrix.
However, we might instead want $\Delta \mathbf{f}$ to come from a dot-product-style multiplication of $\frac{d \mathbf{f}}{d \mathbf{x}}$ with $\Delta \mathbf{x}$, i.e. with the derivative transposed, since a dot product is typically how vectors are multiplied:
\Delta \mathbf{f}=\left(\frac{d \mathbf{f}}{d \mathbf{x}}\right)^T \cdot \Delta \mathbf{x}
As $\left(\frac{d \mathbf{f}}{d \mathbf{x}}\right)^T$ has dimension $m \times n,\left(\frac{d \mathbf{f}}{d \mathbf{x}}\right)$ has dimension $n \times m$.
The fact that two different equations can represent the same concept is the basis for the two layout conventions in multivariable differentiation.
Numerator layout:
\frac{d \mathbf{f}}{d \mathbf{x}}=\left[\begin{array}{llll}
\frac{\partial \mathbf{f}}{\partial x_1} & \frac{\partial \mathbf{f}}{\partial x_2} & \cdots & \frac{\partial \mathbf{f}}{\partial x_n}
\end{array}\right] .
\frac{d \mathbf{f}}{d \mathbf{x}}=\left[\begin{array}{cccc}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{array}\right]=\left[\frac{\partial f_i}{\partial x_j}\right]_{1 \leq i \leq m, 1 \leq j \leq n} .
Denominator layout:
\frac{d \mathbf{f}}{d \mathbf{x}}=\left[\begin{array}{llll}
\frac{d f_1}{d \mathbf{x}} & \frac{d f_2}{d \mathbf{x}} & \cdots & \frac{d f_m}{d \mathbf{x}}
\end{array}\right]
\frac{\partial \mathbf{f}}{\partial \mathbf{x}}=\left[\begin{array}{cccc}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_2}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_1} \\
\frac{\partial f_1}{\partial x_2} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_1}{\partial x_n} & \frac{\partial f_2}{\partial x_n} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{array}\right]
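As a sanity check, here is a minimal numerical sketch (the map `f`, the step size `h`, and the test point are illustrative assumptions, not from the original notes): approximate the numerator-layout Jacobian by finite differences, note that the denominator-layout derivative is simply its transpose, and verify the linearization $\Delta \mathbf{f} \approx \frac{d \mathbf{f}}{d \mathbf{x}} \Delta \mathbf{x}$.

```python
import numpy as np

# Illustrative map f: R^3 -> R^2 (an assumed example)
def f(x):
    return np.array([x[0] * x[1], np.sin(x[2]) + x[0] ** 2])

def numerator_layout_jacobian(f, x, h=1e-6):
    """Approximate the m x n matrix [df_i/dx_j] by forward differences."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = h
        J[:, j] = (f(x + e) - fx) / h
    return J

x = np.array([1.0, 2.0, 0.5])
J = numerator_layout_jacobian(f, x)   # m x n, numerator layout
J_denom = J.T                         # n x m, denominator layout (the transpose)

# Linearization check: f(x + dx) - f(x) ~ J @ dx for a small dx
dx = 1e-4 * np.array([0.3, -0.2, 0.1])
print(np.allclose(f(x + dx) - f(x), J @ dx, atol=1e-6))  # True
```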
There is no single correct layout, since internally consistent versions of the usual differentiation rules (the constant product, addition, product, and chain rules) can be derived for either one.
Differential Product Rule
Let $A, B$ be two matrices. Then, we have the differential product rule for $A B$ :
d(A B)=(d A) B+A(d B) .
We think of the differential $dA$ as a small (unconstrained) change in the matrix $A$. The rule can be seen by expanding the change in the product and dropping the second-order term:
d(A B)=(A+d A)(B+d B)-A B=(d A) B+A(d B)+(d A)(d B) \approx(d A) B+A(d B) .
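A quick numerical sanity check of this rule (a minimal sketch; the matrix sizes, random seed, and perturbation scale `eps` are illustrative choices): for small $dA$ and $dB$, the exact change in $AB$ agrees with $(dA)B + A(dB)$ up to the second-order term.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

eps = 1e-6
dA = eps * rng.standard_normal(A.shape)  # small unconstrained change in A
dB = eps * rng.standard_normal(B.shape)  # small unconstrained change in B

exact_change = (A + dA) @ (B + dB) - A @ B
linear_part = dA @ B + A @ dB            # differential product rule

# The difference is the dropped second-order term (dA)(dB), of size O(eps^2)
print(np.max(np.abs(exact_change - linear_part)))
```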
From the perspective of linear algebra, given a function $f$, we consider the differential of $f$ to be the linear operator such that
d f=f(x+d x)-f(x)=f^{\prime}(x)[d x] .
where $dx$ represents an arbitrarily small change in $x$.
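As a one-dimensional illustration (not from the original): if $f(x)=x^2$, then $f(x+dx)-f(x)=2x\,dx+(dx)^2$; the part linear in $dx$ is $2x\,dx$, so $f^{\prime}(x)[dx]=2x\,dx$ and we recover the familiar $f^{\prime}(x)=2x$.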
Recall that a linear operator is a map $L$ that takes a vector $v$ in a vector space $V$ to a vector $L[v]$ (sometimes denoted simply $L v$) in some other vector space. Specifically, $L$ is linear if
L\left[v_1+v_2\right]=L v_1+L v_2 \text { and } L[\alpha v]=\alpha L[v]
for scalars $\alpha \in \mathbb{R}$.
If $f: \mathbb{R}^n \rightarrow \mathbb{R}$, then
d f=f(\mathbf{x}+d \mathbf{x})-f(\mathbf{x})=f^{\prime}(\mathbf{x}) d \mathbf{x}=\text { scalar. }
The linear operator $f^{\prime}(\mathbf{x})$ that produces a scalar $d f$ must be a row vector (a "1-row matrix", or more formally something called a covector or "dual" vector or "linear form")! We call this row vector the transpose of the gradient $(\nabla f)^T$, so that $d f$ is the dot product of $d x$ with the gradient. So we have that
d f=\nabla f \cdot d \mathbf{x}=\underbrace{(\nabla f)^T}_{f^{\prime}(\mathbf{x})} d \mathbf{x} .
\nabla f=\left[\begin{array}{c}
\frac{\partial f}{\partial x_1} \\
\frac{\partial f}{\partial x_2} \\
\vdots \\
\frac{\partial f}{\partial x_n}
\end{array}\right]
or, equivalently,
d f=f(\mathbf{x}+d \mathbf{x})-f(\mathbf{x})=\nabla f \cdot d \mathbf{x}=\frac{\partial f}{\partial x_1} d x_1+\frac{\partial f}{\partial x_2} d x_2+\cdots+\frac{\partial f}{\partial x_n} d x_n .
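A minimal numerical sketch of this identity (the scalar function `g`, the step size, and the test point are illustrative assumptions): the change in $f$ is well approximated by the dot product of the gradient with $d\mathbf{x}$.

```python
import numpy as np

# Illustrative scalar function g: R^3 -> R (an assumed example)
def g(x):
    return x[0] ** 2 + np.sin(x[1]) * x[2]

def gradient(g, x, h=1e-6):
    """Approximate the vector of partial derivatives dg/dx_i by forward differences."""
    gx = g(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (g(x + e) - gx) / h
    return grad

x = np.array([1.0, 0.5, -2.0])
dx = 1e-4 * np.array([0.2, -0.1, 0.3])

df_exact = g(x + dx) - g(x)
df_linear = gradient(g, x) @ dx  # the dot product (∇g)·dx
print(df_exact, df_linear)       # nearly equal; the gap is O(|dx|^2)
```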
If $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$, then
\underbrace{d f}_{m \text { components }}=\underbrace{f^{\prime}(\mathbf{x})}_{m \times n} \underbrace{d \mathbf{x}}_{n \text { components }},
so $f^{\prime}(\mathbf{x})$ must be an $m \times n$ matrix, called the Jacobian of $f$.
The Jacobian matrix $J$ represents the linear operator that takes $d\mathbf{x}$ to $df$:
d f=J \, d \mathbf{x} .
Vector and matrix
vector: $\mathbf{x} \in \mathbb{R}^d$
matrix: $A \in \mathbb{R}^{m\times n}$
Function
affine (linear) function $f(\mathbf{x})=A \mathbf{x}+\mathbf{b}$, with $\mathbf{x} \in \mathbb{R}^d$, $A \in \mathbb{R}^{n \times d}$, $\mathbf{b} \in \mathbb{R}^n$.
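For this affine map the differential can be computed directly (a standard fact, shown here for illustration):
d \mathbf{f}=f(\mathbf{x}+d \mathbf{x})-f(\mathbf{x})=A(\mathbf{x}+d \mathbf{x})+\mathbf{b}-(A \mathbf{x}+\mathbf{b})=A \, d \mathbf{x},
so $f^{\prime}(\mathbf{x})=A$ at every point: the Jacobian of an affine map is the constant matrix $A$.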
The derivative of a function of one variable is itself a function of one variable; roughly speaking, it is defined as the linearization of the function at each point.
$dx$ and $dy$ are called infinitesimals.
In the multivariable case, what $h \rightarrow 0$ means is less clear, as there are many directions in which one could approach a point in $\mathbb{R}^n$.
Given a vector $\mathbf{d}$ with the same dimension as $\mathbf{x}$, we could consider the limit
\lim _{h \rightarrow 0} \frac{f(\mathbf{x}+h \mathbf{d})-f(\mathbf{x})}{h},
which may be thought of as a function of both $\mathbf{x}$ and $\mathbf{d}$. If we want a definition for the multidimensional derivative $\frac{d f}{d \mathbf{x}}$ at a given point $\mathbf{x}$, it should not depend on $\mathbf{d}$.
It turns out that, assuming the function $f$ is differentiable, there exists a vector $\nabla f(\mathbf{x})$ such that the limit above equals $\nabla f(\mathbf{x}) \cdot \mathbf{d}$ for all $\mathbf{d} \in \mathbb{R}^n$, allowing us to separate the direction $\mathbf{d}$ from the actual multidimensional derivative.
In particular, the expression for this $\nabla f(\mathbf{x})$ that satisfies the above property is
\nabla f(\mathbf{x})=\left[\begin{array}{llll}
\frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n}
\end{array}\right] .
This gets the name "gradient" as it represents the set of slopes around a point as one moves one unit in each dimension parallel to the $n$ axes.
The gradient vector represents the derivative for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. If the function is differentiable, the gradient is equal to the $1 \times n$ vector where the $i$ th entry is $\left[\frac{\partial f}{\partial \mathbf{x}}\right]_i=\frac{\partial f}{\partial x_i}$.
Next, we turn to functions $\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m$ where both the input and output are vectors. We treat the gradient vectors for each entry separately.
As we have defined the gradient of a scalar-valued function as a row vector, for a function with vector output we can stack these $m$ row vectors on top of one another to get an $m \times n$ matrix.
This matrix is called the Jacobian.
This definition allows us to extend the limit definition of a multivariable derivative to the Jacobian, as it only involves stacking gradients:
\frac{\partial \mathbf{f}}{\partial \mathbf{x}}=\left[\begin{array}{c}
\nabla f_1(\mathbf{x}) \\
\nabla f_2(\mathbf{x}) \\
\vdots \\
\nabla f_m(\mathbf{x})
\end{array}\right] .
Example:
We can decompose a function with multidimensional output into its scalar components:
\mathbf{f}(\mathbf{x})=\left[\begin{array}{c}
f_1(\mathbf{x}) \\
f_2(\mathbf{x}) \\
\vdots \\
f_m(\mathbf{x})
\end{array}\right],
where each $f_i: \mathbb{R}^n \rightarrow \mathbb{R}$ contributes one row (its gradient) to the Jacobian.
The Jacobian matrix represents the derivative for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$. It is defined as the $m \times n$ matrix whose entry in the $i$th row and $j$th column is $\left[\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right]_{ij}=\frac{\partial f_i}{\partial x_j}$.
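For concreteness, here is a small worked example (an illustrative choice, not from the original): take $\mathbf{f}: \mathbb{R}^2 \rightarrow \mathbb{R}^2$ with $f_1(\mathbf{x})=x_1 x_2$ and $f_2(\mathbf{x})=x_1+\sin x_2$. Then
\frac{\partial \mathbf{f}}{\partial \mathbf{x}}=\left[\begin{array}{cc}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2}
\end{array}\right]=\left[\begin{array}{cc}
x_2 & x_1 \\
1 & \cos x_2
\end{array}\right] .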
If we consider the gradient to be a function itself, $\nabla f: \mathbb{R}^n \rightarrow \mathbb{R}^n$, we can transpose it into a column vector and take the Jacobian of that transpose; transposing the result again gives the Hessian $H_f$:
H_f=\left[\frac{\partial^2 f}{\partial x_i \partial x_j}\right]_{1 \leq i, j \leq n} .
The Hessian matrix, denoted $H_f$, represents the second derivative for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. It is defined as the $n \times n$ matrix whose entry in the $i$th row and $j$th column is $\left[\frac{\partial^2 f}{\partial \mathbf{x} \partial \mathbf{x}^{\top}}\right]_{i j}=\frac{\partial^2 f}{\partial x_i \partial x_j}$.
The Hessian matrix is symmetric because $\frac{\partial^2 f}{\partial x_i \partial x_j}=\frac{\partial^2 f}{\partial x_j \partial x_i}$ whenever the second partial derivatives are continuous (Schwarz's theorem), a condition satisfied by most functions used in statistics.
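A short numerical sketch of these properties (the test function `g`, the step size `h`, and the test point are illustrative assumptions): approximating the Hessian by central differences recovers a (numerically) symmetric matrix of second partials.

```python
import numpy as np

# Illustrative scalar function g: R^2 -> R (an assumed example)
def g(x):
    return x[0] ** 2 * x[1] + np.cos(x[1])

def hessian(g, x, h=1e-4):
    """Approximate the n x n matrix of second partials d^2 g / (dx_i dx_j) by central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            H[i, j] = (g(x + ei + ej) - g(x + ei - ej)
                       - g(x - ei + ej) + g(x - ei - ej)) / (4 * h * h)
    return H

x = np.array([1.0, 0.5])
H = hessian(g, x)
print(H)                               # approx [[2*x2, 2*x1], [2*x1, -cos(x2)]]
print(np.allclose(H, H.T, atol=1e-5))  # symmetric: H[i, j] == H[j, i]
```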