chufanchen / read-paper-and-code


An Empirical Model of Large-Batch Training #71

Open chufanchen opened 7 months ago

chufanchen commented 7 months ago

https://arxiv.org/abs/1812.06162

chufanchen commented 7 months ago

Minimize $L(\theta)$:

$$ L(\theta)=\mathbb{E}_{x\sim \rho}[L_x(\theta)], G(\theta)=\nabla L(\theta) $$

Optimizers such as SGD and Adam estimate the gradient by averaging over a batch of samples drawn from $\rho(x)$:

$$ G_{est}(\theta)=\frac{1}{B}\sum_{i=1}^{B}\nabla_\theta L_{x_i}(\theta); \quad x_i \sim \rho $$

The gradient is now a random variable whose expected value is given by the true gradient:

$$ \mathbb{E}_{x_{1 \dots B} \sim \rho}[G_{est}(\theta)] = G(\theta) $$

Its variance scales inversely with the batch size $B$:

$$ \mathrm{cov}_{x_{1 \dots B} \sim \rho}(G_{est}(\theta)) =\frac{1}{B} \Sigma(\theta) $$

$$ \Sigma(\theta) \equiv \mathrm{cov}_{x\sim \rho} (\nabla_\theta L_x(\theta))=\mathbb{E}_{x\sim \rho}[(\nabla_\theta L_x(\theta))(\nabla_\theta L_x(\theta))^T] -G(\theta)G(\theta)^T $$

The minibatch gradient thus gives a noisy estimate of the true gradient, and larger batches give higher-quality estimates.
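
As a quick numerical check (a sketch of my own, not from the paper), the snippet below uses a toy least-squares loss on synthetic data and verifies that the covariance of the batch-mean gradient matches $\Sigma(\theta)/B$; every name and constant in it is illustrative.

```python
# Sketch: empirically check cov(G_est) ≈ Sigma / B for a toy least-squares loss
# L_x(theta) = 0.5 * (x^T theta - y)^2 on synthetic data (assumed, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 50_000                       # parameter dimension, dataset size
theta = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

# Per-sample gradients of 0.5*(x^T theta - y)^2: (x^T theta - y) * x, shape (N, d)
grads = (X @ theta - y)[:, None] * X
Sigma = np.cov(grads, rowvar=False)    # per-sample gradient covariance Sigma(theta)

for B in (8, 64, 512):
    # Covariance of the batch-mean gradient, estimated over 2000 random batches
    batch_means = np.stack([
        grads[rng.integers(0, N, size=B)].mean(axis=0) for _ in range(2000)
    ])
    emp_cov = np.cov(batch_means, rowvar=False)
    ratio = np.trace(emp_cov) / (np.trace(Sigma) / B)
    print(f"B={B:4d}  tr(cov(G_est)) / (tr(Sigma)/B) ≈ {ratio:.2f}")  # ≈ 1.0
```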

chufanchen commented 7 months ago

Let $G$ denote the true gradient and $H$ the true Hessian at parameter values $\theta$. If we perturb the parameters $\theta$ by some vector $V$ to $\theta - \epsilon V$, where $\epsilon$ is the step size, then to second order:

$$ L(\theta- \epsilon V) \approx L(\theta) -\epsilon G^TV+\frac{1}{2}\epsilon^2V^THV $$

If we had access to the noiseless true gradient $G$ and used it to perturb the parameters ($V=G$), then the above equation would be minimized by setting $\epsilon=\epsilon_{max}=\frac{\vert G\vert^2}{G^T H G}$.
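
Spelling out this step (not written out above): with $V=G$ the expansion reads

$$ L(\theta-\epsilon G) \approx L(\theta)-\epsilon\vert G\vert^2+\frac{1}{2}\epsilon^2 G^T H G $$

and setting the derivative with respect to $\epsilon$ to zero gives $\epsilon_{max}=\frac{\vert G\vert^2}{G^T H G}$; substituting it back yields the improvement $\Delta L_{max}=\frac{1}{2}\frac{\vert G\vert^4}{G^T H G}$ that appears below.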

However, in reality we have access only to the noisy estimated gradient $G_{est}$ from a batch of size $B$, thus the best we can do is minimize the expectation $\mathbb{E}[L(\theta-\epsilon G_{est})]$ with respect to $\epsilon$:

$$ \mathbb{E}\left[L\left(\theta-\epsilon G_{\text{est}}\right)\right]=L(\theta)-\epsilon|G|^2+\frac{1}{2} \epsilon^2\left(G^T H G+\frac{tr(H \Sigma)}{B}\right) $$
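
The $tr(H\Sigma)/B$ term comes from plugging $V=G_{est}$ into the second-order expansion and taking expectations: the linear term keeps its mean value $\mathbb{E}[G^T G_{est}]=\vert G\vert^2$, while the quadratic term picks up the gradient noise,

$$ \mathbb{E}[G_{est}^T H G_{est}]=\mathbb{E}[G_{est}]^T H\,\mathbb{E}[G_{est}]+tr\big(H\,\mathrm{cov}(G_{est})\big)=G^T H G+\frac{tr(H\Sigma)}{B} $$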

Minimizing this equation with respect to $\epsilon$ leads to:

$$ \epsilon_{\text{opt}}(B)=\text{argmin}_\epsilon\, \mathbb{E}\left[L\left(\theta-\epsilon G_{\text{est}}\right)\right]=\frac{\epsilon_{\text{max}}}{1+\mathcal{B}_{\text{noise}} / B} $$

as the optimal step size, which produces an optimal improvement in the loss from the noisy gradient:

$$ \Delta L_{\text{opt}}(B)=\frac{\Delta L_{max}}{1+\mathcal{B}_{\text{noise}} / B}; $$

$$ \Delta L_{max }=\frac{1}{2} \frac{|G|^4}{G^T H G} $$

Above, we have defined the noise scale as:

$$ \mathcal{B}_{\text {noise }}=\frac{tr(H \Sigma)}{G^T H G} $$
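
A minimal sketch (assuming a toy quadratic loss with a hand-picked positive-definite $H$, a fixed noise covariance $\Sigma$, and a fixed gradient $G$; nothing here is taken from the paper) that checks the closed-form $\epsilon_{opt}(B)$ against a grid search over the expected loss change:

```python
# Sketch: compare eps_opt(B) = eps_max / (1 + B_noise / B) with a numerical
# minimization of E[L(theta - eps * G_est)] for a toy quadratic problem.
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
H = A @ A.T + d * np.eye(d)                      # assumed positive-definite Hessian
Sigma = np.diag(rng.uniform(0.5, 2.0, size=d))   # assumed gradient-noise covariance
G = rng.normal(size=d)                           # assumed true gradient

eps_max = (G @ G) / (G @ H @ G)
B_noise = np.trace(H @ Sigma) / (G @ H @ G)      # noise scale tr(H Sigma) / (G^T H G)

for B in (1, 8, 64, 512):
    # For a quadratic loss the expected change in loss is exactly
    # E[dL(eps)] = -eps*|G|^2 + 0.5*eps^2*(G^T H G + tr(H Sigma)/B)
    curvature = G @ H @ G + np.trace(H @ Sigma) / B
    eps_grid = np.linspace(0.0, 2.0 * eps_max, 100_001)
    expected_dL = -eps_grid * (G @ G) + 0.5 * eps_grid**2 * curvature
    eps_numeric = eps_grid[np.argmin(expected_dL)]
    eps_formula = eps_max / (1.0 + B_noise / B)
    print(f"B={B:4d}  formula={eps_formula:.6f}  grid search={eps_numeric:.6f}")
```

Because the loss is quadratic, the expectation in the objective is exact, so the two columns should agree up to the grid resolution for every $B$, and both approach $\epsilon_{max}$ as $B \gg \mathcal{B}_{noise}$.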