**Blunde1** opened this issue 9 months ago
To second order, quite generally we have $$IC(\theta) = \operatorname{tr}\left(\mathrm{E}\left[\nabla_\theta^2 l(u;\hat{\theta})\right] \operatorname{Cov}(\hat{\theta})\right)$$ The sample average replacing $\operatorname{Cov}(\hat{\theta})$ is not the best estimator. In fact, the "trace inner product" induces the Frobenius norm as a measure, and there exist results on adaptive inflation that improve the estimator under this norm.
It is not exactly the sample estimator that is employed for $\operatorname{Cov}(\hat{\theta})$, but rather the Delta method using the sample covariance for $\operatorname{Cov}(\nabla_\theta l(u;\hat{\theta}))$. The argument on the "best estimator" above still applies. This is particularly relevant when $p \gg n$ but a global maximum still exists (e.g. under $L_2$ regularization of the objective).
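As a concrete sketch of the construction above (a toy Gaussian model standing in for the real estimator; all names are hypothetical): estimate $\operatorname{Cov}(\nabla_\theta l)$ by the sample covariance of the per-observation scores, push it through the inverse Hessian via the Delta method, then take the trace against the expected (negative) Hessian:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
u = rng.normal(loc=1.0, scale=2.0, size=n)

# MLE of a Gaussian (mu, sigma^2) -- a toy stand-in for theta_hat
mu_hat = u.mean()
s2_hat = u.var()

# Per-observation score vectors nabla_theta l(u_i; theta_hat)
scores = np.column_stack([
    (u - mu_hat) / s2_hat,
    (u - mu_hat) ** 2 / (2 * s2_hat**2) - 1 / (2 * s2_hat),
])

# Expected negative Hessian (Fisher information) per observation,
# used so the resulting penalty comes out positive
H = np.array([
    [1 / s2_hat, 0.0],
    [0.0, 1 / (2 * s2_hat**2)],
])

# Delta method: Cov(theta_hat) ~ H^{-1} Cov(score) H^{-1} / n
K = np.cov(scores, rowvar=False)
H_inv = np.linalg.inv(H)
cov_theta = H_inv @ K @ H_inv / n

# IC = tr( E[Hessian] Cov(theta_hat) ); roughly p/n under correct specification
ic = np.trace(H @ cov_theta)
```

Under correct specification $K \approx H$ and the trace collapses toward $p/n$ (here $2/200$), which is the sanity check this toy makes visible.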
Note that this (implicitly) answers the question in https://www.tandfonline.com/doi/abs/10.1198/000313006X152207 on why practitioners should care about parameter variance while ignoring bias.
The loss should use the triangular structure, and thus the additive nature of the log-likelihood. For each sub log-likelihood we may add its information criterion component. I.e., we seek to evaluate $$l(u;\hat{\Lambda})=\sum_j l(u_j;\hat{C}_j)$$
and to estimate $$\mathrm{E}[l(u_{\mathrm{test}};\hat{\Lambda})]$$ as $$\mathrm{E}[l(u_{\mathrm{test}};\hat{\Lambda})]\approx \sum_j \left( l(u_{j,\mathrm{train}};\hat{C}_j) + IC(\hat{C}_j) \right)$$ where for $IC(\hat{C}_j)$ we may try the trace expression above.
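In code, the additive evaluation is then just a sum of per-row penalized training log-likelihoods. A minimal sketch, assuming each triangular row $\hat{C}_j$ is fitted by its own sub-model (the interface and the numbers here are purely hypothetical):

```python
# Hypothetical interface: each triangular row C_j is fitted separately and
# reports its training log-likelihood together with its IC penalty.
def penalized_loglik(sub_models):
    """Estimate E[l(u_test; Lambda_hat)] as sum_j l(u_j_train; C_j) + IC(C_j)."""
    return sum(m["loglik_train"] + m["ic"] for m in sub_models)

# Toy numbers for three triangular rows, made up for illustration
sub_models = [
    {"loglik_train": -120.4, "ic": 0.011},
    {"loglik_train": -98.7, "ic": 0.009},
    {"loglik_train": -143.2, "ic": 0.014},
]
total = penalized_loglik(sub_models)  # -362.266
```

The point of the additive structure is exactly this: each $IC(\hat{C}_j)$ can be computed from its own sub-model, so no joint covariance over all of $\hat{\Lambda}$ is ever needed.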
All of the above relies on asymptotic results. Is it possible to use e.g. the bootstrap (in either the Bayesian or the frequentist domain) to relax these assumptions when $n$ is small?