**Blunde1** opened this issue 9 months ago
To second order, quite generally we have $$IC(\theta) = \operatorname{tr}\left(\mathrm{E}\left[\nabla_\theta^2 l(u;\hat{\theta})\right] \operatorname{Cov}(\hat{\theta})\right)$$ The sample average replacing $\operatorname{Cov}(\hat{\theta})$ is not the best estimator. In fact, the "trace inner product" induces the Frobenius norm as a measure, and there exist results on adaptive inflation that improve the estimator under this norm.
It is not exactly the sample estimator that is employed for $\operatorname{Cov}(\hat{\theta})$, but rather the Delta method using the sample covariance for $\operatorname{Cov}(\nabla_\theta l(u;\hat{\theta}))$. The argument on the "best estimator" above still applies. This is particularly relevant when $p \gg n$ but a global maximum still exists (e.g. under $L_2$ regularization of the objective).
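As a concrete sketch of the construction above (a toy Gaussian model standing in for the real estimator; all names are hypothetical): estimate $\operatorname{Cov}(\nabla_\theta l)$ by the sample covariance of the per-observation scores, push it through the inverse Hessian via the Delta method, then take the trace against the expected (negative) Hessian:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
u = rng.normal(loc=1.0, scale=2.0, size=n)

# MLE of a Gaussian (mu, sigma^2) -- a toy stand-in for theta_hat
mu_hat = u.mean()
s2_hat = u.var()

# Per-observation score vectors nabla_theta l(u_i; theta_hat)
scores = np.column_stack([
    (u - mu_hat) / s2_hat,
    (u - mu_hat) ** 2 / (2 * s2_hat**2) - 1 / (2 * s2_hat),
])

# Expected negative Hessian (Fisher information) per observation,
# used so the resulting penalty comes out positive
H = np.array([
    [1 / s2_hat, 0.0],
    [0.0, 1 / (2 * s2_hat**2)],
])

# Delta method: Cov(theta_hat) ~ H^{-1} Cov(score) H^{-1} / n
K = np.cov(scores, rowvar=False)
H_inv = np.linalg.inv(H)
cov_theta = H_inv @ K @ H_inv / n

# IC = tr( E[Hessian] Cov(theta_hat) ); roughly p/n under correct specification
ic = np.trace(H @ cov_theta)
```

Under correct specification $K \approx H$ and the trace collapses toward $p/n$ (here $2/200$), which is the sanity check this toy makes visible.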
Note that this (implicitly) answers the question in https://www.tandfonline.com/doi/abs/10.1198/000313006X152207 on why practitioners should care about parameter variance while ignoring bias.
The loss should use the triangular structure, and thus the additive nature of the log-likelihood. For each sub log-likelihood we may add its information criterion component. I.e., we seek to evaluate $$l(u;\hat{\Lambda})=\sum_j l(u_j;\hat{C}_j)$$
and to estimate $$\mathrm{E}[l(u_{\mathrm{test}};\hat{\Lambda})]$$ as $$\mathrm{E}[l(u_{\mathrm{test}};\hat{\Lambda})]\approx \sum_j \left( l(u_{j,\mathrm{train}};\hat{C}_j) + IC(\hat{C}_j) \right)$$ where for $IC(\hat{C}_j)$ we may try the trace expression above.
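In code, the additive evaluation is then just a sum of per-row penalized training log-likelihoods. A minimal sketch, assuming each triangular row $\hat{C}_j$ is fitted by its own sub-model (the interface and the numbers here are purely hypothetical):

```python
# Hypothetical interface: each triangular row C_j is fitted separately and
# reports its training log-likelihood together with its IC penalty.
def penalized_loglik(sub_models):
    """Estimate E[l(u_test; Lambda_hat)] as sum_j l(u_j_train; C_j) + IC(C_j)."""
    return sum(m["loglik_train"] + m["ic"] for m in sub_models)

# Toy numbers for three triangular rows, made up for illustration
sub_models = [
    {"loglik_train": -120.4, "ic": 0.011},
    {"loglik_train": -98.7, "ic": 0.009},
    {"loglik_train": -143.2, "ic": 0.014},
]
total = penalized_loglik(sub_models)  # -362.266
```

The point of the additive structure is exactly this: each $IC(\hat{C}_j)$ can be computed from its own sub-model, so no joint covariance over all of $\hat{\Lambda}$ is ever needed.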
All of the above relies on asymptotic results. Is it possible to use e.g. the bootstrap (in either the Bayesian or the frequentist domain) to relax these assumptions when $n$ is small?