JuliaStats / Survival.jl

Survival analysis in Julia
MIT License
73 stars 23 forks source link

Store only the covariance matrix in `CoxModel`, not the information #41

Open ararslan opened 2 years ago

ararslan commented 2 years ago

Currently we're storing both the (observed) Fisher information matrix and its inverse, the variance-covariance matrix, separately in the CoxModel object. We aren't actually using the information matrix for anything though, so we can save space by storing only the covariance matrix. If folks need the information matrix (and indeed, if we extend informationmatrix(::CoxModel; expected=false)), we can take the inverse as needed.

Alternatively, we could store a Cholesky decomposition of the information (or covariance), which is what this PR originally did.

Additionally, on main, the covariance matrix is obtained from the observed information via pinv, which allows obtaining a covariance matrix even in the case of rank deficiency in the model matrix, which doesn't really make sense. Instead, we can check for rank deficiency ahead of time and error out early before even attempting to fit a model.

codecov-commenter commented 2 years ago

Codecov Report

Base: 99.63% // Head: 99.64% // Increases project coverage by +0.00% :tada:

Coverage data is based on head (e2810db) compared to base (12164f5). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #41 +/- ## ======================================= Coverage 99.63% 99.64% ======================================= Files 6 6 Lines 276 283 +7 ======================================= + Hits 275 282 +7 Misses 1 1 ``` | [Impacted Files](https://codecov.io/gh/JuliaStats/Survival.jl/pull/41?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaStats) | Coverage Δ | | |---|---|---| | [src/cox.jl](https://codecov.io/gh/JuliaStats/Survival.jl/pull/41/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaStats#diff-c3JjL2NveC5qbA==) | `99.31% <100.00%> (+0.03%)` | :arrow_up: | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaStats). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaStats)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

andreasnoack commented 2 years ago

I'm in favor of this but I'm wondering if we should just use Cholesky. We only use this field for computing the uncertainty after the estimation, right, and the inverse calculation will fail in the singular case.

ararslan commented 2 years ago

We only use this field for computing the uncertainty after the estimation, right

AFAICT the Fisher information field isn't meaningfully used anywhere, but the covariance matrix field is indeed used for uncertainty. The fields store the Hessian and its inverse after fitting, they aren't being passed anywhere during fitting.

and the inverse calculation will fail in the singular case.

The covariance matrix is currently being computed via pinv, so we do get a result even in the singular case.

andreasnoack commented 2 years ago

But with this PR it would fail, no? I'm not sure using pinv is sound when computing a variance estimate so failing might be better.

ararslan commented 2 years ago

But with this PR it would fail, no?

No, at least as written it would use a pivoted Cholesky factorization.

I'm not sure using pinv is sound when computing a variance estimate so failing might be better.

Yeah I agree, not sure why it uses pinv currently (though I guess it's likely specifically to handle this case). It would be nice to be able to determine that case ahead of time to throw an error early and avoid fitting the model altogether if possible.

ararslan commented 2 years ago

Would it be enough to add something like

rank(M) < min(size(M)...) && throw(ArgumentError("model matrix is not full rank"))

to the fit(CoxModel, ...) method?

Your point about only using the field for uncertainty is a very good one; I wonder whether it's even worth storing the decomposition of the Fisher information since we primarily care about the the covariance matrix. We could store that instead, either dense or factorized. Thoughts on that?

ararslan commented 2 years ago

Based on our discussion, the title of the PR is now inaccurate and (I've updated that and the OP) what's implemented is the following:

Let me know what you think, @andreasnoack!