Closed multimeric closed 3 years ago
Oops, turns out this is called errors()
, and is already implemented.
Although I think these are better named "residuals", because it's a more precise term that indicates that we can't observe the true value of y.
Yes, I believe the errors()
function is meant to check the difference between the fitted model's prediction and the true y values for an arbitrary dataset, so you can compute the residuals by passing in the training data to errors()
. But there should be a cleaner way of getting the fit residuals directly.
Perhaps residuals_raw()
for (y - E[y|x]) and residuals_std()
for (y - E[y|x])/sqrt(Var[y|x]) ?
I think residuals_response()
for the difference between value and prediction, residuals_pearson()
for the standardized residuals, and residuals_deviance()
for the likelihood-based version is more in line with typical GLM terminology. I do want to refresh on studentized residuals to see if/how that has a place in this interface.
Yep, that would make for a good API. By the way since we're on this topic I wonder if you should provide either a field or a function for the residual sum of squares, and also the related total sum of squares, and regression sum of squares? Happy to write a PR, I'm just not sure what you've already started.
The response and standardized residuals have been implemented in 06ec0b6. I added a hook for the deviance residuals as well although it's currently unimplemented, as it will require refactoring some of the likelihood calculations.
A function for residual sum of squares probably makes sense although perhaps it should only be exposed for linear regression -- I'm not sure it's generally useful for all models. Let's address that in #21.
On a related note, I think perhaps the Fit::errors()
function should be made private. Since it takes a Model
argument rather than data arrays, I don't think this will be generally useful to external users right now and will probably just lead to confusion.
Yep, I think that makes sense.
Deviance residuals are implemented in f4a4686.
Before closing this issue I still want to look further into studentized residuals and determine if this should be enabled by default, as an alternative or external function, or what.
What do you mean by "enabled by default"?
I've been meaning to resolve this issue completely before publishing 0.0.10, but I've been too busy lately to follow through. It's turned out to be a little more complicated than I initially thought; there are about 3-5 different types of residuals and each could either be returned raw, internally studentized, or externally studentized. An interface that is somehow "multiplicative" over this grid of alternatives seems desirable, but I also don't want to make the basic cases overly-verbose to use.
I've pushed several changes to the 0.0.10 branch, including adding several more flavors of residuals. (It took me a while to settle on any scheme because it seems like every reference uses different definitions and naming conventions. I ended up trying to stay close to what's used in R.)
I did change the naming scheme to something less verbose, for instance residuals_pearson()
is now resid_pear()
. If you get a chance to try it out @multimeric feel free to add any comments. Of interest to you specifically might be resid_pear_std()
which standardizes the Pearson residuals by the sample standard deviation and applies a leverage correction for each point.
I think the last issue I'd like to work out here is externally studentized residuals. The full exact calculations for these are expensive, and typically one-step approximations are used. The rstudent
function in R uses another expression that should be even faster but it's not obvious to me how the expression is derived and I'd like to understand it a bit better.
0.0.10 has been published, so all this is now available in the latest release.
I note that there is no
residuals
field or method on theFit
object. I guess I couldfit.expectation(x)
(which by the way is not documented due to #18) and subtracty
, but it might be a bit more user friendly to make a method for this. In particular you could provide the option to output standardised residuals, which is actually what I'm trying to calculate here.