felix-clark / ndarray-glm

Rust library for linear, logistic, and generalized linear model regression
MIT License

Simple way to extract residuals? #19

Closed: multimeric closed this issue 3 years ago

multimeric commented 3 years ago

I note that there is no residuals field or method on the Fit object. I guess I could call fit.expectation(x) (which, by the way, is not documented due to #18) and subtract that from y, but it might be a bit more user-friendly to have a method for this. In particular, you could provide the option to output standardised residuals, which is what I'm actually trying to calculate here.
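A minimal sketch of that workaround (observed y minus the fitted expectation), assuming README-style usage of ndarray-glm; the builder call and the exact signature of expectation() are assumptions, not verified against this version of the crate:

```rust
// Sketch only: the ModelBuilder call and expectation() signature are assumed,
// based on the crate's README-style usage, not copied from its docs.
use ndarray::{array, Array1};
use ndarray_glm::{Linear, ModelBuilder};

fn main() {
    let y: Array1<f64> = array![0.3, 1.3, 0.7, 0.2];
    let x = array![[0.1, 0.2], [-0.4, 0.1], [0.2, 0.4], [0.0, -0.1]];

    let model = ModelBuilder::<Linear>::data(&y, &x).build().unwrap();
    let fit = model.fit().unwrap();

    // Raw (response) residuals: observed y minus E[y|x] from the fit.
    let mu = fit.expectation(&x);
    let resid = &y - &mu;
    println!("residuals: {}", resid);
}
```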

multimeric commented 3 years ago

Oops, turns out this is called errors(), and is already implemented.

multimeric commented 3 years ago

Although I think these would be better named "residuals", since that's the more precise term: it signals that these are deviations from the fitted model rather than the unobservable true errors in y.

felix-clark commented 3 years ago

Yes, I believe the errors() function is meant to check the difference between the fitted model's prediction and the true y values for an arbitrary dataset, so you can compute the residuals by passing in the training data to errors(). But there should be a cleaner way of getting the fit residuals directly.

Perhaps residuals_raw() for (y - E[y|x]) and residuals_std() for (y - E[y|x])/sqrt(Var[y|x]) ?

felix-clark commented 3 years ago

I think residuals_response() for the difference between value and prediction, residuals_pearson() for the standardized residuals, and residuals_deviance() for the likelihood-based version is more in line with typical GLM terminology. I do want to refresh on studentized residuals to see if/how that has a place in this interface.
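For reference, with fitted mean $\hat\mu_i$, variance function $V(\cdot)$, and per-observation deviance contribution $d_i$, these three flavours are conventionally defined as follows (standard GLM textbook definitions, not taken from the crate's docs):

```latex
r^{\mathrm{resp}}_i = y_i - \hat\mu_i, \qquad
r^{\mathrm{pear}}_i = \frac{y_i - \hat\mu_i}{\sqrt{V(\hat\mu_i)}}, \qquad
r^{\mathrm{dev}}_i  = \operatorname{sign}(y_i - \hat\mu_i)\,\sqrt{d_i},
\quad\text{where } \sum_i d_i = D \text{ (the total deviance).}
```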

multimeric commented 3 years ago

Yep, that would make for a good API. By the way, since we're on this topic, I wonder if you should also provide either a field or a function for the residual sum of squares, along with the related total and regression sums of squares? Happy to write a PR; I'm just not sure what you've already started.
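For reference, the three sums of squares mentioned here are defined as below (standard definitions; the decomposition and $R^2$ identity on the second line assume ordinary least squares with an intercept):

```latex
\mathrm{SS}_{\mathrm{res}} = \sum_i (y_i - \hat y_i)^2, \qquad
\mathrm{SS}_{\mathrm{tot}} = \sum_i (y_i - \bar y)^2, \qquad
\mathrm{SS}_{\mathrm{reg}} = \sum_i (\hat y_i - \bar y)^2,
```
```latex
\mathrm{SS}_{\mathrm{tot}} = \mathrm{SS}_{\mathrm{reg}} + \mathrm{SS}_{\mathrm{res}},
\qquad R^2 = 1 - \frac{\mathrm{SS}_{\mathrm{res}}}{\mathrm{SS}_{\mathrm{tot}}}.
```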

felix-clark commented 3 years ago

The response and standardized residuals have been implemented in 06ec0b6. I added a hook for the deviance residuals as well although it's currently unimplemented, as it will require refactoring some of the likelihood calculations.

A function for residual sum of squares probably makes sense although perhaps it should only be exposed for linear regression -- I'm not sure it's generally useful for all models. Let's address that in #21.

felix-clark commented 3 years ago

On a related note, I think perhaps the Fit::errors() function should be made private. Since it takes a Model argument rather than data arrays, I don't think this will be generally useful to external users right now and will probably just lead to confusion.

multimeric commented 3 years ago

Yep, I think that makes sense.

felix-clark commented 3 years ago

Deviance residuals are implemented in f4a4686.

Before closing this issue I still want to look further into studentized residuals and decide whether they should be enabled by default, offered as an alternative or separate function, or handled some other way.

multimeric commented 3 years ago

What do you mean by "enabled by default"?

felix-clark commented 3 years ago

I've been meaning to resolve this issue completely before publishing 0.0.10, but I've been too busy lately to follow through. It's turned out to be a little more complicated than I initially thought: there are roughly 3-5 different types of residuals, and each could be returned raw, internally studentized, or externally studentized. An interface that is somehow "multiplicative" over this grid of alternatives seems desirable, but I also don't want to make the basic cases overly verbose to use.
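One hypothetical way such a "multiplicative" interface could look is sketched below. None of these names exist in ndarray-glm; the enums, the stand-in Fit type, and the residuals method are purely illustrative of covering the residual-type × studentization grid with a single entry point plus terse convenience wrappers:

```rust
// Hypothetical sketch only: not the crate's actual API.
use ndarray::Array1;

#[derive(Clone, Copy)]
pub enum ResidualKind {
    Response,
    Pearson,
    Deviance,
}

#[derive(Clone, Copy)]
pub enum Studentization {
    None,
    Internal,
    External,
}

/// Stand-in for the crate's Fit type, only so this sketch is self-contained.
pub struct Fit;

impl Fit {
    /// One entry point covering every (kind, studentization) combination.
    pub fn residuals(&self, _kind: ResidualKind, _stud: Studentization) -> Array1<f64> {
        todo!("dispatch on the (kind, studentization) pair")
    }

    /// Short convenience wrapper so the common case stays terse.
    pub fn resid_pear(&self) -> Array1<f64> {
        self.residuals(ResidualKind::Pearson, Studentization::None)
    }
}
```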

felix-clark commented 3 years ago

I've pushed several changes to the 0.0.10 branch, including adding several more flavors of residuals. (It took me a while to settle on any scheme because it seems like every reference uses different definitions and naming conventions. I ended up trying to stay close to what's used in R.)

I did change the naming scheme to something less verbose; for instance, residuals_pearson() is now resid_pear(). If you get a chance to try it out, @multimeric, feel free to add any comments. Of particular interest to you might be resid_pear_std(), which standardizes the Pearson residuals by the sample standard deviation and applies a leverage correction to each point.

I think the last issue I'd like to work out here is externally studentized residuals. The full exact calculations for these are expensive, so one-step approximations are typically used. The rstudent function in R uses yet another expression that should be even faster, but it's not obvious to me how that expression is derived and I'd like to understand it a bit better.
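For reference, the classical linear-model versions of the quantities discussed in this comment are below (the GLM analogues and the derivation of R's shortcut are exactly what remains open here). With leverage $h_{ii}$ and dispersion estimate $\hat\phi$, the leverage-corrected (internally studentized) Pearson residual is usually written

```latex
r_i = \frac{y_i - \hat\mu_i}{\sqrt{\hat\phi\, V(\hat\mu_i)\,(1 - h_{ii})}},
```

and for ordinary least squares with $n$ observations and $p$ fitted parameters, the externally studentized residual follows from the internal one without refitting:

```latex
t_i = r_i \sqrt{\frac{n - p - 1}{\,n - p - r_i^2\,}}.
```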

felix-clark commented 3 years ago

0.0.10 has been published, so all this is now available in the latest release.