berkeley-stat159 / project-iota

BSD 3-Clause "New" or "Revised" License

weighted least squares linear regression #90

Closed Jay4869 closed 8 years ago

Jay4869 commented 8 years ago

Since in our dataset the variance of the blood measurements is not constant (i.e., the error variance is not constant), I tried two methods to fix the issue, in order to improve our model's accuracy and support further research.

Variance of Y not constant: if the variance of Y is not constant, then the error variance will not be constant. The most common form of such heteroscedasticity in Y is that the variance of Y may increase as the mean of Y increases, for data with positive X and Y.

Unless the heteroscedasticity of Y is pronounced, its effect will not be severe: the least squares estimates will still be unbiased, and the estimates of the slope and intercept will either be normally distributed if the errors are normally distributed, or at least asymptotically normally distributed (as the number of data points becomes large) if the errors are not normally distributed. The estimates for the variances of the slope and intercept will be inaccurate, but the inaccuracy is not likely to be substantial if the X values are symmetric about their mean.

Heteroscedasticity of Y is usually detected informally by examining the X-Y scatterplot of the data before performing the regression. If both nonlinearity and unequal variances are present, employing a transformation of Y may have the effect of simultaneously improving the linearity and promoting equality of the variances. Otherwise, a weighted least squares linear regression may be the preferred method of dealing with nonconstant variance of Y.

Resource http://www.basic.northwestern.edu/statguidefiles/linreg_ass_viol.html

Jay4869 commented 8 years ago

@matthew-brett @jarrodmillman I need a little help with the White test, which is a hypothesis test for heteroscedasticity. The first link below is what I am trying to do, and the second link is the Python White test, but I am confused about the function's return values. Do you have any knowledge about it? Should I just look at lm_pvalue and, if it is less than 0.05, reject the null of homoscedasticity?

resource: http://www.econ.uiuc.edu/~wsosa/econ471/GLSHeteroskedasticity.pdf (Page 4) http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.diagnostic.het_white.html#statsmodels.stats.diagnostic.het_white

Jay4869 commented 8 years ago

This is done at this point.

matthew-brett commented 8 years ago

Thanks for letting us know, and closing.

Jay4869 commented 8 years ago

@matthew-brett Do you have any background on the statsmodels.stats White test? Could you look it up if you have time?

matthew-brett commented 8 years ago

I think that function does what the presentation is describing. Going by the docs linked above, the inputs are:

* `resid` — the residuals from the original OLS fit;
* `exog` — the original design matrix.

The first two outputs are:

* the Lagrange multiplier statistic (the n * R^2 value from the auxiliary regression);
* the p-value for that statistic.

I think the statsmodels implementation is doing some fancy numpy work to construct the "auxiliary" design (including all the squares of the regressors and their cross-products) from the original design, which I have written as X.

matthew-brett commented 8 years ago

For the construction of the auxiliary design from the input design X, see: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/stats/diagnostic.py#L658

This is some tricky numpy stuff to make a new design that is the product of each column with every other column of the input design X (or x in that code).
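The idea can be sketched more plainly with an explicit loop instead of the vectorized numpy tricks in that file (this is an illustrative reconstruction, not the actual statsmodels code):

```python
import numpy as np

def auxiliary_design(X):
    """Return X augmented with all products X[:, i] * X[:, j] for i <= j,
    i.e. the squares and pairwise cross-products of the regressors."""
    n, k = X.shape
    cols = [X]
    for i in range(k):
        for j in range(i, k):
            cols.append((X[:, i] * X[:, j])[:, None])
    return np.hstack(cols)

# Tiny example: a constant column plus one regressor.
X = np.column_stack([np.ones(4), np.arange(4.0)])
Z = auxiliary_design(X)
print(Z.shape)  # 2 original columns + 3 products (0*0, 0*1, 1*1) -> (4, 5)
```

The White test then regresses the squared OLS residuals on this auxiliary design and forms n * R^2 from that regression.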

Jay4869 commented 8 years ago

@matthew-brett Thank you for the explanation. I figured out all the inputs of the het_white function, and just wanted to make sure about the outputs. Yes, I am looking for n * R^2 and the p-value for n * R^2, which indicates the significance of the heteroscedasticity. Thanks again for your time!