EHWUSF / HS68_2018_Project_1


Testing/Checking Linear Regression assumptions and Evaluating the Model #12

Open rohitchadaram opened 6 years ago

rohitchadaram commented 6 years ago

The basic idea is to build a module which tests the basic assumptions of linear regression and then calculates metrics such as R-squared and the F-statistic to evaluate whether the data set has a good linear regression fit.

Outline: Build a Data Viz sub-module which produces all of the plots relevant to testing the linear regression assumptions, alongside a reference plot of what a "normal" result looks like. The next sub-module addresses a gap: there are multiple methods which check/evaluate a linear regression individually, but none which does it in its entirety and gives us a single final metric to support an informed decision. My idea is to build a new metric which evaluates the model using a harmonic mean of some of the existing metrics to output a final value. This is not a fool-proof solution and it has its pitfalls, but it can be used alongside the other features being developed, such as Data Viz and Feature Selection, to arrive at an accurate/appropriate model.
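A minimal sketch of what the Data Viz sub-module could look like, assuming a Python stack with statsmodels and matplotlib; the function name `plot_regression_diagnostics` is illustrative, not an existing API:

```python
# A possible shape for the Data Viz sub-module (names here are illustrative, not final).
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm


def plot_regression_diagnostics(X, y):
    """Fit an OLS model and show standard assumption-checking plots."""
    model = sm.OLS(y, sm.add_constant(X)).fit()
    fitted = model.fittedvalues
    resid = model.resid

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Residuals vs fitted: checks linearity and equal variance (homoscedasticity).
    axes[0].scatter(fitted, resid, alpha=0.6)
    axes[0].axhline(0, color="red", linestyle="--")
    axes[0].set_title("Residuals vs Fitted")

    # Q-Q plot: checks normality of residuals.
    sm.qqplot(resid, line="45", fit=True, ax=axes[1])
    axes[1].set_title("Normal Q-Q")

    # Residuals vs observation order: rough check of independence.
    axes[2].plot(np.arange(len(resid)), resid, marker="o", linestyle="-")
    axes[2].axhline(0, color="red", linestyle="--")
    axes[2].set_title("Residuals vs Order")

    plt.tight_layout()
    plt.show()
    return model
```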

The important sub-features of this proposal are :

1) Build a way to check for the assumptions of linear regression: linearity of residuals, independence of residuals, normal distribution of residuals, and equal variance of residuals.

2) Build a new metric from the existing metrics which evaluates a linear regression model.

This module is intended as a one-stop place to check and evaluate a data set for linear regression.
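For the assumption checks in sub-feature 1, one possible shape is sketched below. The function name `check_assumptions` is hypothetical, and the specific choice of tests is my own assumption, using tests that ship with statsmodels:

```python
# Hypothetical check_assumptions helper: one numeric summary per assumption.
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, linear_rainbow
from statsmodels.stats.stattools import durbin_watson, jarque_bera


def check_assumptions(X, y):
    """Return test statistics/p-values for the four residual assumptions."""
    X_const = sm.add_constant(X)
    results = sm.OLS(y, X_const).fit()
    resid = results.resid

    rainbow_stat, rainbow_p = linear_rainbow(results)        # linearity
    dw = durbin_watson(resid)                                 # independence (~2 means no autocorrelation)
    jb_stat, jb_p, _, _ = jarque_bera(resid)                  # normality
    bp_stat, bp_p, _, _ = het_breuschpagan(resid, X_const)    # equal variance

    return {
        "linearity_rainbow_p": rainbow_p,
        "independence_durbin_watson": dw,
        "normality_jarque_bera_p": jb_p,
        "equal_variance_breusch_pagan_p": bp_p,
    }
```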

douglas-yao commented 5 years ago

I like this idea. Did you have an idea of which metrics you wanted to combine, and in what way?

rohitchadaram commented 5 years ago

Thank you for the query. Yes, I am planning to combine R-squared, the F-statistic, and the p-value, the three main values you generally seek out to report/analyze a linear regression model. I chose the harmonic mean because, when you have a wide range of values from low decimals (possibly 0) to very high values (possibly infinity), an arithmetic mean is skewed by the large values, and you can even end up with a reported value that is infinite (undefined). The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals, so it is far less biased by large/small value variations.
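A quick numeric illustration of that skew argument, assuming NumPy/SciPy are available:

```python
# A single very large value dominates the arithmetic mean
# but barely moves the harmonic mean.
import numpy as np
from scipy.stats import hmean

values = np.array([0.01, 0.5, 1000.0])   # wide range, as with p-values vs F-statistics
print(np.mean(values))    # ~333.5 -- dominated by the large value
print(hmean(values))      # ~0.03  -- pulled toward the small values instead
```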

So the idea was: the F-statistic (usual range 0 to infinity) and R-squared (0 to 1) are combined in a way which represents the strength of the model, i.e. multiply the two values, since large values of both indicate a good model. Then take a harmonic mean of that product with the reciprocal of the p-value (since a small p-value indicates the data are a better fit than a model with only an intercept/mean). A sufficiently large value of this metric can then be used to differentiate a good model from a bad one.
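A rough sketch of how that could be computed from a fitted statsmodels OLS result; `combined_fit_score` is a hypothetical name, and the epsilon guard against zero p-values is my own assumption:

```python
# Sketch of the proposed combined metric; not part of any library.
import statsmodels.api as sm
from scipy.stats import hmean


def combined_fit_score(results, eps=1e-12):
    """Harmonic mean of (F-statistic * R-squared) and 1/p-value."""
    strength = results.fvalue * results.rsquared      # large when the model is strong
    inv_p = 1.0 / max(results.f_pvalue, eps)          # large when the p-value is small
    return hmean([max(strength, eps), inv_p])         # hmean needs positive inputs


# Usage sketch:
# results = sm.OLS(y, sm.add_constant(X)).fit()
# print(combined_fit_score(results))
```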

So this is a very crude attempt to combine two aspects of the reported metrics, the fit of the model and the amount of variance explained, and to come up with a new metric for reporting. The values of the new metric can be judged by comparing them against threshold values, and those thresholds are calculated by plugging in the edge-case values for the F-statistic, R-squared, and the p-value. Hope this makes sense.

douglas-yao commented 5 years ago

Makes perfect sense. Looking forward to the final range and accuracy of the metric; it sounds like a great at-a-glance measure of fit.