MoreeZ / sweng-2022

Regression/Classification/Deep learning based models. We take any regression or classification use case and do our predictions. This is easier said than done. We need to follow all the assumptions of the algorithm that we use. We need to use various algorithms, stacking models, extensive parameter tuning, and various model evaluation tests. I personally would go for a regression model. Note: In all the models, a dashboard needs to be created based on the chosen use case.

First model : Linear regression #31

Open mingweiiiiiiiiii opened 2 years ago

mingweiiiiiiiiii commented 2 years ago

Our mean squared error is 0.004099985307764925. The visualization graph is in the backend channel (first model Jupyter notebook). @1978abhay

mingweiiiiiiiiii commented 2 years ago

Note: this is the MSE on the test dataset, not the training dataset. That is very important.
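
To make the train/test distinction concrete, here is a minimal sketch that fits a one-feature least-squares line on a training split and reports the MSE on the held-out split. The data are synthetic and the helper names (`fit_line`, `mean_squared_error`) are illustrative assumptions, not code from the project notebook.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (intercept, slope)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic data (an assumption for illustration): y = 0.5 * x + noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

# Hold out the last 30% of rows; the model never sees them while fitting.
split = len(xs) * 7 // 10
x_train, x_test = xs[:split], xs[split:]
y_train, y_test = ys[:split], ys[split:]

intercept, slope = fit_line(x_train, y_train)
train_mse = mean_squared_error(y_train, [intercept + slope * x for x in x_train])
test_mse = mean_squared_error(y_test, [intercept + slope * x for x in x_test])
print(f"train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The test MSE is the number to report, because it measures error on rows that played no part in estimating the coefficients.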

mingweiiiiiiiiii commented 2 years ago

Our MSE is better than most online solutions on Kaggle or other websites.

1978abhay commented 2 years ago

I would love to see the model diagnostic results. Search for "regression diagnostic visualisation Python". Also share the total number of rows and the number of rows you used in the training set. Then share the model results. I would love to check the feature significance.

mingweiiiiiiiiii commented 2 years ago

Thanks for your hint; you have helped our project a lot.

mingweiiiiiiiiii commented 2 years ago

image Diagnostics picture

mingweiiiiiiiiii commented 2 years ago

image

Fit regression line

mingweiiiiiiiiii commented 2 years ago

I used the 2016 dataset. After the removal operation, there are 88465 rows in total, and the training set has 61925 rows.
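
As a quick sanity check on these row counts, the arithmetic below shows they are consistent with a 70/30 train/test split in which the test size is rounded up. The 70/30 ratio and the rounding rule are assumptions; the comment above only states the two totals.

```python
total_rows = 88465

# Assumed 30% test fraction, rounded up to a whole row (integer ceiling division).
n_test = -(-total_rows * 3 // 10)
n_train = total_rows - n_test

print(n_train, n_test, round(n_train / total_rows, 4))
```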

mingweiiiiiiiiii commented 2 years ago
image

Feature importance. I use the top two features. When I add the third most important feature, it has P>|t| > 0.05, so it is not significant.
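
For readers unfamiliar with the P>|t| column, here is a rough sketch of how it can be computed for the slope of a one-feature regression. It uses a normal approximation to the t distribution, which is only reasonable for large samples such as the 61925 training rows mentioned above; a full regression summary would use the exact t distribution and handle several features at once. The function name and the toy data are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def slope_p_value(xs, ys):
    """Return (slope, t statistic, two-sided p-value) for y = a + b * x.

    The p-value uses a normal approximation to the t distribution.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance with n - 2 degrees of freedom (two fitted parameters).
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = sqrt(sse / (n - 2) / sxx)
    t_stat = slope / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return slope, t_stat, p_value

x = [-2, -1, 0, 1, 2]
y_related = [-3.5, -2.5, 0.0, 2.5, 3.5]    # roughly y = 1.9 * x
y_unrelated = [1.0, 0.0, 2.0, 0.0, 1.0]    # no linear trend in x

for label, y in [("related", y_related), ("unrelated", y_unrelated)]:
    slope, t_stat, p_value = slope_p_value(x, y)
    verdict = "significant" if p_value < 0.05 else "not significant"
    print(f"{label}: slope={slope:.3f}, p={p_value:.4f} ({verdict})")
```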

mingweiiiiiiiiii commented 2 years ago
image

VIF test of these two features
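
With exactly two features, the VIF has a simple closed form: regressing one feature on the other gives R² equal to the squared Pearson correlation r², so VIF = 1 / (1 - r²) for both features. The sketch below shows that special case only; with three or more features each VIF needs its own multiple regression. The toy values are made up and are not taxamount or finishedsquarefeet12 data.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    a_mean = sum(a) / n
    b_mean = sum(b) / n
    cov = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    var_a = sum((x - a_mean) ** 2 for x in a)
    var_b = sum((y - b_mean) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def vif_two_features(a, b):
    """VIF shared by both features when the model has exactly two features."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)

feature_a = [1, 2, 3, 4, 5]   # hypothetical values
feature_b = [2, 1, 4, 3, 5]   # correlated with feature_a (r = 0.8)

print(round(vif_two_features(feature_a, feature_b), 3))
```

A VIF near 1 means the features carry almost independent information; the VIF > 10 rule of thumb quoted later in this thread corresponds to r² above 0.9 in this two-feature case.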

mingweiiiiiiiiii commented 2 years ago
image

Mean square error on test dataset for linear regression model

mingweiiiiiiiiii commented 2 years ago
image
mingweiiiiiiiiii commented 2 years ago
image

Residuals for the finished living area feature only

mingweiiiiiiiiii commented 2 years ago
image

Taxamount residual

mingweiiiiiiiiii commented 2 years ago
image

Pairplot

mingweiiiiiiiiii commented 2 years ago
image

Q-Q plot for the predicted y

mingweiiiiiiiiii commented 2 years ago
image

Q-Q plot for y-test (actual log error)
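
For context, a Q-Q plot pairs the sorted, standardized sample with the quantiles a normal distribution would produce at the same ranks. The sketch below computes those pairs without plotting them. The plotting positions (i - 0.5) / n are one common convention and plotting libraries may use a slightly different one; the log-error values are made up.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical normal quantile, standardized observed value) pairs."""
    n = len(sample)
    ordered = sorted(sample)
    mu, sigma = mean(ordered), stdev(ordered)
    observed = [(v - mu) / sigma for v in ordered]
    # Plotting positions (i - 0.5) / n keep every probability strictly inside (0, 1).
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

log_errors = [-0.08, -0.02, -0.01, 0.0, 0.01, 0.02, 0.15]  # hypothetical values
for theo, obs in qq_points(log_errors):
    print(f"{theo:+.3f}  {obs:+.3f}")
```

If the sample were close to normal, the two columns would be nearly equal; large gaps at the ends indicate heavy tails.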

mingweiiiiiiiiii commented 2 years ago
image

These two features have the relatively largest variance according to the describe() output in pandas, and they are the top 2 features from random forest regression.

mingweiiiiiiiiii commented 2 years ago

@1978abhay

mingweiiiiiiiiii commented 2 years ago

Hi @1978abhay: I have uploaded the pictures that you mentioned today. It would be great if you could check them. Thanks very much. Kind regards, Mingwei

mingweiiiiiiiiii commented 2 years ago
image

Correlation matrix

1978abhay commented 2 years ago
  1. How did you decide on the two features to be included in the model, i.e. taxamount and finishedsquarefeet12? There are a lot of good features in the feature importance.
  2. Your R-squared is 0.005, which is too small. The model is underfitting a lot, which means it has a lot of bias.
  3. The R-squared value tells me that these two features aren't enough.
  4. Have you done any imputation, data transformation, removal of highly correlated features, or removal of outliers?
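
A small MSE and a small R-squared are not contradictory: on the same data, R² = 1 - MSE / Var(y), so when the target itself has a tiny variance, a model that only predicts the mean already has a tiny MSE while explaining nothing. The sketch below illustrates this with made-up target values on a small scale; it is not computed from the project data.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical small-scale target values.
y_true = [0.01, -0.03, 0.02, 0.00, -0.05, 0.06, -0.01]

# Baseline "model": always predict the mean of the target.
y_mean = sum(y_true) / len(y_true)
baseline = [y_mean] * len(y_true)

print(f"MSE = {mean_squared_error(y_true, baseline):.6f}")
print(f"R^2 = {r_squared(y_true, baseline):.6f}")
```

This is why comparing MSE against the variance of the target (or reporting R² alongside it) says more than the raw MSE alone.
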
mingweiiiiiiiiii commented 2 years ago

  1. If I add more features, the VIF value rises above 10, and sometimes the P value is > 0.05 for some features.
  2. No matter how many features I add, the fitted line stays the same, as you can see from the original graph. The points in the original graph are very dense in the middle and sparse on both sides.
  3. I do not know how to address this.
  4. I use K-nearest neighbors for data imputation and remove outliers in the target variable (Z-score > 3.0) to make it closer to normal. Previously I used Box-Cox for the data transformation, but it made some features look weird. For example, TAXAMOUNT went from 6300 to 630, shrinking 10 times.

Since I don't know what the next step should be, which part do you want me to demonstrate? I could demonstrate it, and we might find some errors in it. Thanks.
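
The Z-score filter on the target described in point 4 can be sketched as follows. This is an illustrative standard-library version with made-up values, not the notebook's code; the function name and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def remove_target_outliers(values, threshold=3.0):
    """Keep only values whose absolute Z-score is at most `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) <= threshold * sigma]

# Hypothetical target values: 19 small values and one extreme value.
targets = [0.0, 0.1, -0.1] * 6 + [0.05, 5.0]

kept = remove_target_outliers(targets)
print(len(targets), "->", len(kept))
```

One caveat: the extreme value itself inflates the mean and standard deviation, so in small samples a Z-score rule can miss outliers that are obvious by eye.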

mingweiiiiiiiiii commented 2 years ago

For 4, when I use Box-Cox for the transformation, it changes the data so that outliers cannot be detected on the test dataset.
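
A change of scale after Box-Cox is expected rather than a sign of an error: the transform is monotonic for positive inputs, so the ordering of values is preserved and each value can be mapped back. The sketch below shows the transform and its inverse for a single value. The λ value is hypothetical (in practice it would be estimated on the training data and then reused unchanged on the test data), and 6300 is simply the TAXAMOUNT example quoted earlier.

```python
from math import exp, log

def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive inputs")
    return log(x) if lam == 0 else (x ** lam - 1) / lam

def inverse_box_cox(z, lam):
    """Map a transformed value back to the original scale."""
    return exp(z) if lam == 0 else (lam * z + 1) ** (1 / lam)

lam = 0.5            # hypothetical lambda, not taken from the project
tax_amount = 6300.0

transformed = box_cox(tax_amount, lam)
restored = inverse_box_cox(transformed, lam)
print(f"{tax_amount} -> {transformed:.2f} -> {restored:.2f}")
```

Because the ordering is preserved, an outlier rule can still be applied after the transform, as long as the thresholds are also computed on the transformed scale.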

mingweiiiiiiiiii commented 2 years ago

image Latest feature importance

mingweiiiiiiiiii commented 2 years ago
image

Latest feature importance

mingweiiiiiiiiii commented 2 years ago
image

R-squared improved a little bit. @1978abhay

mingweiiiiiiiiii commented 2 years ago
image

New fitted graph