Open mingweiiiiiiiiii opened 2 years ago
Note: this is the MSE on the test dataset, not the training dataset. That is very important.
Our MSE is better than most of the solutions posted on Kaggle and other websites.
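A minimal sketch of the test-versus-training distinction noted above. The data here is synthetic (it is not the 2016 dataset), and the 70/30 cut, `fit_line`, and `mse` are illustrative names and choices rather than the notebook's actual code.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given rows."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic stand-in data (an assumption for illustration): y = 0.5 * x + noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(1000)]
ys = [0.5 * x + rng.gauss(0, 0.1) for x in xs]

# Fit on the first 70% of rows only; the last 30% are held out.
cut = int(0.7 * len(xs))
slope, intercept = fit_line(xs[:cut], ys[:cut])

train_mse = mse(xs[:cut], ys[:cut], slope, intercept)
test_mse = mse(xs[cut:], ys[cut:], slope, intercept)  # the number to report
print(f"train MSE = {train_mse:.5f}, test MSE = {test_mse:.5f}")
```

The model never sees the held-out rows during fitting, so the test MSE is the fairer number to compare against other solutions.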
I would love to see the model diagnostic results. Search for "regression diagnostic visualisation python". Also share the total number of rows and the number of rows you used in the training set. Then share the model results. I would love to check the feature significance.
Thanks for your hint; it helps our project a lot.
Diagnostics picture
Fitted regression line
I used the 2016 dataset after the row-removal step. The total number of rows is 88465 and the size of the training dataset is 61925.
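The two row counts above are consistent with a 70/30 split. The thread does not state the split ratio, so `test_fraction = 0.3` below is an assumption; the rounding rule (test size rounded up, remainder to train) is how scikit-learn's `train_test_split` handles a float `test_size`, as far as I know.

```python
import math

total_rows = 88465      # rows after cleaning, from the comment above
test_fraction = 0.3     # assumption: a 70/30 split, not stated in the thread

# Round the test size up and give the remaining rows to the training set.
n_test = math.ceil(test_fraction * total_rows)
n_train = total_rows - n_test
print(n_train, n_test)  # prints: 61925 26540
```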
Feature importance: I use the first two features. When I add the third most important feature, some features have P>|t| > 0.05, i.e. they are not significant.
VIF test of these two features
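For exactly two predictors the VIF has a closed form: each feature's VIF equals 1 / (1 - r²), where r is the Pearson correlation between the two features. The sketch below uses made-up toy values, not the real columns, and the helper names are hypothetical.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_features(a, b):
    """VIF shared by both features when the model has only two predictors."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)

# Toy values for illustration only (not taken from the dataset).
living_area = [1200.0, 1500.0, 1800.0, 2100.0, 2600.0, 3000.0]
tax_amount = [3100.0, 4200.0, 3900.0, 6300.0, 5800.0, 8000.0]

vif = vif_two_features(living_area, tax_amount)
print(f"VIF = {vif:.2f}")  # compare against the VIF > 10 rule of thumb
```

With more than two features, each VIF comes from regressing that feature on all the others, so the closed form above no longer applies.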
Mean squared error on the test dataset for the linear regression model
Residuals for the finished living area feature only
Taxamount residuals
Pairplot
Q-Q plot for predicted y
Q-Q plot for y_test (actual log error)
These two features have the relatively largest variance according to the describe() command in pandas, and they are the top 2 features from random forest regression.
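For readers who want to check what the Q-Q plots above show, here is a rough sketch of how the plotted points are computed: sorted sample values against quantiles of a normal distribution fitted to the sample. The sample is synthetic, and the plotting positions `(i + 0.5) / n` are one common convention, not necessarily the one the plotting library used.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    n = len(sample)
    observed = sorted(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    # Plotting positions (i + 0.5) / n keep the probabilities strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))

# Synthetic stand-in for log-error values (an assumption for illustration).
sample = NormalDist(0.0, 0.06).samples(500, seed=1)
points = qq_points(sample)

# If the sample is close to normal, the pairs lie near the line y = x.
for theo, obs in points[::100]:
    print(f"theoretical = {theo:+.4f}, observed = {obs:+.4f}")
```

Points that bend away from the diagonal at the ends indicate heavier or lighter tails than a normal distribution.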
@1978abhay
Hi @1978abhay: I have uploaded the pictures that you mentioned today. It would be great if you could check them. Thanks very much. Kind regards, Mingwei
Correlation matrix
1. If I add more features, the VIF value rises above 10 and sometimes the P value is > 0.05 for some features.
2. But no matter how many features I add, the fitted line stays the same, as you can see from the original graph. The original line is very dense in the middle and sparse at both sides.
3. I do not know how to do this.
4. I use K-nearest neighbors for data imputation and remove outliers of the target variable (Z-score > 3.0) to make it normal. Previously I used Box-Cox for the data transformation, but it made some features weird. For example, TAXAMOUNT went from 6300 to 630, shrinking 10 times.
Since I don't know how to do the next step, which part do you want me to demonstrate? I could demonstrate it and might find some errors in it. Thanks.
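The Z-score filter on the target variable mentioned in point 4 could look roughly like this. The values are toy log errors, the helper names are hypothetical, and computing the bounds from the training rows only is a suggestion rather than what the notebook necessarily does.

```python
from statistics import mean, pstdev

def zscore_bounds(train_values, threshold=3.0):
    """Lower and upper cut-offs at `threshold` standard deviations from the mean."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    return mu - threshold * sigma, mu + threshold * sigma

def keep_inliers(values, low, high):
    """Keep only the values that fall inside [low, high]."""
    return [v for v in values if low <= v <= high]

# Toy target values (not real data): 20 small log errors plus one extreme value.
train_target = [0.01, -0.02, 0.03, 0.0, -0.01] * 4 + [2.5]

low, high = zscore_bounds(train_target, threshold=3.0)
kept = keep_inliers(train_target, low, high)
print(f"bounds = ({low:.3f}, {high:.3f}), kept {len(kept)} of {len(train_target)} rows")
```

Reusing the same `low` and `high` on the test rows keeps the filter consistent between the two sets.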
For 4, when I use Box-Cox for the transformation, it changes the data so that outliers cannot be detected on the test dataset.
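As background on why Box-Cox changes the scale of a feature such as TAXAMOUNT: the transform is (x^λ - 1) / λ for λ ≠ 0 and ln(x) for λ = 0, and it can be inverted as long as the same λ is kept. The sketch below uses an arbitrary λ = 0.25 as an assumption; in practice λ is usually estimated from the training data (for example by `scipy.stats.boxcox`) and then reused on the test data.

```python
import math

def boxcox(x, lam):
    """Box-Cox transform of a single positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def inv_boxcox(y, lam):
    """Inverse of the Box-Cox transform for the same lambda."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)

lam = 0.25  # assumption for illustration; not the lambda used in the notebook
for tax in (630.0, 6300.0, 63000.0):
    transformed = boxcox(tax, lam)
    restored = inv_boxcox(transformed, lam)
    print(f"{tax:>8.0f} -> {transformed:8.3f} -> {restored:8.1f}")
```

The transformed values are much closer together than the originals, which is why any outlier threshold has to be applied on the same scale in both the training and the test data.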
Latest feature importance
Latest feature importance
Updated the R-squared a little bit. @1978abhay
New fitted graph
Our mean squared error is 0.004099985307764925. The visualization graphs are in the backend channel (first model Jupyter notebook). @1978abhay