InseadDataAnalytics / INSEADAnalytics


Classification analysis #74

Open Tommicus opened 7 years ago

Tommicus commented 7 years ago

https://github.com/Tommicus/GOLDHR/blob/master/GOLDHR.Rmd

In the classification analysis, we used the code from the course website, adapted to our data set and our problem.

CART1, CART2 and the logistic regression all return -1.00 as the variable importance of the first independent variable in the set (Satisfaction level in our base case; we tried multiple independent variables).

The validation confusion matrix shows a huge Type 1 error (99.58%: people who stay although we predicted them to leave) but a relatively small Type 2 error (8.43%). If we increase the probability threshold, the Type 1 error doesn't decrease, which is counterintuitive.
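To make the expectation concrete, here is a minimal sketch in Python (not the course's R code) of how Type 1 and Type 2 errors are counted at a given probability threshold. The function name `confusion_counts`, the labels and the probabilities are all made up for illustration, and it assumes class 1 means "leaves" and that the model outputs the probability of class 1. Under those assumptions, raising the threshold can only move predictions from "leave" to "stay", so the number of false positives cannot go up. If the Type 1 error does not move at all on real data, it may be worth checking whether the class labels or the rows and columns of the matrix are swapped, or whether the threshold is actually being applied to the predictions.

```python
def confusion_counts(y_true, probs, threshold):
    """Return (tp, fp, tn, fn), predicting class 1 ("leaves") when prob > threshold."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, probs):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1  # Type 1 error: predicted to leave, actually stayed
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1  # Type 2 error: predicted to stay, actually left
    return tp, fp, tn, fn


# Made-up data for illustration only: 1 = leaves, 0 = stays.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# Hypothetical predicted probabilities of leaving.
probs = [0.05, 0.10, 0.20, 0.35, 0.55, 0.70, 0.30, 0.60, 0.80, 0.95]

fp_by_threshold = {}
for threshold in (0.3, 0.5, 0.7):
    tp, fp, tn, fn = confusion_counts(y_true, probs, threshold)
    fp_by_threshold[threshold] = fp
    type1_rate = fp / (fp + tn)  # share of actual stayers predicted to leave
    type2_rate = fn / (fn + tp)  # share of actual leavers predicted to stay
    print(f"threshold={threshold:.1f}  FP={fp}  FN={fn}  "
          f"type1={type1_rate:.2f}  type2={type2_rate:.2f}")
```

Note that the rates here are computed relative to the actual class (stayers for Type 1, leavers for Type 2). A report that divides by the number of predicted leavers instead would give different percentages, so it helps to confirm which definition the confusion matrix code uses.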

For the Test Accuracy confusion matrix the result is pretty much the same.

Can this be because some of the coefficients for the logistic regression are not significant? If we exclude them, the results are pretty much the same.

The logistic regression produces a lot of mistakes.

Any help much appreciated

egor-gazarov commented 7 years ago

Don't you want to try classification trees? We found the C5.0 package extremely good at predicting the correct class from the first run. Our files are here: https://github.com/egor-gazarov/PredictingEmployeesLeave-INSEAD17J-GP

Tommicus commented 7 years ago

Any advice on the Type 1 error please?