emilio-berti / monthly-programming-challenge


Select best model using stepwise AIC vs lasso #4

Open emilio-berti opened 4 years ago

emilio-berti commented 4 years ago

Background

I was in Belfast at BES, talking with some people about variable selection. When I said I select variables with a step-wise AIC(c) approach, one of them (A) looked at me in shock and horror. Apparently, I was doing it all wrong. In short, step-wise selection introduces biases (for example, overly optimistic p-values for the variables that survive selection), so the final model cannot be trusted at face value. A then told me that the newer method that avoids such biases is LASSO regression.
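To make the bias concrete, here is a minimal sketch on simulated data (not the wine data). The response is pure noise, so any predictor kept by the selection is a false positive. The sample size, the number of predictors, and the variable names are all made up for illustration; `step()` comes from base R's stats package.

```r
set.seed(1)   # arbitrary seed, only so the run is repeatable
n <- 200      # number of observations (made up)
p <- 20       # number of pure-noise predictors (made up)

# Predictors and response are independent draws: there is nothing to find.
X <- as.data.frame(matrix(rnorm(n * p), nrow = n))
names(X) <- paste0("noise", seq_len(p))
X$y <- rnorm(n)

# Backward step-wise selection; the default penalty k = 2 corresponds to AIC.
full <- lm(y ~ ., data = X)
best <- step(full, direction = "backward", trace = 0)

# Noise variables that survived the selection.
kept <- setdiff(names(coef(best)), "(Intercept)")
length(kept)

# These p-values are computed as if 'best' had been chosen in advance,
# i.e. they ignore the selection step.
summary(best)$coefficients
```

Any variable listed in `kept` was selected although it has no relation to `y`, and its p-value in the final summary does not account for the search that picked it.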

Challenge

We want to understand which factors determine the quality (variable quality) of vinho verde made from white grapes. The data are archived at https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv. A description of the dataset is available at https://archive.ics.uci.edu/ml/datasets/Wine+Quality.

Instructions

  1. Write your solution in a script.txt file and attach it as a reply to this issue.
  2. There is no package restriction.
  3. The issue will be closed on September 30.

What we want to model

    d <- read.csv("winequality-white.csv", sep = ";")
    str(d)
    hist(d$quality,  # we want to model this
         col = "grey80",
         main = "",
         xlab = "Quality")

[image: histogram of d$quality]

emilio-berti commented 4 years ago

Emilio.txt

ErikKusch commented 4 years ago

Erik.txt

emilio-berti commented 4 years ago

Hallo!

Twice as many people as usual replied this time, thanks Erik :) I am using an older version of glmnet (2.0-18) because of dependency issues. Here are the results:

Emilio

[image: Emilio's results]

Erik

Forgot to load the dataset?

    Error in terms.formula(object, data = data) : object 'd' not found
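This error usually means the formula refers to a data frame that was never created in the session. A minimal sketch of a guard, assuming the CSV has been downloaded locally: `load_wine` is a hypothetical helper name, not something from either script, and the default path is an assumption.

```r
# Hypothetical helper: read the semicolon-separated wine file into a data frame,
# failing early with a clear message if the file is not where we expect it.
load_wine <- function(path = "winequality-white.csv") {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (download it or fix the path first)")
  }
  read.csv(path, sep = ";")
}

# Usage, assuming the file sits in the working directory:
# d <- load_wine()
# full <- lm(quality ~ ., data = d)  # 'd' now exists, so the formula can be evaluated
```

Creating `d` at the top of the script, before any model call, makes the script runnable in a fresh session.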

Results for lasso:

    12 x 1 sparse Matrix of class "
    (Intercept)            15.502342425
    fixed.acidity           .
    volatile.acidity       -1.786555255
    citric.acid             .
    residual.sugar          0.024798381
    chlorides              -0.189039780
    free.sulfur.dioxide     0.003043232
    total.sulfur.dioxide    .
    density               -13.448154283
    pH                      0.061046933
    sulphates               0.292670997
    alcohol                 0.345966111
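For readers unfamiliar with this kind of output, here is a minimal sketch of how a table like this is typically produced with glmnet, run on simulated data so it works without the wine file. The data, the variable names, and the choice of `s = "lambda.1se"` are assumptions; the output above does not say which lambda was used.

```r
library(glmnet)  # assumed to be installed; it is not part of base R

set.seed(1)      # arbitrary seed, only so the run is repeatable
n <- 200
x <- matrix(rnorm(n * 5), nrow = n,
            dimnames = list(NULL, paste0("x", 1:5)))
y <- 2 * x[, "x1"] - 1 * x[, "x3"] + rnorm(n)  # only x1 and x3 affect y

# Cross-validated lasso (alpha = 1 selects the lasso penalty).
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Coefficients at the chosen lambda, returned as a sparse matrix;
# a "." in the printout is a coefficient shrunk to exactly zero.
b <- coef(cv_fit, s = "lambda.1se")
print(b)

# Names of the predictors with non-zero coefficients.
beta <- as.matrix(b)[, 1]
selected <- setdiff(names(beta)[beta != 0], "(Intercept)")
selected
```

For the wine data, `x` would be something like `as.matrix(d[, names(d) != "quality"])` and `y` would be `d$quality`. Variables printed as "." are the ones the lasso dropped from the model.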

No results for step-AIC (?)

Conclusions

Step-AIC and LASSO regression give similar results in this case. However, the best LASSO model also includes a very small but significant effect of citric acid and chlorides. In general, we like our vinho verde sweet, strong and quite smooth (low acidity and density). I agree ;)

What I learnt:

What I am still missing: