emilio-berti commented 4 years ago

Background

I was in Belfast at BES, talking with some people about variable selection. When I said I was selecting variables with a step-wise AIC(c) approach, a guy (A) looked at me in shock and horror. Apparently, I was doing it all wrong. In short, step-wise selection introduces biases that, in the end, lead to unreliable results. A then told me that the newer method that avoids such biases is LASSO regression.
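For reference, here is a minimal sketch of the kind of step-wise AIC selection I mean. It runs on simulated data (an assumption for illustration, not the wine dataset), uses base R's `step()` with backward elimination, and uses plain AIC rather than AICc; the variable names `x1` to `x6` are made up.

```r
# Simulated data (an assumption for illustration, not the wine dataset):
# y depends on x1 and x2 only; x3..x6 are pure noise.
set.seed(1)
n <- 200
sim <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(sim) <- paste0("x", 1:6)
sim$y <- 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n)

# Start from the model with all candidate predictors
full <- lm(y ~ ., data = sim)

# Backward elimination: at each step drop the term whose removal
# lowers AIC the most, and stop when no removal improves AIC
best <- step(full, direction = "backward", trace = 0)

# Predictors retained in the final model (intercept excluded)
selected_step <- names(coef(best))[-1]
print(selected_step)
```

Noise variables can survive this search by chance, which is one face of the bias A was complaining about.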

Challenge

We want to understand which factors determine the quality (variable quality) of vinho verde made from white grapes. The data for this are archived at https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv. A description of the dataset is available at https://archive.ics.uci.edu/ml/datasets/Wine+Quality.

Instructions

Write your solution in a script.txt file and attach it as a reply to this issue.
There is no package restriction.

The issue will be closed on September 30.

What we want to model

d <- read.csv("winequality-white.csv", sep = ";")  # the file is semicolon-separated
str(d)
hist(d$quality,  # the response we want to model
     col = "grey80",
     main = "",
     xlab = "Quality")

emilio-berti commented 4 years ago

Emilio.txt

ErikKusch commented 4 years ago

Erik.txt

emilio-berti commented 4 years ago

Hallo!

Twice as many people as usual replied this time - thanks Erik :) I am using an older version of glmnet (2.0-18) because of dependency issues. Here are the results:

Emilio

Erik

Forgot to load the dataset? Error in terms.formula(object, data = data) : object 'd' not found

Results for lasso:

12 x 1 sparse Matrix of class "
(Intercept)           15.502342425
fixed.acidity          .
volatile.acidity      -1.786555255
citric.acid            .
residual.sugar         0.024798381
chlorides             -0.189039780
free.sulfur.dioxide    0.003043232
total.sulfur.dioxide   .
density              -13.448154283
pH                     0.061046933
sulphates              0.292670997
alcohol                0.345966111
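In this printout a dot marks a coefficient that the LASSO shrank to exactly zero, i.e. a dropped predictor. As a small sketch of how to read that off programmatically, the snippet below rebuilds the coefficients above by hand as a sparse matrix and splits the terms into kept and dropped. The column name `s0` is an assumption; in a real session the matrix would come straight from `coef()` on the fitted glmnet object.

```r
library(Matrix)

# Coefficients copied from the printout above; "." entries are exact zeros
term_names <- c("(Intercept)", "fixed.acidity", "volatile.acidity", "citric.acid",
                "residual.sugar", "chlorides", "free.sulfur.dioxide",
                "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol")
est <- c(15.502342425, 0, -1.786555255, 0,
         0.024798381, -0.189039780, 0.003043232,
         0, -13.448154283, 0.061046933, 0.292670997, 0.345966111)

# Sparse one-column matrix; the column name "s0" is an assumption
b <- Matrix(est, ncol = 1, sparse = TRUE, dimnames = list(term_names, "s0"))

# Convert to a named numeric vector and split by zero / non-zero
dense <- as.matrix(b)[, 1]
kept <- setdiff(names(dense)[dense != 0], "(Intercept)")
dropped <- names(dense)[dense == 0]

print(kept)
print(dropped)
```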

No results for step-AIC (?)

Conclusions

Step-AIC and LASSO regression give similar results in this case. However, the best LASSO model also includes a very small, but significant, effect of citric acid and chlorides. In general, we like our vinho verde sweet, strong, and quite smooth (low acidity and density). I agree ;)

What I learnt:

LASSO is quite easy to perform and fast in execution. We don't need to input many parameters and the most important one (lambda) can be found heuristically very easily.
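A minimal sketch of what finding lambda can look like, assuming k-fold cross-validation with `cv.glmnet()` (one common choice, not necessarily what the attached scripts do). The data are simulated and the names `x1` to `x6` are made up for illustration.

```r
library(glmnet)

# Simulated data (an assumption for illustration, not the wine dataset):
# y depends on x1 and x2 only; x3..x6 are pure noise.
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 6), ncol = 6,
            dimnames = list(NULL, paste0("x", 1:6)))
y <- 2 * x[, "x1"] - 1.5 * x[, "x2"] + rnorm(n)

# 10-fold cross-validation over a grid of lambda values;
# alpha = 1 selects the LASSO penalty
cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

# lambda.min minimises the cross-validated error;
# lambda.1se is the largest lambda within one standard error of it
print(cvfit$lambda.min)
print(cvfit$lambda.1se)

# Coefficients at the more conservative lambda; zeros are dropped predictors
b <- as.matrix(coef(cvfit, s = "lambda.1se"))[, 1]
selected_lasso <- setdiff(names(b)[b != 0], "(Intercept)")
print(selected_lasso)
```

Using `lambda.1se` instead of `lambda.min` gives a sparser model at a small cost in cross-validated error; which one to prefer is a judgment call.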

What I am still missing:

How can we compute confidence intervals for the LASSO? Is it actually feasible and meaningful to compute them?

emilio-berti / monthly-programming-challenge

Select best model using stepwise AIC vs lasso #4
