Thinkful-Ed / machine-learning-regression-problems

Objective: Students can formulate a research question, conduct preliminary data analysis, prepare data, select features, and build, evaluate, and optimize a model using regression

Add brief explanation of `.get_dummies` to second checkpoint #8

Closed by benjaminEwhite 6 years ago

benjaminEwhite commented 6 years ago

It should come in earlier material, but not sure it does.

https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/2.simple_linear_regression_models.ipynb

YunusBulut commented 6 years ago

I added the part below as an explanation of get_dummies.

"Notice that the categorical variables are strings, so we need to convert them to numerical values. This can be viewed as part of the feature engineering process. One of the most convenient ways of converting categorical variables into numerical ones is called one hot encoding. In one hot encoding, we create a separate binary variable, taking the value 0 or 1, for each unique value of the categorical variable. Pandas' get_dummies() function does this job for us.

Below, we call the get_dummies() function for the sex and smoker categorical variables in our dataset. Since the sex and smoker variables each take two values, the get_dummies() function will create two dummy (indicator) variables for each of them. One dummy per variable is enough to indicate whether the person is male or not and is a smoker or not, so we keep only one of the newly created dummies for both sex and smoker in our data frame. We do this by passing drop_first=True to the get_dummies() function."
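For anyone reading this thread without the notebook open, here is a minimal sketch of what that explanation describes. The data frame below is a made-up toy example (the column values and the `age` column are assumptions, not the notebook's actual data), and `dtype=int` is passed only so the dummies print as 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy data; the notebook's real dataset is not shown in this thread.
df = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
})

# Full one hot encoding: one 0/1 indicator column per unique value.
full = pd.get_dummies(df, columns=["sex", "smoker"], dtype=int)

# drop_first=True drops the first category (in sorted order) of each
# variable, leaving a single indicator for each two-valued column.
encoded = pd.get_dummies(df, columns=["sex", "smoker"], drop_first=True, dtype=int)

print(list(full.columns))
# ['age', 'sex_female', 'sex_male', 'smoker_no', 'smoker_yes']
print(list(encoded.columns))
# ['age', 'sex_male', 'smoker_yes']
```

With `drop_first=True`, a row with `sex_male == 0` is female and a row with `smoker_yes == 0` is a non-smoker, so no information is lost, and the regression avoids two perfectly collinear indicator columns per variable.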