Preprocess Features for a Model

Most machine learning models expect exclusively numeric input features. Some (possibly most) of our features are categorical: age categories such as puppy, young, and adult, or breed names, for example.

Let's use pandas.DataFrame as the data structure for preparing our dataset for modeling. Scikit-learn, a widely used ML package, accepts DataFrames as input to its transformers and models.

Preprocessing ideas:

[x] Any True/False feature can be converted to 0/1 values
[x] Any ordinal features (like age category, or size category) should be mapped to numbers (e.g. baby: 0, young: 1, adult: 2, senior: 3)
[x] Categorical variables without ordering should be one-hot encoded. Scikit-learn's OneHotEncoder is helpful here. For dog breeds, since there's a huge number of possibilities, I'd recommend only keeping the most popular breeds. So, you might play around with the min_frequency or max_categories parameters so we don't end up with more than ~50 breeds
[x] Drop any columns that we don't want included in the model. For example, name or ID number should probably be dropped

I think it would make the most sense to organize this as a new step that runs on the output from data_cleaner. We could call it data_preprocessor?
