Repository for work related to an interactive data dashboard for analyzing how different dog characteristics may correlate with average length of stay in a shelter prior to adoption.
Most machine learning models expect exclusively numeric input features. Some (most?) of our features are categories (puppy, young, adult... or breed names for example).
Let's use pandas.DataFrame as the data structure in preparing our dataset for modeling. Scikit-learn, the most commonly used ML package, supports this datatype for running models.
Preprocessing ideas:
[x] Any True/False feature can be converted to 0/1 values
[x] Any ordinal features (like age category, or size category) should be mapped to numbers (e.g. baby: 0, young: 1, adult: 2, senior: 3)
[x] Categorical variables without ordering should be one-hot encoded. Scikit-learn's OneHotEncoder is helpful here. For dog breeds, since there's a huge number of possibilities, I'd recommend only keeping the most popular breeds. So, you might play around with the min_frequency or max_categories parameters so we don't end up with more than ~50 breeds.
[x] Drop any columns that we don't want included in the model. For example, name or ID number should probably be dropped
I think it would make the most sense to organize this as a new step that runs on the output from data_cleaner. We could call it data_preprocessor?
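To make the idea concrete, a data_preprocessor step might look something like the sketch below. It only covers the drop, True/False, and ordinal items from the list. The function name preprocess and the column names ("id", "name", "age") are hypothetical, since the actual output schema of data_cleaner isn't described here. The age ordering is the one from the example above.

```python
import pandas as pd

# Ordering taken from the example in the preprocessing list.
AGE_ORDER = {"baby": 0, "young": 1, "adult": 2, "senior": 3}

# Hypothetical identifier columns; adjust to whatever data_cleaner produces.
DROP_COLUMNS = ["id", "name"]


def preprocess(cleaned: pd.DataFrame) -> pd.DataFrame:
    """Turn a cleaned DataFrame into numeric model features (partial sketch)."""
    df = cleaned.copy()

    # Drop columns that should not be model inputs; ignore any that are absent.
    df = df.drop(columns=DROP_COLUMNS, errors="ignore")

    # Convert every True/False column to 0/1.
    bool_cols = df.select_dtypes(include="bool").columns
    df = df.astype({col: int for col in bool_cols})

    # Map the ordinal age category to integers; unknown labels become NaN.
    if "age" in df.columns:
        df["age"] = df["age"].map(AGE_ORDER)

    return df
```

Keeping this as a pure function that takes and returns a DataFrame should make it easy to chain after data_cleaner and to test on small hand-written frames.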