I enjoyed reading your report and I think it helped me understand the details of the assignment much better (in particular the beginning exploratory analysis). A few observations/thoughts:
When encoding the categorical variables, you didn't use the functions suggested (the scikit-learn preprocessing package). It looks like what you did worked fine, but it's worth looking into, since it could be faster in the future. I did mine with the preprocessing.LabelEncoder() function.
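For reference, here's a minimal sketch of how LabelEncoder works; the column values are made up for illustration, not taken from the assignment data:

```python
from sklearn import preprocessing

# Hypothetical categorical column -- the actual assignment features differ.
colors = ["red", "blue", "green", "blue", "red"]

le = preprocessing.LabelEncoder()
encoded = le.fit_transform(colors)  # maps each category to an integer code

# Codes follow the alphabetical order of the classes: blue=0, green=1, red=2
print(list(encoded))      # [2, 0, 1, 0, 2]
print(list(le.classes_))  # ['blue', 'green', 'red']
```

inverse_transform() gets the original labels back, which is handy when inspecting predictions.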
For the decision tree section: decision trees inherently overfit the data, so it's useful to set a max depth or prune the tree in some way.
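A quick sketch of what I mean, using the iris dataset as a stand-in for the assignment data: an unconstrained tree grows until its leaves are pure, while max_depth acts as simple pre-pruning.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris is just a stand-in; swap in the assignment's features and labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unconstrained tree: keeps splitting until leaves are pure (memorizes training data).
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Depth-limited tree: a simple form of pre-pruning to reduce overfitting.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print(deep.score(X_train, y_train))   # typically 1.0 on the training set
print(shallow.score(X_test, y_test))  # often generalizes as well or better
```

Comparing train vs. test accuracy for the two trees makes the overfitting visible.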
For your k-fold cross-validation you pick the number of folds that yields the maximum accuracy, which isn't really the point of k-folds (the idea is to get the best estimate of what your out-of-sample accuracy will be).
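To illustrate, with cross_val_score the fold count is a fixed estimation choice rather than something to optimize; the mean fold score is the out-of-sample estimate. Iris and logistic regression here are placeholders for the assignment's data and model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# k=5 is chosen up front; we don't search over k for the highest score.
# The mean across folds estimates how the model will do on unseen data.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```

The spread across folds (the std) is also useful as a rough error bar on that estimate.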
Your explanation of learning curves wasn't all that thorough (though I went in not really knowing what they were at all).
The notebook could have used a bit more explanatory text throughout, but you explained all the main points.
I like your crosstab tables, definitely going to start using those
I used preprocessing.LabelEncoder() as well, within the function transform_features().
I dropped max_depth when I was performing classification, but yes, I agree I should go back and see how the models perform when I limit the depth.
@ghego, @craigsakuma, @kebaler