bobchen1701 / Bios8366HW


HW3 #3

Open bobchen1701 opened 5 years ago

bobchen1701 commented 5 years ago

@nstrayer

nstrayer commented 5 years ago

1

While the city of embarkation itself may not have affected survival probability, it could be an indicator for something that did, such as socioeconomic status.

By simply excluding rows that had missingness in them, you are potentially biasing your model. A simple solution in this case would be to impute the missing values with something like the mean or median; a more complex solution would be to use fancy multiple imputation models. Any time you can avoid throwing away data in modeling, you should. If this had been a problem where you wanted to interpret the results (like the coefficients of the logistic regression), omitting the rows with missing values would have invalidated your parameter estimates. Since we're just predicting here it's not as big of a deal, but I think you'd find you get better performance after imputing, simply because the model has more data to learn from on the non-imputed columns.
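A minimal sketch of the simple version, assuming the standard Titanic column names (Age, Embarked) and a placeholder file path:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # placeholder path; use your own data file

# Numeric column: fill missing values with the median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: fill missing values with the most common level
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
```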

You wisely used GridSearchCV for your logistic regression tuning; why not for the SVM? By looking only at kernels and fixing the other hyperparameters, you are missing out on performance gains that come from exploring the interaction of the hyperparameters. Also, plotting the grid-search results is always valuable: it lets you know whether you searched an extensive enough range or whether your model was still improving at the edge of the grid.
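Something along these lines would search the hyperparameters jointly (a sketch only; X_train and y_train are assumed to already exist, and the grid values are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Search kernel, C, and gamma together rather than fixing all but the kernel
param_grid = {
    "kernel": ["linear", "rbf", "poly"],
    "C": [0.01, 0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}
svm_search = GridSearchCV(SVC(), param_grid, cv=5)
svm_search.fit(X_train, y_train)

print(svm_search.best_params_, svm_search.best_score_)
```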

When you are looking at the performance of these models, you are going to want a metric that is robust to unbalanced outcomes. In this case there is not an equal number of survived vs. not-survived passengers, so accuracy is not a great choice. A better score to look at is F1, which accounts for this imbalance.
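For example (assuming the svm_search object from the sketch above and an existing X_test / y_test split):

```python
from sklearn.metrics import f1_score

# Evaluate on F1 rather than raw accuracy
print(f1_score(y_test, svm_search.predict(X_test)))

# Or have the grid search optimize F1 directly:
# GridSearchCV(SVC(), param_grid, cv=5, scoring="f1")
```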

-3

2

So here you actually did the right thing in dropping your NAs. If you had imputed here, you would have falsely forced the model to be confident around the imputed values (think about those ribbon plots showing the confidence bands getting squeezed around a point). In fact, Gaussian processes are themselves a very powerful imputation method.

These results look okay, but I'd look into the periodic covariance function. It has the nice ability to cleanly specify the period of the oscillation, and since we have seasonal data we can put a pretty dang strong prior on 365 days. This helps us avoid needing so many inducing points and lets the model make reasonable predictions outside the range of the data. If you just replace your covariance with pm.gp.cov.Periodic(1, period=365, ls=ls), you will find the model converges much better.
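A rough sketch of what that swap could look like in PyMC3 (the priors, variable names like X_obs / y_obs, and the use of a plain Marginal GP are placeholders, not your original model):

```python
import pymc3 as pm

with pm.Model() as model:
    ls = pm.Gamma("ls", alpha=2, beta=1)      # length-scale prior (placeholder)
    eta = pm.HalfNormal("eta", sigma=2)       # amplitude prior (placeholder)

    # Periodic kernel with the period fixed at one year of daily data
    cov = eta**2 * pm.gp.cov.Periodic(1, period=365, ls=ls)
    gp = pm.gp.Marginal(cov_func=cov)

    sigma = pm.HalfNormal("sigma", sigma=1)   # observation noise
    y_ = gp.marginal_likelihood("y", X=X_obs, y=y_obs, noise=sigma)  # X_obs is (n, 1)

    mp = pm.find_MAP()
```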

I like your plots a lot here (although some of the text is a bit small).

3

Good job with the grid search here. That being said, plots of the results are needed to really investigate the patterns; with a combination of colored lines and faceting you can get a good look at them in a single big plot (see the sketch below). We want to look for interesting interactions between hyperparameter values. Also, as mentioned in the first problem, plotting helps you make sure you chose a wide enough range of values to search over. For instance, with random forests you want to let your trees get very deep if needed and let the averaging over many trees smooth out the over-fitting. Your search found a best max depth of 4, but if you looked at the patterns that produced that value, you would most likely find you are leaving performance on the table by not letting the search explore deeper trees combined with different values of the other hyperparameters.
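One possible way to make that plot, assuming a fitted GridSearchCV object (here called rf_search) whose grid included max_depth, n_estimators, and max_features:

```python
import pandas as pd
import seaborn as sns

# cv_results_ has one row per hyperparameter combination
results = pd.DataFrame(rf_search.cv_results_)
results["param_max_depth"] = results["param_max_depth"].astype(float)

# Colored lines for one hyperparameter, facets for another
sns.relplot(
    data=results,
    x="param_max_depth",
    y="mean_test_score",
    hue="param_n_estimators",
    col="param_max_features",
    kind="line",
)
```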

The feature importance plot is a good addition. These can often help you understand the tuning results well.

-3

4

Here you let the max-depth go down to 1 for the gradient boosted tree! Good!

A way of getting at whether you've over- or under-fit is to plot the train and test accuracy trajectories as you increase the proportion of your training data used to fit the model. An over-fit model will rapidly shoot its training accuracy to 1 while its testing accuracy plateaus; an under-fit model will plateau both its training and testing accuracy rather quickly.
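A quick sketch of that plot using scikit-learn's learning_curve (gb_model, X, and y are assumed to already exist):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# Train on increasing fractions of the data and record train/validation scores
sizes, train_scores, test_scores = learning_curve(
    gb_model, X, y, train_sizes=np.linspace(0.1, 1.0, 8), cv=5
)
plt.plot(sizes, train_scores.mean(axis=1), label="train")
plt.plot(sizes, test_scores.mean(axis=1), label="validation")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```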

You will almost always see that your train accuracy is higher than your test accuracy regardless, so it's hard to say whether you've over- or under-fit from just that single data point.

-3

Grade: 31/40