Open datawookie opened 6 years ago
Hi Andrew, thanks for the questions and suggestions - really useful! I went with glm() as a starting point, but realise that a decision-tree model would probably be better for this type of classification. I see XGBoost seems to be really popular, especially on Kaggle. My next mistake was using the test data supplied by the Kaggle competition rather than doing cross-validation to evaluate my model. I came across the term "data snooping" while looking at this again today! So I've reworked my code: I used the data-splitting function from caret and kept 10% of the data as a validation set, realising the disadvantage that I'm losing some training data. I still need to look at k-fold cross-validation. I then looked at how to compute AUROC - I used ROCR to calculate AUC on the validation set (and got 0.739). And finally I made predictions on app_test for submission.
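For reference, the split-then-score workflow described above might look roughly like this sketch. It assumes a training frame `app_train` with a binary `TARGET` column (the names are illustrative, matching the `app_test` naming in the thread, not confirmed code):

```r
library(caret)
library(ROCR)

set.seed(42)

# Hold back 10% of the training data as a validation set,
# stratified on the outcome (createDataPartition is caret's splitter).
idx <- createDataPartition(app_train$TARGET, p = 0.9, list = FALSE)
train_set <- app_train[idx, ]
valid_set <- app_train[-idx, ]

# Baseline logistic regression.
model <- glm(TARGET ~ ., data = train_set, family = binomial)

# Predicted probabilities on the held-out validation set.
probs <- predict(model, newdata = valid_set, type = "response")

# AUC via ROCR.
pred <- prediction(probs, valid_set$TARGET)
auc  <- performance(pred, measure = "auc")@y.values[[1]]
```

The downside mentioned above is visible here: the 10% in `valid_set` never contributes to fitting, which is what k-fold CV avoids.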
Aha! Cool. Well, since you are now thinking about CV, perhaps you would be interested to know that you can actually do CV from within caret. For example:
```r
fit <- train(survived ~ ., data = Titanicp, method = "glm",
             na.action = na.omit,
             trControl = trainControl(
               method = "repeatedcv",
               number = 10,
               repeats = 5,
               verboseIter = TRUE
             ))
```
That was actually one of the examples that I presented during the training that I gave today.
But there's more: you can also get caret to generate the AUC score for you.
```r
TRCONTROL <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  verboseIter = TRUE
)

fit <- train(survived ~ ., data = Titanicp, method = "rpart",
             metric = "ROC",
             na.action = na.omit,
             trControl = TRCONTROL)
```
Now, rather than optimising for accuracy, caret will choose the model with the best AUC.
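Once a model has been trained with metric = "ROC", the cross-validated AUC estimates live on the fitted object. A quick sketch of how you might inspect them (continuing from the `fit` object above):

```r
# Resampled performance for each candidate tuning value;
# with twoClassSummary the columns include ROC, Sens and Spec.
fit$results

# The tuning parameters of the model selected by best AUC.
fit$bestTune

# classProbs = TRUE also lets you request class probabilities
# rather than hard class labels on new data.
probs <- predict(fit, newdata = Titanicp, type = "prob")
```

One caret gotcha worth knowing: classProbs = TRUE requires the outcome's factor levels to be valid R variable names (e.g. "died"/"survived", not "0"/"1"), otherwise train() will complain.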
Best regards, Andrew.
Oh wow, that's really useful to know. And method = "repeatedcv" is specifically repeated k-fold CV. I'll have a look at implementing this! Thanks for the examples, especially for the AUC score - I hadn't seen or known that was available from within caret either.
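For what it's worth, plain (unrepeated) k-fold CV is just a different method string in the same trainControl() call; a minimal sketch:

```r
library(caret)

# Plain 10-fold cross-validation: each observation is held out once.
ctrl <- trainControl(method = "cv", number = 10)

# "repeatedcv" with repeats = 5 instead runs the same 10-fold
# procedure five times with different fold assignments and
# averages the results, which reduces the variance of the estimate.
```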
Hi!
I'm intrigued that you simply used glm() to build your model. How did you estimate your AUC before submission? I would probably have used cross-validation to get a local estimate of how well my model does before making a submission. Best regards, Andrew.