MeganBeckett / lubridate-heatmaps_tutorial

Lubridate tutorial and dataviz project using running data

How did you validate your model? #2

datawookie opened 6 years ago

datawookie commented 6 years ago

Hi!

I'm intrigued that you simply used glm() to build your model. How did you estimate your AUC before submission? I would probably have used cross-validation to get a local estimate of how well my model does before making a submission.

Best regards, Andrew.

MeganBeckett commented 6 years ago

Hi Andrew, thanks for the questions and suggestions - really useful! I went with glm() as a starting point, but realise that a tree-based model would probably suit this type of classification better. I see XGBoost seems to be really popular, especially on Kaggle.

My next mistake was evaluating my model on the test data supplied by the Kaggle competition rather than doing cross-validation. I came across the term "data snooping" while looking at this again today! So I've reworked my code and used caret's data-splitting function to keep back 10% of the data as a validation set, realising the disadvantage that I lose some training data. I still need to look at k-fold cross-validation. I then used ROCR to compute the AUC on the validation set (and got 0.739), and finally made predictions on app_test for submission.
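
Roughly, the reworked workflow looks something like this (a sketch rather than my exact code - I'm assuming caret's createDataPartition() for the split and ROCR's prediction()/performance() for the AUC, with app_train and TARGET as placeholder names for the training frame and outcome column):

library(caret)
library(ROCR)

set.seed(42)

# Hold back 10% of the training data as a validation set.
idx <- createDataPartition(app_train$TARGET, p = 0.9, list = FALSE)
train_set <- app_train[idx, ]
valid_set <- app_train[-idx, ]

# Baseline logistic regression.
model <- glm(TARGET ~ ., data = train_set, family = binomial)

# Predicted probabilities on the held-out 10%.
probs <- predict(model, newdata = valid_set, type = "response")

# AUC on the validation set via ROCR.
pred <- prediction(probs, valid_set$TARGET)
performance(pred, measure = "auc")@y.values[[1]]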

datawookie commented 6 years ago

Aha! Cool. Well, since you are now thinking about CV, perhaps you would be interested to know that you can actually do CV from within caret?

For example:

library(caret)

# Titanicp: Titanic passenger data (available in, e.g., the stablelearner package).
# 10-fold cross-validation, repeated 5 times.
fit <- train(survived ~ ., data = Titanicp, method = "glm",
             na.action = na.omit,
             trControl = trainControl(
               method = "repeatedcv",
               number = 10,
               repeats = 5,
               verboseIter = TRUE
             ))

That was actually one of the examples I presented during the training I gave today.

But there's more. You can also get caret to generate the AUC score for you.

# classProbs = TRUE is required so that caret computes class probabilities,
# and twoClassSummary reports ROC (AUC), sensitivity and specificity.
# Note: the outcome must be a factor whose levels are valid R names.
TRCONTROL <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  verboseIter = TRUE
)

# metric = "ROC" makes train() select the candidate model with the best AUC.
fit <- train(survived ~ ., data = Titanicp, method = "rpart",
             metric = "ROC",
             na.action = na.omit,
             trControl = TRCONTROL)

Now, rather than optimising for accuracy, train() will choose the model with the best AUC.
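
Once it has run, you can inspect the cross-validated AUC directly (a quick sketch, assuming the fit object above):

# Resampled ROC (AUC), sensitivity and specificity for each candidate model.
fit$results

# Performance of the selected model across the resamples.
getTrainPerf(fit)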

Best regards, Andrew.

MeganBeckett commented 6 years ago

Oh wow, that's really useful to know. And method = "repeatedcv" specifies repeated k-fold CV. I'll have a look at implementing this! Thanks for the examples, especially for the AUC score - I hadn't known that was possible from within caret either.