Azure / Azure-TDSP-Utilities

Utilities and scripts developed as part of Microsoft's Team Data Science Process for productive data science

Challenges with Error Messages in Binary-Classification Modeling #19

Closed. Pelonza closed this issue 7 years ago.

Pelonza commented 7 years ago

I've been trying to use the binary classification module on the classic Titanic dataset and am running into quite a few challenges...

Some of this could be cleared up by making sure there is better error catching and reporting...

For example: 1) I was running with missing values still in the Age feature (from the basic training dataset from Kaggle.com). The error I was getting said I was trying to sort something that was a list. After a lot of head-scratching, I realized it was trying to sort a column that still had non-numeric values in it, which was probably Age. After removing/dealing with them, I was able to get AMR to run (farther).
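A quick pre-check along these lines would have surfaced the problem up front (the data frame and column names here come from the Kaggle training CSV, not from the AMR utility itself):

```r
# Rough sketch of the pre-check I ended up doing by hand; "train.csv" is the
# Kaggle Titanic training file, nothing here comes from the AMR utility.
titanic <- read.csv("train.csv", stringsAsFactors = FALSE)

# Columns with missing values (Age has them in the raw Kaggle data).
colSums(is.na(titanic))

# Columns that are neither numeric nor factor (free text such as Name, Ticket, Cabin).
names(titanic)[!sapply(titanic, function(x) is.numeric(x) || is.factor(x))]

# Simplest way to deal with the missing Age values: drop those rows (or impute instead).
titanic <- titanic[!is.na(titanic$Age), ]
```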

2) After cleaning the data up some more (and limiting it to far fewer columns to reduce issues, specifically Age and Fare), I'm getting an error when trying to run the glmnet model: "train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()". After some investigation, the controlObject for glmnet IS properly set to have classProbs = TRUE, so there's clearly an error somewhere in actually computing those class probabilities that did NOT properly raise an error/exception. I'm still trying to trace back and figure out where that might be, but there's clearly some information missing in these error messages...
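For reference, the control object I'm looking at has the usual caret settings for class probabilities (this is a sketch with my own names and fold counts, not the utility's exact code):

```r
library(caret)

# What I checked: the control object does ask caret for class probabilities,
# which is what the ROC metric needs. Names and resampling settings are my own.
controlObject <- trainControl(method = "repeatedcv",
                              number = 5,
                              repeats = 3,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary,
                              savePredictions = TRUE)
```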


It's also possible there are a lot more errors in my "input" files...


Perhaps an alternative would be to have a more comprehensive "check dataset" tool that made sure the input datasets (as specified by the YAML with exclusions/inclusions) met the formats the models expect, and if not, produced a report of the errors.
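A minimal version of such a check might look something like this; the function name, checks, and messages are all hypothetical, nothing like it exists in the TDSP utilities today:

```r
# Hypothetical pre-flight check for the AMR input data frame. The function name,
# checks, and messages are illustrative only, not part of the TDSP utilities.
check_dataset <- function(df, target) {
  problems <- character(0)

  if (!target %in% names(df)) {
    problems <- c(problems, sprintf("target column '%s' not found", target))
  } else if (!is.factor(df[[target]]) || nlevels(df[[target]]) != 2) {
    problems <- c(problems, sprintf("target column '%s' should be a two-level factor", target))
  }

  na_cols <- names(df)[colSums(is.na(df)) > 0]
  if (length(na_cols) > 0) {
    problems <- c(problems, paste("columns with missing values:",
                                  paste(na_cols, collapse = ", ")))
  }

  bad_cols <- names(df)[!sapply(df, function(x) is.numeric(x) || is.factor(x))]
  if (length(bad_cols) > 0) {
    problems <- c(problems, paste("columns that are neither numeric nor factor:",
                                  paste(bad_cols, collapse = ", ")))
  }

  if (length(problems) == 0) "No issues found." else problems
}
```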

In some sense, this seems to be missing between the IDEAR and AMR tools... while IDEAR lets you see what state the data is in, there isn't (or at least, I seem to have missed it) a clear specification of the condition the data/data frames need to be in to run AMR.

Pelonza commented 7 years ago

After a bit more digging: apparently the "y" column for the models needs to be input as a "factor".

http://stats.stackexchange.com/questions/26084/how-do-i-compute-class-probabilities-in-caret-package-using-glmnet-method

That at least gets rid of my error... but it's probably worth adding a type check within the AMR code (it's easier to remind people there's an error in their process than to have them parse obtuse error messages).
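For anyone else who hits this, the conversion is roughly the following (the column name Survived is from the Kaggle data; the level labels are my own choice, caret just needs levels that are valid R names):

```r
# Convert the 0/1 label into a factor whose levels are valid R names,
# which is what caret needs when classProbs = TRUE. "Survived" is the
# Kaggle Titanic label column; the level labels here are arbitrary.
titanic$Survived <- factor(titanic$Survived,
                           levels = c(0, 1),
                           labels = c("Died", "Survived"))
str(titanic$Survived)   # Factor w/ 2 levels "Died","Survived"
```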

deguhath commented 7 years ago

Thanks for testing the AMAR utility. We agree that more error checking would be useful to include in the code. We also agree that there is room for a data preparation, transformation, and featurization utility between the IDEAR and AMAR tools. To help with running the AMAR tool itself, we have tried to provide detailed information here: https://github.com/Azure/Azure-TDSP-Utilities/blob/master/DataScienceUtilities/Modeling/team-data-science-process-automated-modeling-reporting-instructions.md, and based on your experience it seems this could be enhanced as well.

Thank you very much for all your feedback. We will keep these points in mind when investing further in improving the AMAR utility.