MichelleLochner / supernova-machine


More GridSearch problems #7

Open mkerrwinter opened 9 years ago

mkerrwinter commented 9 years ago

I thought I'd figured out why I was getting different AUC values from GridSearch's internal ROC AUC function and our own one by saying GridSearch was using the SVC.decision_function values as "probabilities". I've now realised that this is quite silly, as the decision function for SVC is the distance between a data point and the separating hyperplane - i.e. it is not bounded, so it definitely can't be used like a probability, which is what a ROC curve needs.

So I went back to looking through the source code to find the bit where the 'scores' (as they call the probabilities or probability-esque values sent to the ROC AUC calculating function) are actually computed, but I've hit a bit of a dead end. GridSearchCV.fit uses a function called sklearn.metrics._score to get the score values. And sklearn.metrics._score (as far as I can see) calls 'scorer', which is a function defined by the user when they originally call GridSearchCV ('roc_auc' in our case), passing it the variables 'estimator' and 'X_test' (so in our case an SVM and some feature data).

According to http://scikit-learn.org/stable/modules/model_evaluation.html, the input string 'roc_auc' (i.e. the function 'scorer') corresponds to sklearn.metrics.roc_auc_score. But roc_auc_score takes as input parameters 'y_true' and 'y_score', not an estimator and some feature data! So I don't understand why this doesn't just give an error, and I also don't understand where in the code the probability scores are being calculated.

Do you have any ideas/suggestions? (Sorry for the massive post.)
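For reference, here is a minimal sketch (not from our repo, and the exact internals depend on the scikit-learn version) of what I think the 'roc_auc' string resolves to: not roc_auc_score itself, but a scorer object that takes (estimator, X, y_true), calls the estimator's decision_function (or predict_proba) itself, and only then passes the resulting scores to roc_auc_score. That would explain why there is no signature error.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, get_scorer

X, y = make_classification(n_samples=200, random_state=0)
clf = SVC(kernel='linear').fit(X, y)

# 1. What GridSearchCV presumably does internally when scoring='roc_auc':
#    the string is looked up and returns a scorer wrapper (roughly
#    make_scorer(roc_auc_score, needs_threshold=True)), which is callable
#    as scorer(estimator, X, y_true).
scorer = get_scorer('roc_auc')
auc_from_scorer = scorer(clf, X, y)

# 2. The equivalent manual call: decision_function output is an unbounded
#    score, but ROC AUC only depends on the *ranking* of the scores, so it
#    does not have to be a calibrated probability.
auc_manual = roc_auc_score(y, clf.decision_function(X))

print(auc_from_scorer, auc_manual)  # I'd expect these two to agree
```

If that's right, the "probabilities" are never computed at all; the scorer just feeds decision_function values straight into roc_auc_score.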

MichelleLochner commented 9 years ago

Hey, let's chat about this tomorrow.
