bmurauer / pipelinehelper

scikit-helper to hot-swap pipeline elements
GNU General Public License v3.0

scoring with ROC_AUC #2

Closed by jessequinn 5 years ago

jessequinn commented 5 years ago

Great work here.

I do realize that roc_auc scoring doesn't work because predict_proba and similar methods are not forwarded, but do you have an ETA on this? I would love to use your class with roc_auc scoring.
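For readers wondering what "forwarding" means here: scikit-learn's roc_auc scorer needs continuous scores, so it looks for decision_function or predict_proba on the estimator it is given. The sketch below is not pipelinehelper's code; `ForwardingClassifier` is a hypothetical wrapper, and the synthetic data and LogisticRegression are assumptions chosen only to make the example run. It shows that a wrapper which passes predict_proba through to its inner model can be scored with roc_auc.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


class ForwardingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper (not pipelinehelper's class) that delegates to an inner model."""

    def __init__(self, model=None):
        self.model = model

    def fit(self, X, y):
        # Fit a fresh copy so the wrapper can be cloned safely during cross-validation.
        self.model_ = clone(self.model).fit(X, y)
        self.classes_ = self.model_.classes_
        return self

    def predict(self, X):
        return self.model_.predict(X)

    def predict_proba(self, X):
        # Without this method the roc_auc scorer has nothing to work with.
        return self.model_.predict_proba(X)


# Synthetic binary problem, assumed purely for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

wrapped = ForwardingClassifier(model=LogisticRegression(max_iter=1000))
scores = cross_val_score(wrapped, X, y, cv=3, scoring="roc_auc")
print(scores)
```

If predict_proba were removed from the wrapper, the same call would no longer be able to compute roc_auc, which is the situation described in this issue.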

bmurauer commented 5 years ago

Good catch, thanks! I should be able to fix that in a few days.

jessequinn commented 5 years ago

One other thing.

In your example you run nb_pipe with a MinMaxScaler in addition to the std and max scalers. Why?

From what I understand, std runs first and standardizes the data (zero mean, unit variance), while minmax runs just before the MultinomialNB and rescales the data to between 0 and 1. That seems a little redundant, or am I wrong?

I only ask because your example produced the following output on a data set I'm using.

# Tuning hyper-parameters for accuracy

Fitting 3 folds for each of 3738 candidates, totalling 11214 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   12.3s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   21.8s
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:   39.7s
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:  9.5min
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed: 15.8min
[Parallel(n_jobs=-1)]: Done 6034 tasks      | elapsed: 71.6min
[Parallel(n_jobs=-1)]: Done 7184 tasks      | elapsed: 211.5min
[Parallel(n_jobs=-1)]: Done 8434 tasks      | elapsed: 967.1min
[Parallel(n_jobs=-1)]: Done 9784 tasks      | elapsed: 982.8min
[Parallel(n_jobs=-1)]: Done 11214 out of 11214 | elapsed: 998.4min finished
{'classifier__selected_model': ('nb_pipe', {'nb__fit_prior': True, 'nb__alpha': 0.1}), 'scaler__selected_model': ('std', {'with_mean': True, 'with_std': True})}
0.91085

Or does it simply ignore the MinMaxScaler, or does it in fact run the MinMaxScaler and mislabel it?

bmurauer commented 5 years ago

You are absolutely right, I will change the example to contain more useful pipeline elements.

Originally, I used the helper with two different scalers, where one needed dense data and one could work on sparse data. I wanted to show that the "densifier" could be combined with the corresponding scaler.

The MinMaxScaler is "required" because MultinomialNB does not accept negative values, but I agree that this example is misleading.
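A small sketch of the scaler interplay, using plain scikit-learn rather than pipelinehelper. The toy array and labels are made up for illustration. It checks three things: standardized data contains negative values, MultinomialNB refuses to fit on them, and min-max scaling after standardization gives the same result as min-max scaling the raw data (which is why the combination is redundant, as long as no feature is constant).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data, assumed purely for illustration.
X = np.array([[1.0, 20.0], [2.0, 10.0], [3.0, 40.0], [4.0, 30.0]])
y = np.array([0, 0, 1, 1])

# Standardizing centers each feature at zero, so some values become negative.
X_std = StandardScaler().fit_transform(X)
has_negatives = bool((X_std < 0).any())

# MultinomialNB raises a ValueError when fitted on negative feature values.
try:
    MultinomialNB().fit(X_std, y)
    nb_accepts_standardized = True
except ValueError:
    nb_accepts_standardized = False

# Min-max scaling maps every feature back into [0, 1], so the fit succeeds.
X_rescaled = MinMaxScaler().fit_transform(X_std)
MultinomialNB().fit(X_rescaled, y)

# Min-max scaling undoes any per-feature shift and positive rescaling,
# so the preceding standardization has no effect on the final values.
same_as_direct_minmax = bool(
    np.allclose(X_rescaled, MinMaxScaler().fit_transform(X))
)

print(has_negatives, nb_accepts_standardized, same_as_direct_minmax)
```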

The output shows that nb_pipe yielded the best results. However, only the parameters that were provided explicitly to this part of the pipeline (nb__fit_prior and nb__alpha) show up in the result list. This means that the MinMaxScaler was run with the parameters given at its definition (line 27).
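The same reporting behaviour can be seen with a plain scikit-learn Pipeline and GridSearchCV, sketched below. This is not the pipelinehelper example: the data is synthetic, and the `feature_range=(0, 2)` setting is an arbitrary assumption used only to make the "fixed at definition" value visible. The point is that best_params_ lists only the keys that appear in the parameter grid, while steps that were not searched keep whatever they were constructed with.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, assumed purely for illustration.
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

pipe = Pipeline([
    # Configured once here and never mentioned in the grid below.
    ("minmax", MinMaxScaler(feature_range=(0, 2))),
    ("nb", MultinomialNB()),
])

# Only the naive Bayes parameters are searched.
param_grid = {"nb__alpha": [0.1, 1.0], "nb__fit_prior": [True, False]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

# The reported parameters contain only the searched keys ...
searched_keys = sorted(search.best_params_)
# ... while the scaler still ran with the setting from its definition.
scaler_range = search.best_estimator_.named_steps["minmax"].feature_range

print(searched_keys, scaler_range)
```

So a scaler that is missing from the reported parameters was neither skipped nor mislabeled; it simply was not part of the search.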

jessequinn commented 5 years ago

So essentially it is called with default parameters, that is to say MinMaxScaler(), since no parameters were assigned?

Anyway, thanks for the class. It is quite useful to me, and will be even more so once roc_auc scoring works.