Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

AutoMLConfig "preprocess" argument does not work as expected. #135

Closed: dataders closed this issue 5 years ago

dataders commented 5 years ago

Why does the pipeline still include a pre-processing step when preprocess = False is set and the preprocessor is blacklisted?

My understanding is that LightGBM wouldn't even benefit from scaling of numeric features.

Steps to Reproduce

When I set preprocess = False and blacklist the preprocessor MaxAbsScaler, the experiment submission output still shows MaxAbsScaler LightGBM as the name of the pipeline.

Two lines of the log file stand out to me:

2018-12-12 15:43:07,434 - INFO - 107 : No preprocessing of data to be done here
2018-12-12 15:25:37,116 - INFO - 491 : Start executing pipeline

    {
        "pipeline_id": "7b69e5810e1d3cc689113164357e84afece9f816",
        "objects": [
            {
                "param_args": [],
                "prepared_kwargs": {},
                "module": "sklearn.preprocessing",
                "spec_class": "preproc",
                "class_name": "StandardScaler",
                "param_kwargs": {
                    "with_std": true,
                    "with_mean": false
                }
            },
            {
                "param_args": [],
                "prepared_kwargs": {},
                "module": "automl.client.core.common.model_wrappers",
                "spec_class": "sklearn",
                "class_name": "LightGBMClassifier",
                "param_kwargs": {
                    "boosting_type": "gbdt",
                    "colsample_bytree": 0.6933333333333332,
                    "min_split_gain": 0.10526315789473684,
                    "max_bin": 90,
                    "learning_rate": 0.026323157894736843,
                    "n_estimators": 200,
                    "min_child_weight": 2,
                    "reg_lambda": 0.10526315789473684,
                    "max_depth": 9,
                    "reg_alpha": 0.7368421052631579,
                    "min_data_in_leaf": 0.010353793103448278,
                    "subsample": 0.3963157894736842,
                    "num_leaves": 116
                }
            }
        ]
    }

More info

Params

import logging

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task = 'classification',
    num_classes = 2,
    debug_log = 'automl_errors.log',
    primary_metric = 'AUC_weighted',
    iteration_timeout_minutes = 200,
    max_cores_per_iteration = 2,
    iterations = 1,
    verbosity = logging.DEBUG,
    X = X_train, 
    y = y_train,
    X_valid = X_test,
    y_valid = y_test,
    path=project_folder,
    preprocess = False,
    whitelist_models = ['LightGBM'],
    blacklist_models = ['SparseNormalizer', 'MaxAbsScaler'],
    enable_ensembling = False,
    enable_cache = False
)

Experiment Output

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   MaxAbsScaler LightGBM                          0:00:26       0.5000    0.5000

Log

2018-12-12 15:43:07,287 - INFO - 436 : [ParentRunID:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd] Local run using input X and y.
2018-12-12 15:43:07,294 - INFO - 441 : [ParentRunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd]SDK dependencies versions:{"azuremlftk": "0.1.18323.5a1", "azureml-widgets": "1.0.2", "azureml-train": "1.0.2", "azureml-train-restclients-hyperdrive": "1.0.2", "azureml-train-core": "1.0.2", "azureml-train-automl": "1.0.2", "azureml-telemetry": "1.0.2", "azureml-sdk": "1.0.2", "azureml-pipeline": "1.0.2", "azureml-pipeline-steps": "1.0.2", "azureml-pipeline-core": "1.0.2", "azureml-explain-model": "1.0.2", "azureml-dataprep": "0.5.3", "azureml-dataprep-native": "11.1.2", "azureml-core": "1.0.2"}.
2018-12-12 15:43:07,418 - INFO - 447 : Parent Run ID: AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd
2018-12-12 15:43:07,418 - INFO - 898 : [ParentRunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd]Input X datatype is <class 'numpy.ndarray'>, shape is (90, 2), datasize is 1552.
2018-12-12 15:43:07,419 - INFO - 898 : [ParentRunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd]Input y datatype is <class 'numpy.ndarray'>, shape is (90,), datasize is 816.
2018-12-12 15:43:07,420 - INFO - 898 : [ParentRunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd]Input X_valid datatype is <class 'numpy.ndarray'>, shape is (10, 2), datasize is 272.
2018-12-12 15:43:07,420 - INFO - 898 : [ParentRunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd]Input y_valid datatype is <class 'numpy.ndarray'>, shape is (10,), datasize is 176.
2018-12-12 15:43:07,421 - INFO - 40 : Pre-processing user data
2018-12-12 15:43:07,425 - INFO - 658 : [YCol]RawFeatureStats:{"date_regex1": "{}", "date_regex2": "{}", "num_unique_vals": "2", "total_number_vals": "90", "lengths": "{}", "num_unique_lens": "1", "column_type": "\"integer\"", "average_entry_length": "1.0", "average_number_spaces": "0.0", "num_na": "{}", "is_datetime": "false", "cardinality_ratio": "0.022222222222222223"}
2018-12-12 15:43:07,430 - INFO - 658 : [XColNum:0]RawFeatureStats:{"date_regex1": "{}", "date_regex2": "{}", "num_unique_vals": "33", "total_number_vals": "90", "lengths": "{}", "num_unique_lens": "1", "column_type": "\"floating\"", "average_entry_length": "4.0", "average_number_spaces": "0.0", "num_na": "{}", "is_datetime": "false", "cardinality_ratio": "0.36666666666666664"}
2018-12-12 15:43:07,433 - INFO - 658 : [XColNum:1]RawFeatureStats:{"date_regex1": "{}", "date_regex2": "{}", "num_unique_vals": "41", "total_number_vals": "90", "lengths": "{}", "num_unique_lens": "3", "column_type": "\"floating\"", "average_entry_length": "3.6333333333333333", "average_number_spaces": "0.0", "num_na": "{}", "is_datetime": "false", "cardinality_ratio": "0.45555555555555555"}
2018-12-12 15:43:07,434 - INFO - 107 : No preprocessing of data to be done here
2018-12-12 15:43:08,071 - INFO - 552 : Start local loop.
2018-12-12 15:43:08,072 - INFO - 555 : Start iteration: 0
2018-12-12 15:43:08,073 - INFO - 595 : Querying Jasmine for next pipeline.
2018-12-12 15:43:09,472 - INFO - 610 : Received pipeline: {"pipeline_id": "7a180470d25072d77c0e4488e70b2eb5bb959acc", "objects": [{"param_args": [], "prepared_kwargs": {}, "module": "sklearn.preprocessing", "spec_class": "preproc", "class_name": "MaxAbsScaler", "param_kwargs": {}}, {"param_args": [], "prepared_kwargs": {}, "module": "automl.client.core.common.model_wrappers", "spec_class": "sklearn", "class_name": "LightGBMClassifier", "param_kwargs": {"boosting_type": "gbdt", "colsample_bytree": 0.1988888888888889, "min_split_gain": 1, "max_bin": 130, "learning_rate": 0.06842421052631578, "n_estimators": 25, "min_child_weight": 4, "reg_lambda": 0.21052631578947367, "max_depth": 10, "reg_alpha": 0.8421052631578947, "min_data_in_leaf": 0.017249655172413794, "subsample": 0.8910526315789474, "num_leaves": 119}}]}
2018-12-12 15:43:09,840 - INFO - 99 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0]CPU logical cores: 4, CPU cores: 2, virtual memory: 17037811712, swap memory: 20393254912.
2018-12-12 15:43:09,842 - INFO - 105 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0]Platform information: Windows-10-10.0.17134-SP0.
2018-12-12 15:43:09,842 - INFO - 330 : None
2018-12-12 15:43:10,048 - INFO - 72 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0][Starting fit_pipeline]memory usage 346,492 K
2018-12-12 15:43:10,049 - INFO - 83 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0][Starting fit_pipeline]cpu time 0:03:19
2018-12-12 15:43:10,050 - INFO - 367 : [ParentRunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0] X datatype is <class 'numpy.ndarray'>, shape is (90, 2), datasize is 1552.
2018-12-12 15:43:10,050 - INFO - 372 : [ParentRunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0] y datatype is <class 'numpy.ndarray'>, shape is (90,), datasize is 816.
2018-12-12 15:43:10,050 - INFO - 378 : [ParentRunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0] X_valid datatype is <class 'numpy.ndarray'>, shape is (10, 2), datasize is 272.
2018-12-12 15:43:10,051 - INFO - 384 : [ParentRunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0] y_valid datatype is <class 'numpy.ndarray'>, shape is (10,), datasize is 176.
2018-12-12 15:43:10,051 - INFO - 392 : Created child run AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0
2018-12-12 15:43:10,221 - INFO - 72 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0][Before preprocess]memory usage 346,492 K
2018-12-12 15:43:10,221 - INFO - 83 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0][Before preprocess]cpu time 0:03:19
2018-12-12 15:43:10,395 - INFO - 72 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0][After preprocess]memory usage 346,492 K
2018-12-12 15:43:10,395 - INFO - 83 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0][After preprocess]cpu time 0:03:19
2018-12-12 15:43:10,568 - INFO - 72 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0][Before executing pipeline]memory usage 346,492 K
2018-12-12 15:43:10,569 - INFO - 83 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0][Before executing pipeline]cpu time 0:03:19
2018-12-12 15:43:10,569 - INFO - 491 : Start executing pipeline {"pipeline_id": "7a180470d25072d77c0e4488e70b2eb5bb959acc", "objects": [{"param_args": [], "prepared_kwargs": {}, "module": "sklearn.preprocessing", "spec_class": "preproc", "class_name": "MaxAbsScaler", "param_kwargs": {}}, {"param_args": [], "prepared_kwargs": {}, "module": "automl.client.core.common.model_wrappers", "spec_class": "sklearn", "class_name": "LightGBMClassifier", "param_kwargs": {"boosting_type": "gbdt", "colsample_bytree": 0.1988888888888889, "min_split_gain": 1, "max_bin": 130, "learning_rate": 0.06842421052631578, "n_estimators": 25, "min_child_weight": 4, "reg_lambda": 0.21052631578947367, "max_depth": 10, "reg_alpha": 0.8421052631578947, "min_data_in_leaf": 0.017249655172413794, "subsample": 0.8910526315789474, "num_leaves": 119}}]}.
2018-12-12 15:43:10,569 - INFO - 494 : Running with the following AutoML settings:
 - name: automl-local-classification
 - subscription_id: ff2e23ae-7d7c-4cbd-99b8-116bb94dca6e
 - iterations: 1
 - primary_metric: AUC_weighted
 - compute_target: local
 - task_type: classification
 - validation_size: 0.0
 - n_cross_validations: None
 - y_min: None
 - y_max: None
 - num_classes: 2
 - preprocess: False
 - lag_length: 0
 - is_timeseries: False
 - max_cores_per_iteration: 2
 - max_concurrent_iterations: 1
 - iteration_timeout_minutes: 200
 - mem_in_mb: None
 - enforce_time_on_windows: True
 - experiment_timeout_minutes: None
 - experiment_exit_score: None
 - whitelist_models: ['LightGBM', 'LightGBMClassifier']
 - blacklist_algos: []
 - auto_blacklist: True
 - blacklist_samples_reached: False
 - exclude_nan_labels: True
 - verbosity: 10
 - show_warnings: False
 - model_explainability: False
 - service_url: None
 - sdk_url: None
 - sdk_packages: None
 - telemetry_verbosity: INFO
 - send_telemetry: True
 - spark_context: None
 - spark_service: None
 - metrics: None
 - enable_ensembling: False
 - ensemble_iterations: None
 - enable_tf: True
 - enable_cache: False
 - enable_subsampling: False
 - subsample_seed: None
 - cost_mode: 0
 - metric_operation: maximize
2018-12-12 15:43:15,574 - INFO - 72 : memory usage 346,532 K
2018-12-12 15:43:15,576 - INFO - 83 : cpu time 0:03:20
2018-12-12 15:43:25,790 - INFO - 72 : memory usage 346,532 K
2018-12-12 15:43:25,800 - INFO - 83 : cpu time 0:03:20
2018-12-12 15:43:30,870 - INFO - 72 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0][After executing pipeline]memory usage 346,532 K
2018-12-12 15:43:30,879 - INFO - 83 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0][After executing pipeline]cpu time 0:03:21
2018-12-12 15:43:30,880 - INFO - 665 : Pipeline execution finished with a score of 0.5
2018-12-12 15:43:30,881 - INFO - 696 : Start logging metrics for child run.
2018-12-12 15:43:31,578 - WARNING - 1288 : Did not recognize metric: train time. Will not log.
2018-12-12 15:43:31,592 - INFO - 1270 : The following metrics have been logged for the child run: {'accuracy': 0.5, 'weighted_accuracy': 0.5, 'norm_macro_recall': 0.0, 'log_loss': 0.6931471805599453, 'confusion_matrix': <class 'dict'>, 'accuracy_table': <class 'dict'>, 'precision_score_micro': 0.5, 'average_precision_score_macro': 0.5, 'f1_score_weighted': 0.3333333333333333, 'recall_score_micro': 0.5, 'f1_score_macro': 0.3333333333333333, 'average_precision_score_micro': 0.5, 'precision_score_macro': 0.25, 'balanced_accuracy': 0.5, 'AUC_micro': 0.5, 'f1_score_micro': 0.5, 'precision_score_weighted': 0.25, 'average_precision_score_weighted': 0.5, 'recall_score_weighted': 0.5, 'recall_score_macro': 0.5, 'AUC_weighted': 0.5, 'AUC_macro': 0.5, 'train time': <class 'float'>}.
2018-12-12 15:43:33,168 - INFO - 72 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0][End fit_pipeline]memory usage 348,156 K
2018-12-12 15:43:33,168 - INFO - 83 : [RunId:AutoML_ae179b53-9230-4538-8d07-2a8f29ae4abd_0][End fit_pipeline]cpu time 0:03:22
2018-12-12 15:43:34,608 - INFO - 588 : Run Complete.
ggupta2005 commented 5 years ago

We have two flavors of pre-processing in AutoML: one is model-independent featurization and the other is model-specific preprocessing. The flag "preprocess=False" disables the model-independent featurization but doesn't affect the model-specific pre-processing.
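
To make the distinction concrete, here is a minimal sketch (not from the thread) of inspecting the fitted pipeline; it assumes local_run is the submitted run from the config above and that fitted_model is a standard sklearn Pipeline, as dataders confirms below:

    # Sketch: inspect the pipeline AutoML hands back. Even with
    # preprocess=False, the model-specific preprocessor is still a step.
    best_run, fitted_model = local_run.get_output()

    # fitted_model is a sklearn Pipeline: the first step should be the
    # model-specific scaler (MaxAbsScaler here), the last the LightGBM wrapper.
    for name, step in fitted_model.steps:
        print(name, type(step).__name__)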

dataders commented 5 years ago

Ah, I see now. I incorrectly assumed that the fitted_model returned by

best_run, fitted_model = remote_run.get_output()

would be a bare sklearn model object; instead it is a model and a preprocessing step wrapped into a Pipeline.

Is there a way to do all the necessary pre-processing outside of the experiment and create an AutoML experiment that just iterates through algorithms?
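
One hedged sketch of what that could look like, assuming AutoML accepts an already-featurized matrix like any other numpy input (an assumption, not a confirmed workflow):

    # Sketch: featurize up front with sklearn and pass the prepared matrices
    # to AutoMLConfig as X / X_valid with preprocess=False.
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().fit(X_train)
    X_train_prepped = scaler.transform(X_train)
    X_test_prepped = scaler.transform(X_test)

    # Per ggupta2005's reply above, AutoML may still prepend its own
    # model-specific scaler to each candidate pipeline.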

dataders commented 5 years ago

My other intention is to make force plots using shap, which, it seems, already constitutes the majority of the model_explainer functionality.

ggupta2005 commented 5 years ago

Let me find out if there is a way to use AutoML pre-processing outside of AutoML, feed the featurized data into AutoML, and train over the featurized data using preprocess=False.

For the second query, you could try setting model_explainability to True for model explanations.
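
For reference, a minimal sketch of that suggestion; model_explainability is visible in the run settings logged above, so flipping it in the config from the issue body should look roughly like this:

    # Sketch: same config as in the issue body, with explainability enabled.
    automl_config = AutoMLConfig(
        task = 'classification',
        primary_metric = 'AUC_weighted',
        iterations = 1,
        X = X_train,
        y = y_train,
        X_valid = X_test,
        y_valid = y_test,
        preprocess = False,
        model_explainability = True,  # SDK-side shap-based explanations
    )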

dataders commented 5 years ago

That would be helpful, thanks. Doing all the necessary pre-processing before running AutoML was our plan all along, but looking at the source code it seems unlikely to be supported.

If it is indeed not possible, my ask is this: a way to parse out the actual model from the fitted_model object (which is actually a Pipeline object).

best_run, fitted_model = local_run.get_output()

The reason is that I want to create SHapley Additive exPlanation plots using the shap package. The package cannot currently interpret the output that AutoML provides.

The irony is that the azureml-sdk actually imports shap and uses it to provide explanations via the model_explainability argument. My ask is to be able to extend the explanations provided to make plots (see below).

Shap example
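
A hedged sketch of the extraction being asked for; it assumes the last Pipeline step is the estimator and that shap's TreeExplainer can consume the underlying LightGBM model (the `model` attribute name on the AutoML wrapper is hypothetical):

    import shap

    best_run, fitted_model = local_run.get_output()

    # The last step of the sklearn Pipeline should be the estimator;
    # everything before it is preprocessing.
    estimator = fitted_model.steps[-1][1]

    # AutoML wraps LightGBM in its own LightGBMClassifier; the raw booster
    # may sit on an attribute such as `model` (hypothetical name).
    raw_model = getattr(estimator, 'model', estimator)

    explainer = shap.TreeExplainer(raw_model)
    shap_values = explainer.shap_values(X_test)

    ev = explainer.expected_value
    sv = shap_values
    # For classifiers these are often per-class lists; pick one class.
    if isinstance(ev, (list, tuple)):
        ev, sv = ev[0], shap_values[0]
    shap.force_plot(ev, sv, X_test)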

laurentiuamitroaie commented 4 years ago

Hi ggupta2005, you did not respond to this one:

Let me find out if there is a way to use AutoML pre-processing outside of AutoML, feed the featurized data into AutoML, and train over the featurized data using preprocess=False.

dataders commented 4 years ago

@laurentiuamitroaie, I spoke with the AutoML featurization team; I believe this is something they are considering. I'll ask the person I spoke with to respond here.