jimthompson5802 opened 4 years ago
A key requirement is to provide backward compatibility.
Additional details on the proposal. The signature of the `optimise()` method of the `Optimiser` class will change to:

```python
def optimise(self, space, df, max_evals=40, mlflow_parms=None)
```

The new parameter `mlflow_parms` controls the mlflow behavior. Its default is `None`, which provides backward compatibility, i.e. `optimise()` operates the same as it did before the mlflow integration.

If the mlflow functionality is desired, the call to `optimise()` will look like this:

```python
mlflow_parms = {'tracking_uri': './mlruns',
                'experiment_name': 'mlfow_pytest_experiment'}
best = opt.optimise(space, dict, 4, mlflow_parms=mlflow_parms)
```

The mlflow-related parameters are:
- `tracking_uri`: the mlflow tracking URI for storing experiment/run data
- `experiment_name`: the mlflow experiment name
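The backward-compatibility behaviour can be sketched as below. This is a hypothetical illustration, not MLBox's actual implementation: the sampling and scoring are stand-ins, and the `_recorder` argument is a test seam standing in for real `mlflow.start_run()` / `mlflow.log_param()` / `mlflow.log_metric()` calls, which would sit behind the same `mlflow_parms is not None` check.

```python
def optimise(space, df, max_evals=40, mlflow_parms=None, _recorder=None):
    """Sketch: evaluate hyperparameter combinations; record each evaluation
    only when mlflow_parms is supplied (the default None changes nothing)."""
    best = {"score": float("-inf"), "params": None}
    for trial in range(max_evals):
        params = {name: trial for name in space}   # stand-in for hyperopt sampling
        score = -abs(trial - 2)                    # stand-in for cross-validation score
        if mlflow_parms is not None and _recorder is not None:
            # In the real integration this is where mlflow.set_tracking_uri(),
            # start_run(), log_param() and log_metric() would be invoked.
            _recorder.append({"params": params, "score": score})
        if score > best["score"]:
            best = {"score": score, "params": params}
    return best

runs = []
# Default call: identical behaviour to pre-mlflow MLBox; nothing is recorded.
best = optimise({"est__max_depth": None}, df=None, max_evals=4)
assert runs == []

# With mlflow_parms supplied, every evaluation is recorded as a run.
optimise({"est__max_depth": None}, df=None, max_evals=4,
         mlflow_parms={"tracking_uri": "./mlruns"}, _recorder=runs)
assert len(runs) == 4
```

The key design point is that all mlflow calls are guarded by a single check, so existing callers see no change in behaviour or dependencies.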
Below is example output from a working prototype of integrating mlflow with mlbox:
```python
space = {'ne__numerical_strategy': {"search": "choice", "space": [0]},
         'ce__strategy': {"search": "choice",
                          "space": ["label_encoding"]},
         'fs__threshold': {"search": "uniform",
                           "space": [0.01, 0.3]},
         'est__max_depth': {"search": "choice",
                            "space": [3, 4, 5, 6, 7]}
         }

mlflow_parms = {'tracking_uri': './mlruns',
                'experiment_name': 'mlfow_pytest_experiment'}

best = opt.optimise(space, dict, 4, mlflow_parms=mlflow_parms)
```
Example of mlflow UI for results of the hyper-parameter tuning:
Example of details captured for a single run:
Example of comparing runs:
Sample chart generated from the hyperparameter tuning:
CSV extract of hyperparameter tuning results: hyperparameter_runs.xlsx
Hello @jimthompson5802, thanks for this nice feature! It could be great; I will have a look. It would be better if the user does not have to set any parameters (`mlflow_parms` must have default values and be hidden from the user...). Also, what about the mlflow requirements? Python versions? Dependencies with mlbox? Thanks! Axel
@AxeldeRomblay Thank you for the feedback.
I understand that you'd like to hide mlflow as much as possible from the user. This should be doable. One implication of this approach: instead of using the mlflow web UI, which requires some knowledge of how mlflow works, to view the results of an `optimise()` run, we may need a new method on the `Optimiser` class to extract the results into a pandas dataframe, something like `extract_optimise_results()`. I'll proceed along these lines.

If you'd like to see an early preview of this work, it is in my fork of your repo, in the branch `integrate_mlflow`.
Re: Python requirements for mlflow. mlflow runs under Python 2.7 and Python 3.x. These are the package dependencies from mlflow's `setup.py`:
```python
install_requires=[
    'alembic',
    'click>=7.0',
    'cloudpickle',
    'databricks-cli>=0.8.7',
    'requests>=2.17.3',
    'six>=1.10.0',
    'waitress; platform_system == "Windows"',
    'gunicorn; platform_system != "Windows"',
    'Flask',
    'numpy',
    'pandas',
    'python-dateutil',
    'protobuf>=3.6.0',
    'gitpython>=2.1.0',
    'pyyaml',
    'querystring_parser',
    'simplejson',
    'docker>=4.0.0',
    'entrypoints',
    'sqlparse',
    'sqlalchemy',
    'gorilla',
],
extras_require={
    'extras': [
        "scikit-learn; python_version >= '3.5'",
        # scikit-learn 0.20 is the last version to support Python 2.x & Python 3.4.
        "scikit-learn==0.20; python_version < '3.5'",
        'boto3>=1.7.12',
        'mleap>=0.8.1',
        'azure-storage',
        'google-cloud-storage',
    ],
},
```
@AxeldeRomblay
This is an initial attempt at hiding the mlflow integration within mlbox.
This code fragment from the pytest module testing the mlflow integration shows the sequence of API calls:
```python
space = {'ne__numerical_strategy': {"search": "choice", "space": [0]},
         'ce__strategy': {"search": "choice",
                          "space": ["label_encoding"]},
         'fs__threshold': {"search": "uniform",
                           "space": [0.01, 0.3]},
         'est__max_depth': {"search": "choice",
                            "space": [3, 4, 5, 6, 7]}
         }

best = opt.optimise(space, dict, 4)

# test for existence of data stored by mlflow
assert os.path.exists('./save/mlflow_tracking')
assert os.path.exists('./save/mlflow_tracking/0')
assert os.path.isfile('./save/mlflow_tracking/0/meta.yaml')
assert len(os.listdir('./save/mlflow_tracking/0')) == 5

# create pandas dataframe containing mlflow-captured data
hyp_df = opt.extract_optimise_results()
assert isinstance(hyp_df, pd.core.frame.DataFrame)
assert hyp_df.shape == (4, 36)

# save pandas dataframe
hyp_df.to_csv('./save/mlflow_data.csv', index=False)
assert os.path.isfile('./save/mlflow_data.csv')
```
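A hedged sketch of what `extract_optimise_results()` could do internally, assuming mlflow's file-based tracking store layout in which each run directory holds one file per logged parameter (the file's content being the parameter value). The `extract_runs` helper is hypothetical, and the directory tree is built by hand here so the example is self-contained:

```python
import os
import tempfile

def extract_runs(tracking_dir, experiment_id="0"):
    """Walk an mlflow file-store experiment directory and collect one
    row (dict) per run from the files under each run's params/ folder."""
    rows = []
    exp_dir = os.path.join(tracking_dir, experiment_id)
    for run_id in sorted(os.listdir(exp_dir)):
        params_dir = os.path.join(exp_dir, run_id, "params")
        if not os.path.isdir(params_dir):
            continue  # skip meta.yaml and other non-run entries
        row = {"run_id": run_id}
        for name in os.listdir(params_dir):
            with open(os.path.join(params_dir, name)) as f:
                row[name] = f.read().strip()
        rows.append(row)
    return rows

# Build a miniature tracking directory mimicking the assumed layout.
with tempfile.TemporaryDirectory() as d:
    run_params = os.path.join(d, "0", "abc123", "params")
    os.makedirs(run_params)
    with open(os.path.join(run_params, "est__max_depth"), "w") as f:
        f.write("5")
    rows = extract_runs(d)

print(rows)  # [{'run_id': 'abc123', 'est__max_depth': '5'}]
```

In the real method, the resulting rows would be handed to `pandas.DataFrame(rows)` to produce the dataframe asserted on in the test above.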
This is the Excel workbook created from the CSV data extracted from the pandas dataframe of mlflow-captured data from the `optimise()` run:
mlflow_data.xlsx
@AxeldeRomblay
Here is a status update on the work.
I understand a key requirement is to hide the mlflow API from someone using mlbox. While this is possible, I believe it will be useful to provide visibility into the mlflow concepts of experiments and runs.
Briefly, an 'experiment' is a collection of related 'runs', where a 'run' represents a single execution of an ML algorithm with specific hyperparameter settings. In this sense, `opt.optimise(space, ...)` equates to an experiment, and the hyperparameter combinations represented by the `space` dictionary are the runs.
To support this design objective, I've modified the `optimise()` method as follows:

```python
def optimise(self, space, df, max_evals=40, record_experiment=None)
```

If `record_experiment == None`, then `optimise()` does not invoke any mlflow functions; it behaves as it currently does.

If `record_experiment` is set to a string, which becomes the mlflow experiment name, then each combination of hyperparameters specified in the `space` dictionary is recorded as an mlflow run.
A new method has been added to the `Optimiser` class that creates a pandas dataframe containing the run data from an `optimise()` experiment:

```python
def extract_optimise_results(self, experiment_name=None)
```

`experiment_name` is a string that must match the value of `record_experiment` in a prior call to `optimise(..., record_experiment=...)`.
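The proposed calling sequence can be sketched with a small stub that mirrors the API shape described above. Everything here is hypothetical (the run contents, the `"titanic_tuning"` experiment name, and the in-memory storage stand in for mlflow); only the method signatures follow the proposal:

```python
class Optimiser:
    """Stub mirroring the proposed API: record_experiment=None disables
    mlflow entirely; a string enables per-run recording under that name."""

    def __init__(self):
        self._experiments = {}  # experiment name -> list of run dicts

    def optimise(self, space, df, max_evals=40, record_experiment=None):
        runs = []
        for i in range(max_evals):
            # Stand-in for one hyperparameter combination being evaluated.
            runs.append({"run": i, "space_keys": sorted(space)})
        if record_experiment is not None:
            self._experiments[record_experiment] = runs
        return runs[0]

    def extract_optimise_results(self, experiment_name=None):
        if experiment_name not in self._experiments:
            raise KeyError("no recorded experiment named %r" % experiment_name)
        return self._experiments[experiment_name]

opt = Optimiser()
space = {"est__max_depth": {"search": "choice", "space": [3, 4, 5]}}

opt.optimise(space, df=None, max_evals=4)              # nothing recorded
opt.optimise(space, df=None, max_evals=4,
             record_experiment="titanic_tuning")       # 4 runs recorded

results = opt.extract_optimise_results("titanic_tuning")
assert len(results) == 4
```

The point of the pairing is that `extract_optimise_results()` only makes sense after an `optimise()` call that recorded under the same experiment name, which is why both methods take the name as their link.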
The attached zip file, mlbox_with_mlflow.html.zip, contains the HTML output of a Jupyter notebook demonstrating these functions. The notebook is based on a Kaggle kernel you created that demonstrated mlbox.
Integrate mlflow metric-tracking capabilities with mlbox to track hyperparameter tuning results. Specifically, use mlflow to record results from `opt.optimise(space, df)`. By using mlflow to track these kinds of results, we are able to use the mlflow web tracking UI to review and report on them. This capability is demonstrated in this mlflow talk.

The current proposal is to record each

```
### testing hyper-parameters...###
```

as an mlflow experiment and runs that record the model algorithm, hyperparameter settings, and resulting metric. Does this sound like a reasonable addition to mlbox capabilities?