jimthompson5802 opened 4 years ago
A key requirement is to provide backward compatibility.
Additional details on the proposal. The signature of the `optimise()` method of the `Optimiser` class will change to:

```python
def optimise(self, space, df, max_evals=40, mlflow_parms=None)
```

The new parameter `mlflow_parms` controls the mlflow behavior. Its default is `None`, which provides backward compatibility, i.e. `optimise()` operates the same as it did before the mlflow integration.

If the mlflow functionality is desired, the call to `optimise()` will look like this:

```python
mlflow_parms = {'tracking_uri': './mlruns',
                'experiment_name': 'mlfow_pytest_experiment'}
best = opt.optimise(space, dict, 4, mlflow_parms=mlflow_parms)
```

The mlflow-related parameters are:
- `tracking_uri`: the mlflow tracking URI for storing experiment/run data
- `experiment_name`: the mlflow experiment name
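The backward-compatibility behaviour can be sketched as below. This is a hypothetical illustration, not MLBox's actual implementation: the sampling and scoring are stand-ins, and the `_recorder` argument is a test seam standing in for real `mlflow.start_run()` / `mlflow.log_param()` / `mlflow.log_metric()` calls, which would sit behind the same `mlflow_parms is not None` check.

```python
def optimise(space, df, max_evals=40, mlflow_parms=None, _recorder=None):
    """Sketch: evaluate hyperparameter combinations; record each evaluation
    only when mlflow_parms is supplied (the default None changes nothing)."""
    best = {"score": float("-inf"), "params": None}
    for trial in range(max_evals):
        params = {name: trial for name in space}   # stand-in for hyperopt sampling
        score = -abs(trial - 2)                    # stand-in for cross-validation score
        if mlflow_parms is not None and _recorder is not None:
            # In the real integration this is where mlflow.set_tracking_uri(),
            # start_run(), log_param() and log_metric() would be invoked.
            _recorder.append({"params": params, "score": score})
        if score > best["score"]:
            best = {"score": score, "params": params}
    return best

runs = []
# Default call: identical behaviour to pre-mlflow MLBox; nothing is recorded.
best = optimise({"est__max_depth": None}, df=None, max_evals=4)
assert runs == []

# With mlflow_parms supplied, every evaluation is recorded as a run.
optimise({"est__max_depth": None}, df=None, max_evals=4,
         mlflow_parms={"tracking_uri": "./mlruns"}, _recorder=runs)
assert len(runs) == 4
```

The key design point is that all mlflow calls are guarded by a single check, so existing callers see no change in behaviour or dependencies.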
Below is example output from a working prototype of integrating mlflow with mlbox:
```python
space = {'ne__numerical_strategy': {"search": "choice", "space": [0]},
         'ce__strategy': {"search": "choice",
                          "space": ["label_encoding"]},
         'fs__threshold': {"search": "uniform",
                           "space": [0.01, 0.3]},
         'est__max_depth': {"search": "choice",
                            "space": [3, 4, 5, 6, 7]}
         }

mlflow_parms = {'tracking_uri': './mlruns',
                'experiment_name': 'mlfow_pytest_experiment'}

best = opt.optimise(space, dict, 4, mlflow_parms=mlflow_parms)
```
Example of mlflow UI for results of the hyper-parameter tuning:
Example of details captured for a single run:
Example of comparing runs:
Sample chart generated from the hyperparameter tuning:
CSV extract of hyperparameter tuning results: hyperparameter_runs.xlsx
Hello @jimthompson5802, thanks for this nice feature! It could be great; I will have a look. It would be better if the user does not have to set any parameters (`mlflow_parms` must have default values and be hidden from the user...). Also, what about the mlflow requirements? Python versions? Dependencies with mlbox? Thanks! Axel
@AxeldeRomblay Thank you for the feedback.
I understand that you'd like to hide mlflow as much as possible from the user. This should be doable. One implication of this approach: instead of using the mlflow web UI, which requires some knowledge of how mlflow works, to view the results of an `optimise()` run, we may need a new method on the `Optimiser` class to extract the results into a pandas dataframe, something like `extract_optimise_results()`. I'll proceed along these lines.

If you'd like to see an early preview of this work, it is in my fork of your repo, in the branch `integrate_mlflow`.
Re: Python requirements for mlflow. mlflow runs under Python 2.7 and Python 3.x. These are the package dependencies from mlflow's `setup.py`:
```python
install_requires=[
    'alembic',
    'click>=7.0',
    'cloudpickle',
    'databricks-cli>=0.8.7',
    'requests>=2.17.3',
    'six>=1.10.0',
    'waitress; platform_system == "Windows"',
    'gunicorn; platform_system != "Windows"',
    'Flask',
    'numpy',
    'pandas',
    'python-dateutil',
    'protobuf>=3.6.0',
    'gitpython>=2.1.0',
    'pyyaml',
    'querystring_parser',
    'simplejson',
    'docker>=4.0.0',
    'entrypoints',
    'sqlparse',
    'sqlalchemy',
    'gorilla',
],
extras_require={
    'extras': [
        "scikit-learn; python_version >= '3.5'",
        # scikit-learn 0.20 is the last version to support Python 2.x & Python 3.4.
        "scikit-learn==0.20; python_version < '3.5'",
        'boto3>=1.7.12',
        'mleap>=0.8.1',
        'azure-storage',
        'google-cloud-storage',
    ],
},
```
@AxeldeRomblay
This is an initial attempt at hiding the mlflow integration within mlbox.
This code fragment from the pytest module testing the mlflow integration shows the sequence of API calls:
```python
space = {'ne__numerical_strategy': {"search": "choice", "space": [0]},
         'ce__strategy': {"search": "choice",
                          "space": ["label_encoding"]},
         'fs__threshold': {"search": "uniform",
                           "space": [0.01, 0.3]},
         'est__max_depth': {"search": "choice",
                            "space": [3, 4, 5, 6, 7]}
         }

best = opt.optimise(space, dict, 4)

# test for existence of data stored by mlflow
assert os.path.exists('./save/mlflow_tracking')
assert os.path.exists('./save/mlflow_tracking/0')
assert os.path.isfile('./save/mlflow_tracking/0/meta.yaml')
assert len(os.listdir('./save/mlflow_tracking/0')) == 5

# create pandas dataframe containing mlflow-captured data
hyp_df = opt.extract_optimise_results()
assert isinstance(hyp_df, pd.core.frame.DataFrame)
assert hyp_df.shape == (4, 36)

# save pandas dataframe
hyp_df.to_csv('./save/mlflow_data.csv', index=False)
assert os.path.isfile('./save/mlflow_data.csv')
```
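A hedged sketch of what `extract_optimise_results()` could do internally, assuming mlflow's file-based tracking store layout in which each run directory holds one file per logged parameter (the file's content being the parameter value). The `extract_runs` helper is hypothetical, and the directory tree is built by hand here so the example is self-contained:

```python
import os
import tempfile

def extract_runs(tracking_dir, experiment_id="0"):
    """Walk an mlflow file-store experiment directory and collect one
    row (dict) per run from the files under each run's params/ folder."""
    rows = []
    exp_dir = os.path.join(tracking_dir, experiment_id)
    for run_id in sorted(os.listdir(exp_dir)):
        params_dir = os.path.join(exp_dir, run_id, "params")
        if not os.path.isdir(params_dir):
            continue  # skip meta.yaml and other non-run entries
        row = {"run_id": run_id}
        for name in os.listdir(params_dir):
            with open(os.path.join(params_dir, name)) as f:
                row[name] = f.read().strip()
        rows.append(row)
    return rows

# Build a miniature tracking directory mimicking the assumed layout.
with tempfile.TemporaryDirectory() as d:
    run_params = os.path.join(d, "0", "abc123", "params")
    os.makedirs(run_params)
    with open(os.path.join(run_params, "est__max_depth"), "w") as f:
        f.write("5")
    rows = extract_runs(d)

print(rows)  # [{'run_id': 'abc123', 'est__max_depth': '5'}]
```

In the real method, the resulting rows would be handed to `pandas.DataFrame(rows)` to produce the dataframe asserted on in the test above.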
This is the Excel workbook created from the CSV data extracted from the pandas dataframe of mlflow-captured data from the `optimise()` run:
mlflow_data.xlsx
@AxeldeRomblay
Here is a status update on the work.
I understand a key requirement is to hide the mlflow API from someone using mlbox. While this is possible, I believe it will be useful to provide visibility into the mlflow concepts of experiments and runs.
Briefly, an 'experiment' is a collection of related 'runs', where a 'run' represents a single execution of an ML algorithm with specific hyperparameter settings. In this sense, `opt.optimise(space, ...)` equates to an experiment, and the hyperparameter combinations represented by the `space` dictionary are the runs.
To support this design objective, I've modified the `optimise()` method as follows:

```python
def optimise(self, space, df, max_evals=40, record_experiment=None)
```

If `record_experiment == None`, then `optimise()` does not invoke any mlflow functions; it behaves as it currently does.

If `record_experiment` is set to a string, which becomes the mlflow experiment name, then each combination of hyperparameters specified in the `space` dictionary is recorded as an mlflow run.
A new method has been added to the `Optimiser` class that creates a pandas dataframe containing the run data from an `optimise()` experiment:

```python
def extract_optimise_results(self, experiment_name=None)
```

`experiment_name` is a string that must match the value of `record_experiment` in a prior call to `optimise(..., record_experiment=...)`.
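The proposed calling sequence can be sketched with a small stub that mirrors the API shape described above. Everything here is hypothetical (the run contents, the `"titanic_tuning"` experiment name, and the in-memory storage stand in for mlflow); only the method signatures follow the proposal:

```python
class Optimiser:
    """Stub mirroring the proposed API: record_experiment=None disables
    mlflow entirely; a string enables per-run recording under that name."""

    def __init__(self):
        self._experiments = {}  # experiment name -> list of run dicts

    def optimise(self, space, df, max_evals=40, record_experiment=None):
        runs = []
        for i in range(max_evals):
            # Stand-in for one hyperparameter combination being evaluated.
            runs.append({"run": i, "space_keys": sorted(space)})
        if record_experiment is not None:
            self._experiments[record_experiment] = runs
        return runs[0]

    def extract_optimise_results(self, experiment_name=None):
        if experiment_name not in self._experiments:
            raise KeyError("no recorded experiment named %r" % experiment_name)
        return self._experiments[experiment_name]

opt = Optimiser()
space = {"est__max_depth": {"search": "choice", "space": [3, 4, 5]}}

opt.optimise(space, df=None, max_evals=4)              # nothing recorded
opt.optimise(space, df=None, max_evals=4,
             record_experiment="titanic_tuning")       # 4 runs recorded

results = opt.extract_optimise_results("titanic_tuning")
assert len(results) == 4
```

The point of the pairing is that `extract_optimise_results()` only makes sense after an `optimise()` call that recorded under the same experiment name, which is why both methods take the name as their link.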
The attached zip file, mlbox_with_mlflow.html.zip, contains the HTML output of a Jupyter notebook demonstrating these functions. The notebook is based on a Kaggle kernel you created that demonstrated mlbox.
Integrate mlflow metric-tracking capabilities with mlbox to track hyperparameter tuning results. Specifically, use mlflow to record results from `opt.optimise(space, df)`. By using mlflow to track these kinds of results, we are able to use the mlflow web tracking UI to review and report on them. This capability is demonstrated in this mlflow talk.

The current proposal is to record each

```
### testing hyper-parameters...###
```

as an mlflow experiment and runs that record the model algorithm, hyperparameter settings, and resulting metric. Does this sound like a reasonable addition to mlbox capabilities?