EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Exporting pipelines to PMML/PFA #152

Open jln-ho opened 8 years ago

jln-ho commented 8 years ago

I'm currently doing research in the area of model/pipeline persistence and came across PMML. It's basically an XML schema that lets you define all sorts of data mining/machine learning processes for both persistence and interoperability. It was specifically designed to decouple the tools that are used to generate pipelines from the tools that are used to apply them.

There are several libraries by openscoring.io that can be used to export pipelines from popular environments such as sklearn, Apache Spark MLlib, R or XGBoost. Their counterparts allow for evaluation (i.e. execution) of said exported pipelines in e.g. plain Java, in a Spark context, an Android context or even in a database context such as PostgreSQL. The most interesting library for TPOT in particular should be sklearn2pmml, a Python wrapper around jpmml-sklearn, which is written in Java and converts pickled pipelines to PMML (talk about dependency hell).

I think that PMML (or its successor in the making, PFA) would be a great format to use for persisting pipelines generated with TPOT, as most people using TPOT will want to deploy the models "found" by it to some other platform. At least there seems to be some demand for persisting models in general, according to issues #2, #11, #51 and #65. Some of them suggest using Python's pickle format, but I think a dedicated, platform-independent solution should always be preferred. Not to mention the security issues that come with pickle; that's a whole other story.

Excited to hear your thoughts on this!

rhiever commented 8 years ago

Interesting idea. How popular is this format? XML seems a bit outdated. (I thought JSON replaced XML everywhere.)

jln-ho commented 8 years ago

The format was first specified in 1998, hence the use of XML. PFA, PMML's successor, will be based on JSON and will provide quite a lot more flexibility. It's almost a high-level programming language actually. However, it is still in the making (IIRC the first draft of the specification was published towards the end of last year) and therefore not really being used in production by more than a handful of people.

PMML, on the other hand, is being used quite a lot, and I don't think it's going away even if PFA picks up speed over the next couple years. Here is a list of projects/frameworks that use/support PMML (the list can't be complete, though, because there is no mention of Pattern).

rhiever commented 8 years ago

That makes sense about XML then. :-)

This is something that we'll have to explore more before committing to. I'm primarily interested in finding out:

1) Can it support the arbitrary pipeline structures that TPOT may create?

2) Does it support all operators in sklearn?

3) Is there any nice visualization software that can read and visualize ML pipelines in PMML? (Related to #51)

jln-ho commented 8 years ago

I guess 1) and 2) can be answered with a straight no for PMML, although I think it would still be beneficial to support it with only a subset of TPOT pipelines being actually exportable. PFA should meet the requirements, though, at least on paper. I can't say how easy or hard it will be to come up with a generic way of exporting an intricate TPOT pipeline to PFA without manual work. As of now the only open source PFA tool I could find, namely Hadrian/Titus, doesn't really focus on converting models from existing frameworks, but rather on standalone model construction (example).

As for 3), I haven't been able to find any visualization tools so far. There are a few mentions of a tool that was developed during a research project, but it seems like it was never made available to the public...

Seems to me that we may have to wait a while for better PFA tooling to emerge in order to meet all the requirements.

rhiever commented 8 years ago

Agreed. In the meantime, we're still interested in #51 and being able to export TPOT pipelines to Orange. Seems like that would be quite useful.

vruusmann commented 8 years ago

As a contributor to JPMML-family of projects, here's my perspective.

Fundamentally, PMML is about capturing the final state of a model development workflow (not the workflow itself). In other words, the final state is the winning solution, which is appropriate for scoring data in production environment (where the goal is not to replay model development procedure, but to apply the model to new data records).

I think PFA is also more about capturing the state rather than the process. So, you should be focusing more on domain-specific languages that address workflow markup.

JPMML-SkLearn supports the conversion of over 50 Scikit-Learn transformers and estimators. The only class that couldn't be represented in "native" PMML (but could be put in action using Java user-defined function) has been sklearn.decomposition.NMF. So, if TPOT is relying on regular Scikit-Learn classes, then it's probably fairly straightforward to implement a converter to it.

Anyway, if you have any pointers for getting started with TPOT (something to do with the Iris dataset?), then I wouldn't mind giving a shot at implementing some PMML interoperability.

rhiever commented 8 years ago

Thank you @vruusmann! TPOT is built almost entirely on top of sklearn classes, with the exception of XGBoost and a couple custom feature preprocessors. However, we are dropping XGBoost due to installation issues for many users, so that will not be an issue in the next version.

Here are some resources that will help you get started with TPOT:

jln-ho commented 8 years ago

giving a shot at implementing some PMML interoperability.

@vruusmann that would be great! Thanks in advance for the effort.

vruusmann commented 8 years ago

You could check out WhizzML, which is oriented towards representing ML workflows. Of course, WhizzML is a very new thing, and by supporting it you would become interoperable only with BigML's platform.

But it should be interesting reference material nonetheless.

jln-ho commented 8 years ago

@rhiever are you sure that commit fixes this issue or was that a typo?

rhiever commented 8 years ago

That did appear to be a typo.

esanchezSavvyds commented 6 years ago

Hi @rhiever @vruusmann @jln-ho, I'm trying to export my TPOT model to PMML XML using jpmml-sklearn, but I'm going crazy trying to get it to work. Is there any way to do it? If not, how can I export my TPOT model so that I can use it later in Java? Thank you all, I'm looking forward to your response.

rhiever commented 6 years ago

If you use TPOT's export function, you can export the code to a scikit-learn pipeline. From there, whatever process you use to convert scikit-learn pipelines to PMML xml should work fine (as long as it supports the scikit-learn Pipeline object).

esanchezSavvyds commented 6 years ago

@rhiever Thank you very much for your time. I'm new to this world. How can I do that? With the export function I create a .py file. In that file there is a read_csv call, and I don't understand what it does. How can I get the scikit-learn pipeline from that file?

weixuanfu commented 6 years ago

@esanchezSavvyds The read_csv function reads the input dataset. You may need to change the file path string 'PATH/TO/DATA/FILE' in that call to your dataset's path, and also change COLUMN_SEPARATOR to match the dataset.

You may find a line in the .py file that starts with exported_pipeline, which is the scikit-learn pipeline. For example:

exported_pipeline = make_pipeline(
    SelectPercentile(score_func=f_classif, percentile=65),
    DecisionTreeClassifier(criterion="gini", max_depth=7, min_samples_leaf=4, min_samples_split=18)
)

esanchezSavvyds commented 6 years ago

Thank you for your fast answer @weixuanfu. My aim is to automate model generation with TPOT and export the result to PMML automatically. For that reason, I need to access the scikit-learn pipeline at execution time in my code. Is that possible, or is the only option to extract it from that file manually?

weixuanfu commented 6 years ago

@esanchezSavvyds Please check the TPOT API. I think the fitted_pipeline_ attribute (e.g. tpot_object.fitted_pipeline_) is what you need.
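A minimal sketch of how that attribute could feed a PMML export at run time, assuming the sklearn2pmml package and a Java runtime are installed. The helper name export_fitted_pipeline and the file name are made up for illustration. make_pmml_pipeline and sklearn2pmml are the conversion functions that package provides as far as I know, so check them against the version you have installed.

```python
def export_fitted_pipeline(tpot_estimator, pmml_path):
    """Write the best pipeline of a fitted TPOT estimator to a PMML file.

    Hypothetical helper; it is not part of TPOT or sklearn2pmml.
    """
    # Imported lazily so this module still loads where sklearn2pmml is absent.
    from sklearn2pmml import make_pmml_pipeline, sklearn2pmml

    # fitted_pipeline_ is the scikit-learn Pipeline that TPOT settled on in fit().
    pmml_pipeline = make_pmml_pipeline(tpot_estimator.fitted_pipeline_)
    sklearn2pmml(pmml_pipeline, pmml_path)
    return pmml_path


# Intended use (needs TPOT, sklearn2pmml and Java):
#   tpot_object = TPOTClassifier(generations=5, population_size=20)
#   tpot_object.fit(X_train, y_train)
#   export_fitted_pipeline(tpot_object, "tpot_pipeline.pmml")
```

Whether the conversion succeeds still depends on every step of the pipeline being supported by the converter.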

esanchezSavvyds commented 6 years ago

@weixuanfu I apologize for not seeing that earlier. That's what I need. Thank you very much for your work.

weixuanfu commented 6 years ago

@esanchezSavvyds Cool, good to know it solved the issue. No need to apologize.

esanchezSavvyds commented 6 years ago

@weixuanfu The problems have returned. I'm having trouble exporting the pipeline to PMML when TPOT generates a model that uses StackingEstimator, because the converter says it's not a supported transformation. What can I do? Is it possible to avoid it? Thank you

rhiever commented 6 years ago

It's probable that the StackingEstimator is not supported in PMML, as it's a custom class outside of scikit-learn that's implemented by @rasbt. PMML would need to be extended to support estimator stacking.

You can look at alternatives to the best-scoring pipeline in the pareto_front_fitted_pipelines_ attribute of TPOT, accessed with tpot_object.pareto_front_fitted_pipelines_. That attribute should contain multiple possible solution pipelines for your problem, ranging from more complex and high-scoring to less complex and slightly-lower-scoring. Perhaps one of the less complex pipelines won't have the StackingEstimator.
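A rough sketch of that filtering step, assuming pareto_front_fitted_pipelines_ is a dictionary keyed by each pipeline's string representation (that is how I read the TPOT docs, which also suggest it is only populated at verbosity=3). The helper name and the example keys are invented for illustration.

```python
def pipelines_without(pareto_front, unwanted=("StackingEstimator",)):
    """Return the Pareto-front entries whose description avoids `unwanted` names."""
    return {
        description: pipeline
        for description, pipeline in pareto_front.items()
        if not any(name in description for name in unwanted)
    }


# Stand-in for tpot_object.pareto_front_fitted_pipelines_; the real values
# would be fitted scikit-learn Pipeline objects rather than strings.
example_front = {
    "KNeighborsClassifier(input_matrix, n_neighbors=10)": "pipeline A",
    "LogisticRegression(StackingEstimator(GaussianNB(input_matrix)), C=1.0)": "pipeline B",
}

print(list(pipelines_without(example_front).values()))  # ['pipeline A']
```

Matching on the description string is a shortcut. Inspecting the steps of each fitted pipeline would be more robust.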

rasbt commented 6 years ago

I am happy about PRs to the StackingClassifier & StackingCVClassifier fixing this; however, it sounds like it's rather due to PMML? In that case, I am curious, does the VotingClassifier from scikit-learn work? I implemented the VotingClassifier quite similarly (however, it does not have the level-2/meta estimator).

weixuanfu commented 6 years ago

Old versions (< v0.7.3) of TPOT used VotingClassifier, but it caused issue #457 for stacking regressors in TPOTRegressor, so we added StackingEstimator to solve that issue.

@esanchezSavvyds could you please try to export the pipeline below to PMML?

exported_pipeline = make_pipeline(
    make_union(
        make_union(VotingClassifier([('branch',
            DecisionTreeClassifier(criterion="gini", max_depth=8, min_samples_leaf=5, min_samples_split=5)
        )]), FunctionTransformer(lambda X: X)),
        SelectKBest(score_func=f_classif, k=20)
    ),
    KNeighborsClassifier(n_neighbors=10, p=1, weights="uniform")
)

esanchezSavvyds commented 6 years ago

@weixuanfu @rasbt I have tried to export that pipeline to PMML. I have removed the FunctionTransformer because I can't pickle it. The issue I'm having when I try it is that FeatureUnion is not supported. The specification of the JAVA API I'm using to export to PMML says VotingClassifier is supported. Here you have it https://github.com/jpmml/jpmml-sklearn. Thank you all

rhiever commented 6 years ago

That's too bad. Perhaps we can add an option (or series of flags) to disable features such as stacking and pipeline splitting (i.e., CombineDFs). Disabling those features should then make the TPOT pipelines PMML compliant.

vruusmann commented 6 years ago

First, the FeatureUnion meta-transformation has been supported for over a year or so. I'm afraid that @esanchezSavvyds is simply using an outdated sklearn2pmml package version.

Second, I intend to catch up with TPOT-specific estimator types sometime in December. You can track my progress by subscribing to this issue: https://github.com/jpmml/jpmml-sklearn/issues/54

As a workaround, it might be worthwhile to define a PMML-specific configuration dictionary (argument config_dict = "TPOT pmml" to TPOT estimator types), which restricts the use of estimator and transformer types that are currently not convertible to PMML. However, this configuration dictionary should be maintained by the SkLearn2PMML/JPMML-SkLearn projects, because it should not be TPOT's concern long-term?

rhiever commented 6 years ago

Yes, that was my thinking as well, @vruusmann. The issue is that stacking and pipeline splitting are not currently configurable in TPOT configuration dictionaries; they are always on by default. Hence my suggestion to add options to turn them off.

I agree that it would be wise for SkLearn2PMML/JPMML-SkLearn to maintain a TPOT configuration dictionary that is 100% PMML compliant. We could document the location of that configuration dictionary in the TPOT docs and point to the corresponding SkLearn2PMML/JPMML-SkLearn docs page.

esanchezSavvyds commented 6 years ago

@vruusmann I am using the latest version of the Java command-line application and these versions:

And we are getting that FeatureUnion is not supported with this error:

Failed to convert
java.lang.IllegalArgumentException: The estimator object (Python class sklearn.pipeline.FeatureUnion) is not an Estimator or is not a supported Estimator subclass
    at sklearn.EstimatorUtil$1.apply(EstimatorUtil.java:90)
    at sklearn.EstimatorUtil$1.apply(EstimatorUtil.java:78)
    at sklearn.EstimatorUtil.asEstimator(EstimatorUtil.java:42)
    at sklearn2pmml.PMMLPipeline.getEstimator(PMMLPipeline.java:216)
    at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:73)
    at org.jpmml.sklearn.Main.run(Main.java:144)
    at org.jpmml.sklearn.Main.main(Main.java:93)
Caused by: java.lang.ClassCastException: sklearn.pipeline.FeatureUnion cannot be cast to sklearn.Estimator
    at sklearn.EstimatorUtil$1.apply(EstimatorUtil.java:88)

vruusmann commented 6 years ago

@rhiever @weixuanfu Here's a possible workflow for automatically generating PMML-compatible TPOT configuration dictionaries: https://github.com/jpmml/jpmml-sklearn/issues/55

vruusmann commented 6 years ago

@esanchezSavvyds You're using the FeatureUnion transformation type in a context which requires an estimator type. Specifically, feature union cannot be the last step of a pipeline.
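A quick pre-flight check along these lines can catch that situation before the converter is invoked. This is only a sketch: the function name is hypothetical, the classes are stand-ins so the snippet runs without scikit-learn, and looking for a predict method is a heuristic rather than the test JPMML-SkLearn itself applies.

```python
class FakeUnion:
    """Stands in for FeatureUnion: it transforms but cannot predict."""

    def transform(self, X):
        return X


class FakeClassifier:
    """Stands in for any scikit-learn estimator with a predict method."""

    def predict(self, X):
        return [0 for _ in X]


class FakePipeline:
    """Stands in for sklearn.pipeline.Pipeline; only the steps list matters."""

    def __init__(self, *steps):
        self.steps = list(steps)


def final_step_is_estimator(pipeline):
    """Return True if the last (name, object) step exposes a predict method."""
    _name, last_step = pipeline.steps[-1]
    return callable(getattr(last_step, "predict", None))


bad = FakePipeline(("union", FakeUnion()))
good = FakePipeline(("union", FakeUnion()), ("knn", FakeClassifier()))

print(final_step_is_estimator(bad))   # False
print(final_step_is_estimator(good))  # True
```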

weixuanfu commented 6 years ago

@esanchezSavvyds For the pickling issue, you may try using copy.copy instead of the lambda function used in older versions of TPOT.

from copy import copy
exported_pipeline = make_pipeline(
    make_union(
        make_union(VotingClassifier([('branch',
            DecisionTreeClassifier(criterion="gini", max_depth=8, min_samples_leaf=5, min_samples_split=5)
        )]), FunctionTransformer(copy)),
        SelectKBest(score_func=f_classif, k=20)
    ),
    KNeighborsClassifier(n_neighbors=10, p=1, weights="uniform")
)
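
The reason the swap helps can be shown with the standard library alone. The sketch below uses a small hypothetical helper, can_pickle, to show that pickle stores functions by importable name, which an anonymous lambda does not have, while copy.copy can be looked up again when the pickle is loaded.

```python
import pickle
from copy import copy


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Same shape as the identity lambda in older TPOT exports.
identity = lambda X: X

print(can_pickle(identity))  # False: a lambda has no importable name
print(can_pickle(copy))      # True: stored as a reference to copy.copy
```

Note that FunctionTransformer(copy) returns a shallow copy of its input rather than the same object, which should not matter for a pass-through step.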

esanchezSavvyds commented 6 years ago

@vruusmann I think the problem is not creating a compatible configuration dictionary, which I agree has to be done. The problem is that, as far as I can understand, StackingEstimator cannot be disabled in TPOT and it's not supported by sklearn2pmml.

weixuanfu commented 6 years ago

@esanchezSavvyds One of my dev branches of TPOT, called noCDF_noStacking, has an option named simple_pipeline, which disables both StackingEstimator and CombineDFs when simple_pipeline=True (e.g. TPOTClassifier(simple_pipeline=True)). Note that this dev branch is not fully tested yet. If you want to try TPOT without StackingEstimator and FeatureUnion, you can install this branch in your test environment with the command below:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/weixuanfu/tpot.git@noCDF_noStacking

esanchezSavvyds commented 6 years ago

Hi @weixuanfu, first of all, thank you very much for your effort. We are testing this feature and have found that the ZeroCount transformation is not supported either. This is the error:

java.lang.IllegalArgumentException: The transformer object (Python class tpot.builtins.zero_count.ZeroCount) is not a Transformer or is not a supported Transformer subclass

vruusmann commented 6 years ago

@esanchezSavvyds You should list all "problematic" TPOT estimator and transformer types here: https://github.com/jpmml/jpmml-sklearn/issues/54

weixuanfu commented 6 years ago

@esanchezSavvyds You may want to use a configuration dictionary that excludes ZeroCount and XGBClassifier from the default dictionary and pass it to the config_dict parameter in TPOT, in order to avoid the operators that PMML does not support so far. Please check the example in this link
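As a rough illustration of the idea, the sketch below filters a small stand-in dictionary. It assumes the operator keys are import paths such as "tpot.builtins.ZeroCount" and "xgboost.XGBClassifier", which is how I understand TPOT's default dictionary (tpot.config.classifier_config_dict) to be keyed. The helper name without_operators is hypothetical.

```python
def without_operators(config_dict, unsupported):
    """Return a copy of a TPOT-style config dict with the listed operators removed."""
    return {
        operator: params
        for operator, params in config_dict.items()
        if operator not in unsupported
    }


# Small stand-in for tpot.config.classifier_config_dict.
example_config = {
    "sklearn.tree.DecisionTreeClassifier": {"max_depth": range(1, 11)},
    "tpot.builtins.ZeroCount": {},
    "xgboost.XGBClassifier": {"n_estimators": [100]},
}

pmml_config = without_operators(
    example_config, {"tpot.builtins.ZeroCount", "xgboost.XGBClassifier"}
)
print(sorted(pmml_config))  # ['sklearn.tree.DecisionTreeClassifier']

# With TPOT installed, the result would be passed as
#   TPOTClassifier(config_dict=pmml_config)
```

The same pattern applies to any other operator the converter rejects: add its key to the unsupported set.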

esanchezSavvyds commented 6 years ago

@weixuanfu We have been testing your feature and it works correctly for us. We have been able to export every pipeline without any problem (using an appropriate configuration dictionary, as you suggested).

weixuanfu commented 6 years ago

@esanchezSavvyds OK, sounds good!

esanchezSavvyds commented 6 years ago

Hello @weixuanfu, we have been using your "noCDF_noStacking" feature during this time and we haven't had any problems. We would like to know whether you plan to finish this branch and merge it into the master branch.

Thank you very much.

weixuanfu commented 6 years ago

@esanchezSavvyds It is good to know that it works for you. That branch is one of my test branches to test the performance of using the simple linear pipelines in TPOT vs. the tree-based ones. We need more tests and discussions before deciding to merge this branch to master branch.

esanchezSavvyds commented 6 years ago

Hello @weixuanfu , Sorry for asking it again but, do you know more or less when your "noCDF_noStacking" branch will be merged with the master branch? It fits perfectly in our work, but now it's a lot of commits behind the master branch. Thank you very much :) .

vruusmann commented 5 years ago

I've refactored TPOT support in SkLearn2PMML package version 0.46.0 (available in PyPI), and explained some technical details/gotchas in the following technical article: https://openscoring.io/blog/2019/06/10/converting_sklearn_tpot_pipeline_pmml/

TLDR: TPOT fitted pipelines convert very nicely into PMML data format.

MONTYYUAN commented 4 years ago

Hi @weixuanfu @rhiever @rasbt @vruusmann, when the exported pipeline contains "from sklearn.preprocessing import Normalizer", the TPOT fitted pipeline cannot be converted into PMML format, as the sklearn2pmml package (SkLearn2PMML/JPMML-SkLearn) does not support it. How can I solve this? Is it possible to remove or restrict operators like sklearn.preprocessing.Normalizer?

weixuanfu commented 4 years ago

@MONTYYUAN Please check the function about customizing TPOT's operators.