conradbm / streamml

Streamlined Machine Learning
MIT License


Streamlined Machine Learning is best explained by describing its structure. It provides three primary functions: (1) transformation preprocessing, (2) model selection, and (3) feature selection. streamml contains a set of robust functions built on top of the sklearn framework and aims to streamline all of these processes within a single flow.

Background

The three main classes in the streamml ecosystem are TransformationStream, ModelSelectionStream, and FeatureSelectionStream. Before using any of them, you are expected to have cleaned and fully preprocessed your data, just as you would before running any model or transformation in sklearn. All X and y data piped into these classes must be a pandas.DataFrame, including y data that would naturally be a pandas.Series (this simplifies and unifies functionality across the ecosystem).

TransformationStream is constructed with X and can then flow through a range of manifold, clustering, and transformation functions built into the ecosystem (explained in further detail in the documentation).

ModelSelectionStream is constructed with both X and y, where y can be categorical (binary or n-ary), and can then flow through a range of sklearn-based model objects. The underlying assumption is that your y data has already been encoded into a numeric representation, as this is how sklearn prefers it. We recommend pandas.factorize for this, but it is not done for you, explicitly or implicitly.

Lastly, FeatureSelectionStream is constructed with both X and y and can then flow through a specific set of model objects and ensemble functions. These are models that expose coef_, feature_importances_, or p-values after the hyperparameter tuning phase or after fitting. Examples include, but are not limited to, OLS p-values, Random Forest feature importances, and Lasso coefficients.
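To make the last point concrete, the sketch below fits two plain sklearn models on synthetic data and reads the attributes mentioned above. It uses sklearn directly and is not streamml's internal code; the data, the alpha value, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(5)])
# Only x0 and x1 drive the target; the remaining columns are pure noise.
# Plain sklearn accepts a 1-D y here, unlike the DataFrame streamml asks for.
y = 3.0 * X["x0"] - 2.0 * X["x1"] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# One row per feature, one column per importance signal.
ranking = pd.DataFrame(
    {"lasso_abs_coef": np.abs(lasso.coef_),
     "forest_importance": forest.feature_importances_},
    index=X.columns,
)
print(ranking.sort_values("forest_importance", ascending=False))
```

Both signals should rank x0 and x1 above the noise columns, which is the kind of evidence a feature selection step can combine.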

Let's Get Started

In the streamml ecosystem, as mentioned above, we first build a stream object. Within this stream, we flow through specific objects that are optimized for us behind the scenes. That's it. All of the grid-searching and pipelining procedures you are used to writing every time you see a new dataset are already built in. Just construct a stream and call .flow([...]) on it, and it will return your hypertuned models, a transformed feature space, or the subset of features that are most pronounced within your data. The streaming capabilities provided are the three classes described above: TransformationStream, ModelSelectionStream, and FeatureSelectionStream.
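For comparison, this is roughly the manual sklearn boilerplate that a stream is meant to replace: a pipeline, a parameter grid, and a cross-validated grid search. It is a sketch written against sklearn only. It does not show how streamml implements .flow internally, and the data and grid values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.exponential(size=(100, 4)), columns=list("abcd"))
# Synthetic target: the row sum of the features plus a little noise.
y = X.sum(axis=1) + rng.normal(scale=0.1, size=100)

pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR())])
# Keys follow sklearn's "<step>__<parameter>" naming convention.
grid = {"svr__C": [0.1, 1, 10], "svr__kernel": ["linear", "rbf"]}

# 3-fold cross-validated search over every combination in the grid.
search = GridSearchCV(pipe, grid, cv=3, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Repeating this for every candidate model is the repetitive part that the stream classes aim to hide behind a single call.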


Some Examples

Simple data set

import numpy as np
import pandas as pd

X = pd.DataFrame(np.random.exponential(size=(200, 10)))  # 200 rows, 10 features
y = pd.DataFrame(np.random.exponential(size=200))  # single-column target

Supported stream operators: scale, normalize, boxcox, binarize, pca, kmeans, brbm (Bernoulli Restricted Boltzmann Machine).

Xnew = TransformationStream(X).flow(
    ["scale", "normalize", "pca", "binarize", "boxcox", "kmeans", "brbm"],
    params={"pca__percent_variance": 0.75,
            "kmeans__n_clusters": 2,
            "binarize__threshold": 0.5,
            "brbm__n_components": X.shape[1],
            "brbm__learning_rate": 0.0001},
    verbose=True)
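The pca__percent_variance parameter above presumably keeps enough principal components to explain the requested share of variance. Assuming that reading, the closest equivalent in plain sklearn is a float n_components. The following is a sketch of the idea on synthetic data, not streamml's own code.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.exponential(size=(200, 10)))

X_scaled = StandardScaler().fit_transform(X)
# A float in (0, 1) tells sklearn to keep the smallest number of components
# whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.75, svd_solver="full")
X_pca = pd.DataFrame(pca.fit_transform(X_scaled))

print(X_pca.shape, pca.explained_variance_ratio_.sum())
```

The number of output columns depends on the data, so downstream steps should not assume a fixed width.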

Regression

performances = ModelSelectionStream(Xnew, y).flow(
    ["svr", "lr", "knnr", "lasso", "abr", "mlp", "enet"],
    params={'svr__C': [1, 0.1, 0.01, 0.001],
            'svr__gamma': [0, 0.01, 0.001, 0.0001],
            'svr__kernel': ['poly', 'rbf'],
            'svr__epsilon': [0, 0.1, 0.01, 0.001],
            'svr__degree': [1, 2, 3, 4, 5, 6, 7],
            'lr__fit_intercept': [False, True],
            'knnr__n_neighbors': [3, 5, 7, 9, 11, 13],
            'lasso__alpha': [0, 0.1, 0.01, 1, 10.0, 20.0],
            'ridge__alpha': [0, 0.1, 0.01, 1, 10.0, 20.0],
            'enet__alpha': [0, 0.1, 0.01, 1, 10, 20],
            'enet__l1_ratio': [.25, .5, .75],
            'abr__n_estimators': [10, 20, 50],
            'abr__learning_rate': [0.1, 1, 10, 100],
            'rfr__criterion': ['mse', 'mae'],
            'rfr__n_estimators': [10, 100, 1000]},
    metrics=['r2', 'rmse', 'mse',
             'explained_variance', 'mean_absolute_error',
             'median_absolute_error'],
    verbose=True,
    regressors=True,
    cut=2)
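The parameter keys above follow the `<model>__<parameter>` pattern familiar from sklearn pipelines. The helper below is hypothetical (it is not part of streamml) and simply groups a grid by model prefix. That makes it easy to spot keys such as ridge__alpha or rfr__n_estimators whose model is not in the list passed to .flow. How streamml itself treats such unused keys is not stated here.

```python
from collections import defaultdict


def group_params_by_model(params):
    """Split '<model>__<parameter>' keys into one grid per model prefix."""
    grouped = defaultdict(dict)
    for key, values in params.items():
        # partition splits on the first "__" only, leaving the rest intact.
        model, _, name = key.partition("__")
        grouped[model][name] = values
    return dict(grouped)


# Model list from the regression call, with a subset of its grid for brevity.
models = ["svr", "lr", "knnr", "lasso", "abr", "mlp", "enet"]
params = {
    "svr__C": [1, 0.1, 0.01, 0.001],
    "lr__fit_intercept": [False, True],
    "ridge__alpha": [0, 0.1, 0.01, 1, 10.0, 20.0],
    "rfr__n_estimators": [10, 100, 1000],
}

grouped = group_params_by_model(params)
# Prefixes that have a grid but no matching model in the list.
unused = sorted(set(grouped) - set(models))
print(unused)  # ['rfr', 'ridge']
```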

Classification

# X2 and y2 are a classification data set (pandas.DataFrame objects, y2 numerically encoded)
performances = ModelSelectionStream(X2, y2).flow(
    ["abc"],
    params={'abc__n_estimators': [10, 100, 1000],
            'abc__learning_rate': [0.001, 0.01, 0.1, 1, 10, 100]},
    metrics=["auc",
             "prec",
             "recall",
             "f1",
             "accuracy",
             "kappa",
             "log_loss"],
    verbose=True,
    modelSelection=True,
    regressors=False)
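The classification call assumes that y2 already holds numeric class codes in a pandas.DataFrame. A minimal sketch of that preparation with pandas.factorize follows; the labels and variable names are made up for illustration.

```python
import pandas as pd

labels = pd.Series(["cat", "dog", "cat", "bird", "dog"], name="target")

# factorize returns integer codes plus the unique labels in order of appearance.
codes, uniques = pd.factorize(labels)

# Wrap the codes in a single-column DataFrame, the form streamml expects for y.
y2 = pd.DataFrame({"target": codes})

print(y2["target"].tolist())  # [0, 1, 0, 2, 1]
print(list(uniques))          # ['cat', 'dog', 'bird']
```

Keeping uniques around lets you map predicted codes back to the original labels afterwards.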