This package contains a Machine which is meant to do the learning for you. It can automatically create a fitting predictive model for the given data. During learning it tests several models and picks the one that scores best, for example:
```
Testing: Gradient Boosting Classifier
[########################################] | 100% Completed | 3.9s
Score: 0.9667
Testing: Ada Boost Classifier
[########################################] | 100% Completed | 1.3s
Score: 0.9600
Testing: Random Forest Classifier
[########################################] | 100% Completed | 5.0s
Score: 0.9600
Testing: Balanced Random Forest Classifier
[########################################] | 100% Completed | 3.5s
Score: 0.9600
Testing: SVC
[########################################] | 100% Completed | 1.2s
Score: 0.9667
Chosen model: Gradient Boosting Classifier 0.9667
Params:
min_samples_split: 2
n_estimators: 100
Results saved to output.csv
```
To use the package run:
```
pip install modelcreator
```
The input may be either a path to a csv file or a pandas DataFrame object.
The library assumes that the last column of the training dataset contains the expected results. When the data is passed as a file, both the training and the prediction datasets must be csv files.
If the results column contains text, the Machine will do its best to learn to classify the data correctly. If it contains numbers, regression will be performed instead.
If the file contains headers, add the header_in_csv=True parameter to the method call.
```python
from modelcreator import Machine
# Create automl machine instance
machine = Machine()
# Train machine learning model
machine.learn('example-data/iris.csv')
# Predict the outcomes
machine.predict('example-data/iris-pred.csv', 'output.csv')
```
This example is also available in the example.py
file. Consider trying it on your own.
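If your csv files contain a header row, the same flow works with the header_in_csv flag mentioned above. A minimal sketch, assuming hypothetical file names and keyword arguments (the parameter names are the ones listed in the tables further down):

```python
from modelcreator import Machine

machine = Machine()

# The first row of the file is treated as column names, not as data
machine.learn('example-data/my-data.csv', header_in_csv=True)

# The prediction file has a header row as well, so the flag is passed again
machine.predict('example-data/my-data-pred.csv', output_file='output.csv', header_in_csv=True)
```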
But what if the results column is not the last one in the given csv? It may be inconvenient to rewrite the whole file just to swap two columns. For this reason, Machine also provides the learnFromDf
and predictFromDf
methods. The Df in the method names stands for DataFrame from the pandas module, so you can handle reading the file yourself.
```python
from modelcreator import Machine
import pandas as pd
# Create DataFrame object from file
train = pd.read_csv("train.csv")
# Get features columns from DataFrame
X_train = train.drop(['Survived'], axis=1)
# And labels (results) column
y_train = train["Survived"].astype(str)
# Create the instance of Machine
machine = Machine()
# Train machine learning model
machine.learnFromDf(X_train, y_train, computation_level='advanced')
# Show parameters of the model
machine.showParams()
# Load test set from file
X_test = pd.read_csv("test.csv")
# Predict the labels
results = machine.predictFromDf(X_test)
# Save results to a new file
results.to_csv("results.csv")
```
Simple? That's right! Just note that we used astype(str)
in order to treat the data as classes rather than numbers, because the Titanic dataset used in the example above has values 0 and 1 in the "Survived"
column to indicate whether a person survived the disaster.
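To make the distinction explicit, here is a minimal sketch of the two label variants, using the same train.csv as in the example above:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Left as numbers, the 0/1 values would be treated as a regression target
y_as_numbers = train["Survived"]

# Cast to strings, the same values are treated as class labels, so classification is performed
y_as_classes = train["Survived"].astype(str)
```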
If you want to avoid re-learning on the whole dataset just to make a simple prediction, you can save the state of the Machine to a file.
```python
# Save Machine with a trained model to "machine.pkl"
machine.saveMachine('machine.pkl')
# Create a new machine based on a schema file
machine2 = Machine('machine.pkl')
```
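The restored Machine can then make predictions right away, with no re-training. A minimal sketch, reusing the Titanic test file from the example above:

```python
import pandas as pd
from modelcreator import Machine

# Recreate the Machine from the previously saved state
machine2 = Machine('machine.pkl')

# The model inside is already trained, so predictions can be made immediately
X_test = pd.read_csv("test.csv")
results = machine2.predictFromDf(X_test)
```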
The Machine can be customized according to the use case. Check the parameter tables below.

Machine constructor:

Param | Type | Default | Description
---|---|---|---
schema | None or str | None | A Machine may be created from a saved, pre-trained Machine instance. Specify the path to the saved instance in this parameter to recreate it.
learn method:

Param | Type | Default | Description
---|---|---|---
dataset_file | str | | Path to a csv file containing the training dataset.
header_in_csv | bool | False | Whether the csv file contains headers in the first row.
metrics | None, str or Callable | 'accuracy' or 'neg_root_mean_squared_error' | Metrics used for scoring estimators. Many popular scoring functions are accepted by name (such as 'f1', 'roc_auc', 'neg_mean_gamma_deviance'). See here how to make custom scoring functions.
verbose | bool | True | Whether to print learning logs.
cv | int | 3 | Number of cross-validation subsets. Higher values may increase computation time.
computation_level | str | 'medium' | Can be either 'basic', 'medium' or 'advanced'. With a higher computation level, more models and parameters are tested.
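For example, several of these parameters can be combined in a single learn call. A minimal sketch, reusing the iris file from the first example; the metric name assumes that scoring strings follow the scikit-learn naming which the examples in the table suggest, and the cv value is arbitrary:

```python
from modelcreator import Machine

machine = Machine()

# Score candidates with macro-averaged F1, use 5 cross-validation folds,
# test the widest range of models, and silence the progress logs
machine.learn(
    'example-data/iris.csv',
    metrics='f1_macro',
    cv=5,
    computation_level='advanced',
    verbose=False,
)
```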
learnFromDf method:

Param | Type | Default | Description
---|---|---|---
X | pandas.DataFrame | | DataFrame containing the feature columns.
y | pandas.Series | | Labels (results) column of the training data.
metrics | None, str or Callable | 'accuracy' or 'neg_root_mean_squared_error' | Metrics used for scoring estimators. Many popular scoring functions are accepted by name (such as 'f1', 'roc_auc', 'neg_mean_gamma_deviance'). See here how to make custom scoring functions.
verbose | bool | True | Whether to print learning logs.
cv | int | 3 | Number of cross-validation subsets. Higher values may increase computation time.
computation_level | str | 'medium' | Can be either 'basic', 'medium' or 'advanced'. With a higher computation level, more models and parameters are tested.
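Since metrics also accepts a Callable, a custom scorer can be passed. The sketch below assumes the Callable follows the scikit-learn scorer convention, which the scoring-string examples above suggest but which is not confirmed here; the dataset is the Titanic one from the earlier example:

```python
import pandas as pd
from sklearn.metrics import make_scorer, fbeta_score
from modelcreator import Machine

train = pd.read_csv("train.csv")
X_train = train.drop(['Survived'], axis=1)
y_train = train["Survived"].astype(str)

# A custom scorer: F-beta with beta=2, built the scikit-learn way
# (assumes modelcreator forwards it to the estimator scoring step)
f2_scorer = make_scorer(fbeta_score, beta=2, pos_label='1')

machine = Machine()
machine.learnFromDf(X_train, y_train, metrics=f2_scorer)
```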
predict method:

Param | Type | Default | Description
---|---|---|---
features_file | str | | Path to the features csv of the data to generate predictions on.
header_in_csv | bool | False | Whether the csv file contains headers in the first row.
output_file | str | 'output.csv' | Path to the output csv file in which the predictions will be saved.
verbose | bool | True | Whether to print logs.
predictFromDf method:

Param | Type | Default | Description
---|---|---|---
X_predictions | pandas.DataFrame | | Feature columns to generate predictions on.
output_file | str | None | The method returns a pandas.Series with the results; additionally, it can save them to a csv file. If this value is not None, it is interpreted as the path to the output file.
verbose | bool | True | Whether to print logs.
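As with predict, the results of predictFromDf can be written to a file directly instead of calling to_csv yourself. A minimal sketch, assuming the output_file keyword from the table above and the Titanic test set from the earlier example:

```python
import pandas as pd
from modelcreator import Machine

machine = Machine('machine.pkl')
X_test = pd.read_csv("test.csv")

# Returns the predictions and also writes them to the given csv file
results = machine.predictFromDf(X_test, output_file="results.csv")
```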
saveMachine method:

Param | Type | Default | Description
---|---|---|---
output_file_name | str | 'machine.pkl' | Path where the Machine instance will be saved.
Have a feature idea or just want to help? Take a look at the issues tab!