ML training and prediction modules

Purpose: Rebase and explore the ML elements for training and prediction.

Notes:

rebased all ML components
default parameter values are stored in a YAML file in the package data and accompany the parent class
archival capability for trained models
added new prediction functionality
switched from standardization to normalization for data prep
switched to an IQR approach for outlier detection and exclusion, with a scale constant of 3.5 so that it is very permissive yet controlled
the previous tell.predict() is now tell.train(), and tell.predict() now generates demand predictions
included parallel batch processing functionality for both the training and prediction functions
performance improvements
etc.
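To make the data prep notes above concrete, here is a minimal sketch of IQR-based outlier exclusion with a scale constant of 3.5, followed by normalization. It is not the code in this PR: the function names are hypothetical, the quartile method is an assumption, and "normalization" is assumed to mean min-max scaling to the range 0 to 1.

```python
import statistics


def iqr_bounds(values, scale_constant=3.5):
    """Return (lower, upper) bounds from the interquartile range (IQR)."""
    # quartile method is an assumption; other methods shift the bounds slightly
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - scale_constant * iqr, q3 + scale_constant * iqr


def drop_outliers(values, scale_constant=3.5):
    """Keep only the values that fall inside the IQR bounds."""
    lower, upper = iqr_bounds(values, scale_constant)
    return [v for v in values if lower <= v <= upper]


def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # a constant series carries no spread to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# made-up demand values with one obvious spike
demand = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 5000.0]
cleaned = drop_outliers(demand)
scaled = min_max_normalize(cleaned)
```

A larger scale constant widens the bounds, so 3.5 only removes values that sit far outside the bulk of the data, compared with the more common 1.5.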

Usage:

At a bare minimum, train one region with all defaults. This does not save the models by default, but returns the prediction data frame and performance data frame:

import tell

pdf, vdf = tell.train(region="PJM",
                      data_dir="<my data directory>")

We can also control our parameters externally to override the defaults:

import tell

pdf, vdf = tell.train(region="PJM",
                      data_dir="<my data directory>",
                      start_time="2016-01-01 00:00:00",
                      end_time="2019-12-31 23:00:00",
                      split_datetime="2018-12-31 23:00:00",
                      mlp_linear_adjustment=True,
                      save_model=True)
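For intuition on what split_datetime does, here is a minimal sketch of a time-based train/test split. The helper name is hypothetical, and whether the split timestamp itself lands in the training set is an assumption; the actual behavior in tell.train() is not shown here.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def split_by_datetime(timestamps, split_datetime):
    """Split timestamps into (train, test) around a cutoff string."""
    cutoff = datetime.strptime(split_datetime, FMT)
    # assumption: the cutoff hour itself belongs to the training set
    train = [t for t in timestamps if t <= cutoff]
    test = [t for t in timestamps if t > cutoff]
    return train, test


# six hourly timestamps straddling the default split point
start = datetime.strptime("2018-12-31 21:00:00", FMT)
hours = [start + timedelta(hours=i) for i in range(6)]

train, test = split_by_datetime(hours, "2018-12-31 23:00:00")
```

With the values used above, everything through the end of 2018 would be used for training and all of 2019 for evaluation.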

We can also train all balancing authorities (BAs) in parallel:

import tell

# generate a list of balancing authority (BA) abbreviations to process
ba_abbrev_list = list(tell.get_balancing_authority_to_model_dict().keys())

# process all BAs in the list (this one runs over 4 jobs)
pdf, vdf = tell.train_batch(target_region_list=ba_abbrev_list,
                            data_dir="<my data directory>",
                            start_time="2016-01-01 00:00:00",
                            end_time="2019-12-31 23:00:00",
                            split_datetime="2018-12-31 23:00:00",
                            mlp_linear_adjustment=True,
                            save_model=True,
                            n_jobs=4)
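As a rough picture of what a batch call with n_jobs does, here is a sketch that fans a per-region function out over a worker pool. The mechanism used inside tell.train_batch() is not shown in this description, so this uses the standard library instead; train_region and run_batch are hypothetical stand-ins, and the region names other than PJM are only illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def train_region(region):
    """Hypothetical stand-in for training a single region."""
    return {"region": region, "status": "trained"}


def run_batch(regions, n_jobs=4):
    """Run train_region over all regions using up to n_jobs workers."""
    # list() lets callers pass a dict view or any other iterable
    regions = list(regions)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map returns results in the same order as the input
        return list(pool.map(train_region, regions))


# dict keys used here to mirror how the BA list is built above
region_lookup = {"PJM": None, "CISO": None, "ERCO": None}
results = run_batch(region_lookup.keys(), n_jobs=2)
```

Each region is independent of the others, which is what makes this kind of work easy to spread across jobs.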

Prediction works in a similar way: the prediction functions take a year as an additional required argument and return only the prediction data frame. Here is an example for a single region:

import tell

pdf = tell.predict(region="PJM",
                   year=2039,
                   data_dir="<my data directory>",
                   datetime_field_name="Time_UTC",
                   mlp_linear_adjustment=True)

And prediction in parallel:

import tell

# generate a list of balancing authority (BA) abbreviations to process
ba_abbrev_list = list(tell.get_balancing_authority_to_model_dict().keys())

# process all BAs in list
pdf = tell.predict_batch(target_region_list=ba_abbrev_list,
                         year=2039,
                         data_dir="<my data directory>",
                         mlp_linear_adjustment=True,
                         n_jobs=4)

All of the default values are stored in tell/data/mlp_settings.yml, which looks like this:

# default settings for the MLP module of TELL

# The ith element represents the number of neurons in the ith hidden layer.
mlp_hidden_layer_sizes: 256

# Maximum number of iterations. The solver iterates until convergence
# (determined by ‘tol’) or this number of iterations. For stochastic solvers
# (‘sgd’, ‘adam’), note that this determines the number of epochs (how many
# times each data point will be used), not the number of gradient steps.
mlp_max_iter: 500

# The proportion of training data to set aside as validation set for early
# stopping. Must be between 0 and 1.
mlp_validation_fraction: 0.1

# True if you want to correct the MLP model using a linear model.
mlp_linear_adjustment: True

# True if setting up data for a linear model that will be run and will cause
# the application of the sine function for hour and month fields if they
# are present in the data.
apply_sine_function: False

# Dictionary for the field names present in the input CSV file (keys) to what the
# code expects them to be (values).
data_column_rename_dict: {
    "Adjusted_Demand_MWh": "Demand",
    "Total_Population": "Population",
    "T2": "Temperature",
    "SWDOWN": "Shortwave_Radiation",
    "GLW": "Longwave_Radiation",
    "WSPD": "Wind_Speed",
    "Q2": "Specific_Humidity"
}

# Expected names of the date time columns in the input CSV file.
expected_datetime_columns: ["Day", "Month", "Year", "Hour"]

# Field name of the hour field in the input CSV file.
hour_field_name: "Hour"

# Field name of the month field in the input CSV file.
month_field_name: "Month"

# Field name of the year field in the input CSV file.
year_field_name: "Year"

# Target variable list.
x_variables: ["Hour", "Month", "Temperature", "Specific_Humidity", "Wind_Speed", "Longwave_Radiation", "Shortwave_Radiation"]

# True if the user wishes to add weekday and holiday targets to the x variables.
add_dayofweek_xvars: True

# Feature variable list.
y_variables: ["Demand"]

# List of day abbreviations and their order.
day_list: ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Timestamp showing the datetime for the run to start
start_time: "2016-01-01 00:00:00"

# Timestamp showing the datetime for the run to end
end_time: "2019-12-31 23:00:00"

# Timestamp showing the datetime to split the train and test data by
split_datetime: "2018-12-31 23:00:00"

# Seed value to reproduce randomization.
seed_value: 391

# Target variable list for the linear model.
x_variables_linear: ["Population", "Hour", "Month", "Year"]

# Feature variable list for the linear model.
y_variables_linear: ["Demand"]

# Choice to write ML models to a pickled file via joblib.
save_model: False

# Full path to output directory where model file will be written. Default uses package data.
model_output_directory: "Default"

# Choice to see logged outputs.
verbose: False
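To illustrate how keyword arguments can override these defaults, here is a minimal sketch. It is an assumption about the general pattern, not the code in this PR: DEFAULTS is a hand-copied subset of the settings above standing in for the loaded YAML file, resolve_settings is a hypothetical helper, and rejecting unknown keys is a design choice made for this example.

```python
# hand-copied subset of the settings above, standing in for the loaded YAML
DEFAULTS = {
    "mlp_hidden_layer_sizes": 256,
    "mlp_max_iter": 500,
    "mlp_linear_adjustment": True,
    "save_model": False,
    "seed_value": 391,
}


def resolve_settings(defaults, **overrides):
    """Return a new dict of defaults updated with any keyword overrides."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        # fail early on typos instead of silently ignoring them
        raise KeyError(f"Unknown setting(s): {sorted(unknown)}")
    # later keys win, so overrides replace defaults; defaults is not mutated
    return {**defaults, **overrides}


settings = resolve_settings(DEFAULTS, save_model=True, mlp_max_iter=250)
```

Anything not passed in keeps its default value, which is why the bare-minimum call only needs a region and a data directory.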

If you need a reminder of which arguments a function accepts, call help on it:

import tell

help(tell.predict)

which prints:

Help on function predict in module tell.mlp_predict:

predict(region: str, year: int, data_dir: str, datetime_field_name: str = 'Time_UTC', **kwargs)
    Generate predictions for MLP model for a target region from an input CSV file.

    :param region:                      Indicating region / balancing authority we want to train and test on.
                                        Must match with string in CSV files.
    :type region:                       str

    :param year:                        Target year to use in YYYY format.
    :type year:                         int

    :param data_dir:                    Full path to the directory that houses the input CSV files.
    :type data_dir:                     str

    :param datetime_field_name:         Name of the datetime field.
    :type datetime_field_name:          str

    :param mlp_linear_adjustment:       True if you want to correct the MLP model using a linear model.
    :type mlp_linear_adjustment:        Optional[bool]

    :param apply_sine_function:         True if setting up data for a linear model that will be run and will cause
                                        the application of the sine function for hour and month fields if they
                                        are present in the data.
    :type apply_sine_function:          Optional[bool]

    :param data_column_rename_dict:     Dictionary for the field names present in the input CSV file (keys) to what the
                                        code expects them to be (values).
    :type data_column_rename_dict:      Optional[dict[str]]

    :param expected_datetime_columns:   Expected names of the date time columns in the input CSV file.
    :type expected_datetime_columns:    Optional[list[str]]

    :param hour_field_name:             Field name of the hour field in the input CSV file.
    :type hour_field_name:              Optional[str]

    :param month_field_name:            Field name of the month field in the input CSV file.
    :type month_field_name:             Optional[str]

    :param x_variables:                 Target variable list.
    :type x_variables:                  Optional[list[str]]

    :param add_dayofweek_xvars:         True if the user wishes to add weekday and holiday targets to the x variables.
    :type add_dayofweek_xvars:          Optional[bool]

    :param y_variables:                 Feature variable list.
    :type y_variables:                  Optional[list[str]]

    :param day_list:                    List of day abbreviations and their order.
    :type day_list:                     Optional[list[str]]

    :param seed_value:                  Seed value to reproduce randomization.
    :type seed_value:                   Optional[int]

    :param x_variables_linear:          Target variable list for the linear model.
    :type x_variables_linear:           Optional[list[str]]

    :param y_variables_linear:          Feature variable list for the linear model.
    :type y_variables_linear:           Optional[list[str]]

    :param verbose:                     Choice to see logged outputs.
    :type verbose:                      bool

    :return:                            Prediction data frame
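
The apply_sine_function option in the help text above mentions applying a sine function to the hour and month fields. The exact transform is not shown in this description; here is a minimal sketch, assuming each field is mapped onto half a sine wave over its period. The function names and the scaling by 24 and 12 are assumptions.

```python
import math


def sine_hour(hour):
    """Map an hour of day (0 to 23) onto a half sine wave."""
    # assumption: scaled by 24 so that midday maps to the peak
    return math.sin(math.pi * hour / 24)


def sine_month(month):
    """Map a month number (1 to 12) onto a half sine wave."""
    # assumption: scaled by 12 so that June maps to the peak
    return math.sin(math.pi * month / 12)


hour_features = [sine_hour(h) for h in range(24)]
month_features = [sine_month(m) for m in range(1, 13)]
```

A transform like this gives a linear model a smooth curve to fit, rather than treating hour 23 as 23 times larger than hour 1.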

IMMM-SFA / tell

ML training and prediction modules #21