IMMM-SFA / tell

A model to predict Total ELectricity Loads (TELL)
https://immm-sfa.github.io/tell/
BSD 2-Clause "Simplified" License

ML training and prediction modules #21

Closed crvernon closed 2 years ago

crvernon commented 2 years ago

Purpose: Rebase and explore the ML elements for training and prediction.

Notes:

Usage:

At a minimum, train a single region using all defaults. By default this does not save the trained models; it returns the prediction data frame and the performance data frame:

import tell

pdf, vdf = tell.train(region="PJM",
                      data_dir="<my data directory>")

We can also pass parameters explicitly to override the defaults:

import tell 

pdf, vdf = tell.train(region="PJM",
                      data_dir="<my data directory>",
                      start_time="2016-01-01 00:00:00",
                      end_time="2019-12-31 23:00:00",
                      split_datetime="2018-12-31 23:00:00",
                      mlp_linear_adjustment=True,
                      save_model=True)
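
To make the role of split_datetime concrete, here is a minimal sketch of splitting an hourly record at that timestamp. It is not TELL's implementation: the helper name split_by_datetime is hypothetical, and sending the split timestamp itself to the training set is an assumption made for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def split_by_datetime(timestamps, split_datetime):
    """Partition timestamps into (train, test) around split_datetime.

    Assumption: rows at or before the split go to training,
    later rows go to testing.
    """
    split = datetime.strptime(split_datetime, FMT)
    train = [t for t in timestamps if t <= split]
    test = [t for t in timestamps if t > split]
    return train, test


# build hourly timestamps covering the window used in the example above
start = datetime.strptime("2016-01-01 00:00:00", FMT)
end = datetime.strptime("2019-12-31 23:00:00", FMT)
n_hours = int((end - start).total_seconds() // 3600) + 1
hours = [start + timedelta(hours=i) for i in range(n_hours)]

train, test = split_by_datetime(hours, "2018-12-31 23:00:00")
print(len(train), len(test))
```

With these values the training portion covers 2016 through 2018 and the test portion covers 2019.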

We can also train all balancing authorities (BAs) in parallel:

import tell

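# settings shared by every BA (same values as the single-region example above)
data_dir = "<my data directory>"
start_time = "2016-01-01 00:00:00"
end_time = "2019-12-31 23:00:00"
split_datetime = "2018-12-31 23:00:00"

# generate a list of balancing authority (BA) abbreviations to process
ba_abbrev_list = tell.get_balancing_authority_to_model_dict().keys()

# process all BAs in list (this one runs over 4 jobs)
pdf, vdf = tell.train_batch(target_region_list=ba_abbrev_list,
                            data_dir=data_dir,
                            start_time=start_time,
                            end_time=end_time,
                            split_datetime=split_datetime,
                            mlp_linear_adjustment=True,
                            save_model=True,
                            n_jobs=4)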
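
For readers unfamiliar with the n_jobs pattern, the following is a rough sketch of fanning a per-region function out over a fixed number of workers. How train_batch actually distributes work is not shown in this PR; the sketch uses the standard library's ThreadPoolExecutor, and train_one and train_many are hypothetical stand-ins, not TELL functions.

```python
from concurrent.futures import ThreadPoolExecutor


def train_one(region):
    """Hypothetical stand-in for a per-region training call."""
    return {"region": region, "status": "trained"}


def train_many(regions, n_jobs=4):
    """Run train_one over every region using up to n_jobs workers."""
    regions = list(regions)  # accept any iterable, e.g. dict keys
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # pool.map returns results in the same order as the inputs
        return list(pool.map(train_one, regions))


# example region abbreviations, passed as dict keys like the call above
example_regions = {"PJM": None, "CISO": None, "ERCO": None}.keys()
results = train_many(example_regions, n_jobs=2)
print([r["region"] for r in results])
```

Because the results come back in input order, they can be combined into a single data frame afterwards without re-sorting by region.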

Prediction works in a similar way: the prediction functions take a year as an additional required argument and return only a prediction data frame. Here is an example for a single region:

import tell

pdf = tell.predict(region="PJM",
                   year=2039,
                   data_dir="<my data directory>",
                   datetime_field_name="Time_UTC",
                   mlp_linear_adjustment=True)

And prediction in parallel:

import tell

# generate a list of balancing authority (BA) abbreviations to process
ba_abbrev_list = tell.get_balancing_authority_to_model_dict().keys()

# process all BAs in list
pdf = tell.predict_batch(target_region_list=ba_abbrev_list,
                         year=2039,
                         data_dir="<my data directory>",
                         mlp_linear_adjustment=True,
                         n_jobs=4)

All of the default values are stored in tell/data/mlp_settings.yml, which looks like this:

# default settings for the MLP module of TELL

# The ith element represents the number of neurons in the ith hidden layer.
mlp_hidden_layer_sizes: 256

# Maximum number of iterations. The solver iterates until convergence
# (determined by ‘tol’) or this number of iterations. For stochastic solvers
# (‘sgd’, ‘adam’), note that this determines the number of epochs (how many
# times each data point will be used), not the number of gradient steps.
mlp_max_iter: 500

# The proportion of training data to set aside as validation set for early
# stopping. Must be between 0 and 1.
mlp_validation_fraction: 0.1

# True if you want to correct the MLP model using a linear model.
mlp_linear_adjustment: True

# True if setting up data for a linear model that will be run and will cause
# the application of the sine function for hour and month fields if they
# are present in the data.
apply_sine_function: False

# Dictionary for the field names present in the input CSV file (keys) to what the
# code expects them to be (values).
data_column_rename_dict: {
    "Adjusted_Demand_MWh": "Demand",
    "Total_Population": "Population",
    "T2": "Temperature",
    "SWDOWN": "Shortwave_Radiation",
    "GLW": "Longwave_Radiation",
    "WSPD": "Wind_Speed",
    "Q2": "Specific_Humidity"
}

# Expected names of the date time columns in the input CSV file.
expected_datetime_columns: ["Day", "Month", "Year", "Hour"]

# Field name of the hour field in the input CSV file.
hour_field_name: "Hour"

# Field name of the month field in the input CSV file.
month_field_name: "Month"

# Field name of the year field in the input CSV file.
year_field_name: "Year"

# Target variable list.
x_variables: ["Hour", "Month", "Temperature", "Specific_Humidity", "Wind_Speed", "Longwave_Radiation", "Shortwave_Radiation"]

# True if the user wishes to add weekday and holiday targets to the x variables.
add_dayofweek_xvars: True

# Feature variable list.
y_variables: ["Demand"]

# List of day abbreviations and their order.
day_list: ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Timestamp showing the datetime for the run to start
start_time: "2016-01-01 00:00:00"

# Timestamp showing the datetime for the run to end
end_time: "2019-12-31 23:00:00"

# Timestamp showing the datetime to split the train and test data by
split_datetime: "2018-12-31 23:00:00"

# Seed value to reproduce randomization.
seed_value: 391

# Target variable list for the linear model.
x_variables_linear: ["Population", "Hour", "Month", "Year"]

# Feature variable list for the linear model.
y_variables_linear: ["Demand"]

# Choice to write ML models to a pickled file via joblib.
save_model: False

# Full path to output directory where model file will be written. Default uses package data.
model_output_directory: "Default"

# Choice to see logged outputs.
verbose: False
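
The relationship between these defaults and the keyword arguments shown earlier can be sketched as a simple merge. This is an illustration only, not TELL's code: resolve_settings is a hypothetical helper, the dictionary holds just a few of the values from the file above, and rejecting unknown keys is an assumption of the sketch.

```python
# a subset of the defaults listed in mlp_settings.yml above
DEFAULTS = {
    "mlp_hidden_layer_sizes": 256,
    "mlp_max_iter": 500,
    "mlp_validation_fraction": 0.1,
    "mlp_linear_adjustment": True,
    "save_model": False,
    "verbose": False,
}


def resolve_settings(**kwargs):
    """Return the defaults updated with any keyword overrides.

    Unknown keys raise a KeyError so that a typo is not silently
    ignored (an assumption of this sketch, not documented behavior).
    """
    unknown = set(kwargs) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown setting(s): {sorted(unknown)}")
    # later entries win, so overrides replace the matching defaults
    return {**DEFAULTS, **kwargs}


settings = resolve_settings(save_model=True, mlp_max_iter=200)
print(settings["save_model"], settings["mlp_max_iter"], settings["verbose"])
```

Anything not passed explicitly keeps its default value, which matches how the usage examples above only name the arguments they want to change.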

If you need a reminder of which arguments a function accepts, call help on it:

import tell

help(tell.predict)

which prints:

Help on function predict in module tell.mlp_predict:

predict(region: str, year: int, data_dir: str, datetime_field_name: str = 'Time_UTC', **kwargs)
    Generate predictions for MLP model for a target region from an input CSV file.

    :param region:                      Indicating region / balancing authority we want to train and test on.
                                        Must match with string in CSV files.
    :type region:                       str

    :param year:                        Target year to use in YYYY format.
    :type year:                         int

    :param data_dir:                    Full path to the directory that houses the input CSV files.
    :type data_dir:                     str

    :param datetime_field_name:         Name of the datetime field.
    :type datetime_field_name:          str

    :param mlp_linear_adjustment:       True if you want to correct the MLP model using a linear model.
    :type mlp_linear_adjustment:        Optional[bool]

    :param apply_sine_function:         True if setting up data for a linear model that will be run and will cause
                                        the application of the sine function for hour and month fields if they
                                        are present in the data.
    :type apply_sine_function:          Optional[bool]

    :param data_column_rename_dict:     Dictionary for the field names present in the input CSV file (keys) to what the
                                        code expects them to be (values).
    :type data_column_rename_dict:      Optional[dict[str]]

    :param expected_datetime_columns:   Expected names of the date time columns in the input CSV file.
    :type expected_datetime_columns:    Optional[list[str]]

    :param hour_field_name:             Field name of the hour field in the input CSV file.
    :type hour_field_name:              Optional[str]

    :param month_field_name:            Field name of the month field in the input CSV file.
    :type month_field_name:             Optional[str]

    :param x_variables:                 Target variable list.
    :type x_variables:                  Optional[list[str]]

    :param add_dayofweek_xvars:         True if the user wishes to add weekday and holiday targets to the x variables.
    :type add_dayofweek_xvars:          Optional[bool]

    :param y_variables:                 Feature variable list.
    :type y_variables:                  Optional[list[str]]

    :param day_list:                    List of day abbreviations and their order.
    :type day_list:                     Optional[list[str]]

    :param seed_value:                  Seed value to reproduce randomization.
    :type seed_value:                   Optional[int]

    :param x_variables_linear:          Target variable list for the linear model.
    :type x_variables_linear:           Optional[list[str]]

    :param y_variables_linear:          Feature variable list for the linear model.
    :type y_variables_linear:           Optional[list[str]]

    :param verbose:                     Choice to see logged outputs.
    :type verbose:                      bool

    :return:                            Prediction data frame
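
As a side note on apply_sine_function, the exact transform applied to the hour and month fields is not shown in this PR. The sketch below uses one common form of cyclic encoding, sin(2 * pi * value / period); both the formula and the helper name sine_encode are assumptions for illustration.

```python
import math


def sine_encode(value, period):
    """Map a cyclic value (e.g. hour of day) onto a sine wave with the given period."""
    return math.sin(2.0 * math.pi * value / period)


# encode a few hours of the day using a 24-hour period
hours = [0, 6, 12, 18]
encoded_hours = [round(sine_encode(h, 24), 6) for h in hours]
print(encoded_hours)
```

With this form, hour 0 and hour 12 encode to the same value, which is why a sine term is often paired with a cosine term when the distinction matters.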