avulanov / scalable-deeplearning

Scalable implementation of artificial neural networks for Spark deep learning
Apache License 2.0

Add Multilayer Perceptron Regression #1

Open JeremyNixon opened 7 years ago

JeremyNixon commented 7 years ago

What changes were proposed in this pull request?

This is a pull request adding support for Multilayer Perceptron Regression, the counterpart to the Multilayer Perceptron Classifier (hereafter MLPR and MLPC).

Outline

  1. Major Changes
  2. API Decisions
  3. Automating Scaling
  4. Features
  5. Reference Resources

Major Changes

There are two major differences between MLPR and MLPC. The first is the use of a linear (identity) activation function and a sum of squared error cost function in the last layer of the network. The second is the requirement to scale the labels to [0, 1] and back, so that it is easy for the weights to fit a value in the proper range.

Linear, Relu, Tanh Activations

In the forward pass the linear activation passes the value from the fully connected layer through to become the network prediction. In weight adjustment during the backward pass its derivative is one. All regression models will use the linear activation in the last layer, and so there is no option (as there is in MLPC) to use another activation function and cost function in the last layer.
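For illustration, here is a minimal sketch of the linear activation; the trait and object names are assumptions, not the package's actual activation API:

```scala
// Illustrative sketch only: the trait name and signatures are assumptions,
// not the package's actual activation-function API.
trait Activation {
  def eval(x: Double): Double        // forward pass
  def derivative(x: Double): Double  // used during the backward pass
}

object LinearActivation extends Activation {
  // Forward pass: the fully connected layer's output passes through unchanged
  // and becomes the network prediction.
  override def eval(x: Double): Double = x

  // Backward pass: d/dx(x) = 1, so the gradient flows through unchanged.
  override def derivative(x: Double): Double = 1.0
}
```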

Relu and Tanh are activation functions that will benefit the accuracy and convergence speed of both MLPC and MLPR. Tanh zero-centers the data passed to neurons, which aids optimization. Relu avoids saturating the gradients.
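Relu and Tanh fit the same illustrative interface sketched above (again, names are illustrative):

```scala
// Illustrative sketch using the Activation trait from the previous snippet.
object ReluActivation extends Activation {
  override def eval(x: Double): Double = math.max(0.0, x)
  // Gradient is 1 for positive inputs and 0 otherwise, so it does not
  // saturate on the positive side.
  override def derivative(x: Double): Double = if (x > 0.0) 1.0 else 0.0
}

object TanhActivation extends Activation {
  // tanh outputs are zero-centered in (-1, 1), which aids optimization.
  override def eval(x: Double): Double = math.tanh(x)
  // d/dx tanh(x) = 1 - tanh(x)^2
  override def derivative(x: Double): Double = {
    val t = math.tanh(x)
    1.0 - t * t
  }
}
```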

Automated Scaling

The data scaling is done through min-max scaling: the minimum label is subtracted from every value (mapping the range to [0, max - min]), and the result is then divided by max - min to get a scale from 0 to 1. The corner case where max - min = 0 is resolved by omitting the division step.
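A minimal sketch of this scaling and its inverse (helper names are illustrative, not the PR's internal code):

```scala
// Min-max scaling of labels to [0, 1], with the max - min = 0 corner case
// handled by omitting the division. Names are illustrative.
def scaleLabels(labels: Array[Double]): (Array[Double], Double, Double) = {
  val min = labels.min
  val max = labels.max
  val range = max - min
  val scaled =
    if (range == 0.0) labels.map(_ - min)       // corner case: skip the division
    else labels.map(y => (y - min) / range)     // maps labels into [0, 1]
  (scaled, min, max)
}

// Predictions are mapped back to the original label range.
def unscalePrediction(p: Double, min: Double, max: Double): Double =
  if (max - min == 0.0) p + min
  else p * (max - min) + min
```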

Motivating Example

[screenshot: motivating example, 2016-09-29]

API Decisions

The API is identical to MLPC with the exception of softmaxOnTop: there is no option for the last-layer activation function or for the cost function (MLPC gives a choice between cross entropy and sum of squared error). This API has the user call MLPR with a set of layers that represents the topology of the network. The number of hidden layers is inferred from the layers parameter and is equal to the total number of layers minus 2. Each hidden layer is a feedforward layer with a sigmoid activation function, and the output layer uses the linear activation.
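A usage sketch of the resulting API; only setStandardizeLabels is confirmed elsewhere in this PR, while setLayers, setMaxIter, the libsvm input path, and the fit/transform workflow are assumed here by analogy with MLPC (the import of the regressor itself is omitted):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mlpr-sketch").getOrCreate()

// Hypothetical dataset with 10 features and a continuous label.
val data = spark.read.format("libsvm").load("data/regression_sample.txt")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

// layers = input size, hidden layer sizes..., output size (1 for regression).
// Hidden layers use sigmoid activations; the output layer uses the linear
// activation with a sum of squared error cost.
val mlpr = new MultilayerPerceptronRegressor()
  .setLayers(Array(10, 16, 8, 1))
  .setMaxIter(200)

val model = mlpr.fit(train)
model.transform(test).show(5)
```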

Input/Output Layer Argument

For MLPR, the output count will usually be 1, and the number of inputs will always be equal to the number of features in the training dataset. One API choice could be to omit the input and output counts, have the user supply only the number of neurons in the hidden layers, and infer the input and output counts from the training data. At the very least, it makes sense to validate the user's layers parameter and display a helpful error message instead of the data-stacker error that currently appears when an improper number of inputs or outputs is provided.
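A sketch of the validation suggested above; the helper and its parameters are illustrative:

```scala
// Fail fast with a clear message instead of an opaque data-stacker error.
def validateLayers(layers: Array[Int], numFeatures: Int): Unit = {
  require(layers.length >= 2,
    s"layers must contain at least an input and an output size, got [${layers.mkString(", ")}]")
  require(layers.head == numFeatures,
    s"First layer size (${layers.head}) must equal the number of features ($numFeatures)")
  require(layers.last == 1,
    s"Last layer size (${layers.last}) should be 1 when regressing a scalar label")
}
```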

Modular API

It would also make sense for the API to be modular. A user will want the flexibility to use the linear layer at different points in the network (in MLPC as well), and will certainly want to be able to use new activation functions (tanh, relu) as they are added to improve the performance of these models. That flexibility allows a user to tune the network to their dataset and will be particularly important for convnets or recurrent nets in the future. For the time being, we should decide on the best way to enable tanh and relu activations in this algorithm and in the classifier.
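A hypothetical sketch of what a modular topology could look like; none of these types exist in the current API:

```scala
// Hypothetical layer specifications: the user composes affine layers with an
// activation of their choice at each point in the network.
sealed trait LayerSpec
case class Affine(in: Int, out: Int) extends LayerSpec
case object Sigmoid extends LayerSpec
case object Relu extends LayerSpec
case object Tanh extends LayerSpec
case object Linear extends LayerSpec

// A regression topology: two hidden layers with relu/tanh, linear output.
val regressionTopology: Seq[LayerSpec] = Seq(
  Affine(10, 32), Relu,
  Affine(32, 16), Tanh,
  Affine(16, 1), Linear
)
```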

Automating Scaling

Current behavior has an argument allowing the user to turn off scaling using MultilayerPerceptronRegressor().setStandardizeLabels(false).

In general I advocate for helpful defaults that can be overridden: scale automatically, but give an option to run without scaling, and skip autoscaling if both the min and max are provided by the user.
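A sketch of that default/override logic; apart from setStandardizeLabels, the parameter names here are assumptions:

```scala
// Returns the (min, max) range to scale with, or None if scaling is disabled.
def resolveScaling(standardizeLabels: Boolean,
                   userMin: Option[Double],
                   userMax: Option[Double],
                   labels: Array[Double]): Option[(Double, Double)] = {
  if (!standardizeLabels) {
    None                                         // user turned scaling off
  } else (userMin, userMax) match {
    case (Some(lo), Some(hi)) => Some((lo, hi))  // user supplied both bounds: no autoscaling
    case _ => Some((labels.min, labels.max))     // default: autoscale from the data
  }
}
```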

Features

There are a few features that have been checked:

  1. Integrates cleanly with pipeline API
  2. Model save/load is enabled

Reference Resources

Christopher M. Bishop. Neural Networks for Pattern Recognition.
Patrick Nicolas. Scala for Machine Learning, Chapter 9.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, Chapter 6.

JeremyNixon commented 7 years ago

I would like to add a contribution guide as well - contributors need to clone the repo and publish locally if they want to iterate on their changes. Specifically, they need to change the version, run either sbt publish-local or sbt publishM2, and then point Spark to the changed version of the package. This PR doesn't include a test suite / example - I plan to add the test suite, and would like to know if you'd prefer the example to come with internal data or not.
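A sketch of that workflow; the version string and the groupId/artifactId placeholders are illustrative:

```scala
// build.sbt: bump the version before publishing locally.
version := "1.0.0-SNAPSHOT"

// Then, from the shell:
//   sbt publish-local   (publishes to the local Ivy repository)
//   sbt publishM2       (publishes to the local Maven repository)
// and point Spark at the changed version, e.g.:
//   spark-shell --packages <groupId>:<artifactId>:1.0.0-SNAPSHOT
```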