BYU-PRISM / GEKKO

GEKKO Python for Machine Learning and Dynamic Optimization
https://machinelearning.byu.edu

Increasing the number of independent variables for bspline and cspline #163

Open Florent-H opened 1 year ago

Florent-H commented 1 year ago

The bspline method of the GEKKO class supports only two independent variables, and the cspline method only one. I am requesting that the number of independent variables these methods accept be increased to an arbitrary number (though perhaps capped, e.g., at 1000, to remain within reason). This increase would make the methods much more useful in many machine learning contexts.

Thank you for considering this request. I am not a developer, but I can try to contribute to a pull request if/when this feature becomes a priority for GEKKO developers.

APMonitor commented 1 year ago

Thanks for the feature request. Could you point to example code that uses 3+ independent variables? The scipy.interpolate functions support 2 independent variables. It is helpful to have alternative code as a benchmark for the development.
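As one possible benchmark with 3+ independent variables (my suggestion, not something named above), scipy.interpolate.RegularGridInterpolator accepts an arbitrary number of gridded independent variables; a minimal sketch:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# 3 independent variables defined on a regular grid
x0 = np.linspace(0, 1, 11)
x1 = np.linspace(0, 2, 21)
x2 = np.linspace(0, 3, 31)
X0, X1, X2 = np.meshgrid(x0, x1, x2, indexing="ij")
values = np.sin(X0) * np.cos(X1) + X2

# cubic spline interpolation in all 3 dimensions (requires scipy >= 1.9)
interp = RegularGridInterpolator((x0, x1, x2), values, method="cubic")
print(interp(np.array([[0.5, 1.0, 1.5]])))
```

The query point here lies on the grid, so the interpolant reproduces the data value exactly; off-grid points are evaluated with the same call.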

Florent-H commented 1 year ago

B-splines

The BSplines class from statsmodels creates B-spline basis functions using Cox-de Boor recursion (page 4), where the independent variable x can be 2D.
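To illustrate the recursion itself, here is a minimal pure-NumPy sketch of the Cox-de Boor formula (the knot vector and degree are my own choices for illustration):

```python
import numpy as np

def cox_de_boor(x, k, i, t):
    """Evaluate the B-spline basis function B_{i,k}(x) on knot vector t
    via the Cox-de Boor recursion."""
    if k == 0:
        return np.where((t[i] <= x) & (x < t[i + 1]), 1.0, 0.0)
    # each term guards against zero-width knot spans (0/0 is taken as 0)
    d1 = t[i + k] - t[i]
    left = 0.0 if d1 == 0 else (x - t[i]) / d1 * cox_de_boor(x, k - 1, i, t)
    d2 = t[i + k + 1] - t[i + 1]
    right = 0.0 if d2 == 0 else (t[i + k + 1] - x) / d2 * cox_de_boor(x, k - 1, i + 1, t)
    return left + right

# degree-2 basis functions on a uniform knot vector 0..6
t = np.arange(7.0)
x = np.linspace(0.0, 6.0, 601)
B = np.array([cox_de_boor(x, 2, i, t) for i in range(len(t) - 3)])
```

On the interior span [2, 4) these degree-2 bases sum to 1 (partition of unity), which is the property that makes them a convenient regression basis.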

Data generation for all examples in this post

import numpy as np
import matplotlib.pyplot as plt
n_pts = 500
# scaling factor for noise in data (std of Gaussian distribution)
noise_scl = 0.1
# create non-linear response
y = np.sin(np.arange(n_pts) / 150) * np.cos(
    np.arange(n_pts) / 75
) + np.random.normal(loc=0, scale=noise_scl, size=n_pts)
# create independent variables
x0 = np.cos(np.arange(n_pts) / 75)
x1 = np.cos(np.arange(n_pts) / 150)
x2 = np.sin(np.arange(n_pts) / 150)
X = np.vstack((x0, x1, x2)).T
# plot data
plt.scatter(x0, y)
plt.show()
plt.scatter(x1, y)
plt.show()
plt.scatter(x2, y)
plt.show()

statsmodels

from statsmodels.gam.api import GLMGam, BSplines
# get B-spline basis functions
# df and degree are per-column lists (one entry per independent variable)
bs = BSplines(
    X,
    df=[6] * X.shape[1],
    degree=[3] * X.shape[1],
    include_intercept=True,
)
# get generalized additive model and apply basis functions
gam_bs = GLMGam(y, smoother=bs)
# fit B-spline regression
res = gam_bs.fit()

It might also be interesting to implement smoothing B-splines to avoid overfitting. P-splines are B-splines smoothed with a difference penalty applied to the regression coefficients of neighbouring B-splines. pyGAM can fit more than three independent variables using P-splines built with de Boor recursion.

pyGAM

from pygam import LinearGAM
# create generalized additive model for P-spline fitting
# (pyGAM fits cubic splines by default: spline_order=3)
gam = LinearGAM(n_splines=6)
gam.fit(X, y)

Natural cubic splines

patsy can build natural cubic regression spline bases for more than three independent variables (the fit itself is done by a separate regression, as below). Cardinal spline bases are used (see page 20 of this resource).

patsy

from patsy.highlevel import dmatrix
from sklearn.linear_model import LinearRegression
# get natural cubic spline bases
bases = dmatrix(
    "cr(x0, df=6) + cr(x1, df=6) + cr(x2, df=6)",
    {"x0": X[:, 0], "x1": X[:, 1], "x2": X[:, 2]}, return_type="dataframe",
)
# remove intercept from dataframe (so that the attributes of sklearn's fitted estimator make more sense)
bases = bases.drop("Intercept", axis=1)
# fit expanded bases to response
nat_cub_spl = LinearRegression().fit(bases, y)

Smoothing natural cubic splines can be implemented using the R package mgcv. smooth.construct.cr.smooth.spec fits smooth natural cubic regression splines using a wiggliness penalty (page 246 of the 2023-07-11 documentation). Cardinal spline bases are used.

mgcv

library(mgcv)
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2)
b <- gam(y ~ s(x0, bs="cr") + s(x1, bs="cr") + s(x2, bs="cr"), family=gaussian, data=dat, method="REML")

Tensor product smooths (page 303 of the 2023-07-11 documentation) can also be used for multiple independent variables; however, I haven't spent enough time learning about them to understand what they actually do.