ddbourgin / numpy-ml

Machine learning, in numpy
https://numpy-ml.readthedocs.io/
GNU General Public License v3.0
15.26k stars 3.7k forks source link

Feature Request: Online Linear Regression #70

Closed kenluck2001 closed 3 years ago

kenluck2001 commented 3 years ago

A lot of linear regression operate in static mode which means that we have to retrain to get the parameters after getting a new data point. This limits the functionality of the linear regression and make working in real-time difficult. I will implement this online formulation using ideas from the paper attached to this issue. This is based on recursive least square formulation. I have implemented in one of my past project. However, if I get approvals, then I will clean it up and create a new PR

rls.pdf

ddbourgin commented 3 years ago

That would be great @kenluck2001 - a welcome addition! How difficult would it be to add this to the existing LinearRegression object rather than create a new one? We could choose the fit algorithm using an arg at initialization: method={'online', 'batch'} and introduce a new method: update_fit(self, X, y) that can be used if method == 'online'.

kenluck2001 commented 3 years ago

I think mode={'online', 'batch'} would be more descriptive in relation to existing source code. Currently working on it

kenluck2001 commented 3 years ago

I have seen a better way that does not touch or impart any exist logic in LinearRegression class in lm.py file.

The API that I have in mind woud not distinguish between online and batch mode. It would just update beta field if a new method named update is call with new data of x, y.

At the beginning, we call

olr = LinearRegression()
olr.fit(X, y)
olr.predict(rest-x)
# on new data just call update to modify beta field and update the model in an online manner
olr.update(x_new, y_new)

Here is the pseudocode

    def __init__(self):
        self.beta = None
        self.cov = None

    def update(self, x, y):
        xm = np.mat (x)
        ym = np.mat (y)

        if self.fit_intercept:
            xm = np.c_[np.ones(xm.shape[0]), xm]

        if not self.beta or not self.cov: # first run of the algorithm
            pseudo_inverse = np.linalg.inv(xm.T @ xm) @ xm.T
            self.beta = np.dot(pseudo_inverse, ym)
            self.cov = np.dot ( xm.T, xm)
        else:
            theta_xm = np.mat ( xm * self.beta )
            error = ym - ( theta_xm )
            Im = np.mat ( np.eye (x.shape[1]) )
            self.cov = self.cov * np.mat(Im - (( xm.T * xm * self.cov ) / (1 + ( xm * self.cov * xm.T ))))
            self.beta = self.beta + ( self.cov * xm.T *  error )   
kenluck2001 commented 3 years ago

The final version will look this:

import numpy as np
from sklearn import datasets
from sklearn.datasets.samples_generator import make_blobs

class LinearRegression:
    def __init__(self, fit_intercept=True):
        self.beta = None
        self.inverse_cov = None
        self.fit_intercept = fit_intercept

    def update(self, x, y):
        xm = np.mat (x)
        ym = np.mat (y)

        if self.fit_intercept:
            xm = np.c_[np.ones(xm.shape[0]), xm]

        all_zeros_or_empty_inverse_cov = not np.any(self.inverse_cov)
        all_zeros_or_empty_beta = not np.any(self.beta)

        #if not self.beta or not self.inverse_cov: # first run of the algorithm
        if all_zeros_or_empty_inverse_cov or all_zeros_or_empty_beta:
            # Allow a batch of data in a matrix form as input
            self.inverse_cov = np.linalg.pinv(np.dot(xm.T, xm)) 
            pseudo_inverse =  np.dot(self.inverse_cov, xm.T)
            self.beta = np.dot(pseudo_inverse, ym.T) 
            print ("self.beta: {}".format(self.beta.shape))
        else:
            # Allow only single row vectors or column vectors as input
            theta_xm = np.mat ( xm * self.beta )
            error = ym - ( theta_xm )
            Im = np.mat ( np.eye (xm.shape[1]) )
            print ("Im: {}, self.inverse_cov: {}, xm: {}, self.beta: {}".format(Im.shape, self.inverse_cov.shape, xm.shape, self.beta.shape))
            #Im: (301, 301), self.inverse_cov: (301, 301), xm: (3000, 301)
            self.inverse_cov = self.inverse_cov * np.mat(Im - (( xm.T * xm * self.inverse_cov ) / (1 + ( xm * self.inverse_cov * xm.T ))))
            self.beta = self.beta + ( self.inverse_cov * xm.T * error )   

# Testing
seed = 12345
np.random.seed(seed)
n_clusters=4
# loading the dataset
orig_num_of_samples, orig_num_of_features = 3000, 300
X, y_true = make_blobs(n_samples=orig_num_of_samples, centers=n_clusters, n_features = orig_num_of_features, cluster_std=0.50, random_state=seed)

olr = LinearRegression()
olr.update(X, y_true)
y_new = 10
x_new = np.random.rand(1, orig_num_of_features)
olr.update(x_new, y_new)

y_new = 10
x_new = np.random.rand(1, orig_num_of_features)
olr.update(x_new, y_new)
kenluck2001 commented 3 years ago

My plan for testing will look this

import numpy as np
from sklearn import datasets
from sklearn.datasets.samples_generator import make_blobs
from sklearn.linear_model import LinearRegression as OriginalLinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

seed = 12345
np.random.seed(seed)
n_clusters=4
# loading the dataset
orig_num_of_samples, orig_num_of_features = 3000, 300
X, y = make_blobs(n_samples=orig_num_of_samples, centers=n_clusters, n_features = orig_num_of_features, cluster_std=0.50, random_state=seed)

X_train, X_update, y_train, y_update = train_test_split(X, y, test_size=0.2, random_state=seed)

# Ground truth
orig_lreg_model = OriginalLinearRegression()
orig_lreg_model.fit(X, y)

# Our model
olr = LinearRegression()
#olr.fit(X_train, y_train)

## update the model
for x_new, y_new in zip(X_update, y_update):
    x_new = x_new.reshape((1, orig_num_of_features))
    y_new = y_new
    olr.update(x_new, y_new)

# Evaluating
X_test, y_test = make_blobs(n_samples=orig_num_of_samples, centers=n_clusters, n_features = orig_num_of_features, cluster_std=0.50, random_state=seed+2)

y_pred_orig = orig_lreg_model.predict (X_test)
y_pred_online = olr.predict (X_test)

r2score_orig = r2_score(y_test, y_pred_orig) 
r2score_online = r2_score(y_test, y_pred_online) 

print ("Both score must be similar between scikit: {} and our own online implementation: {}".format(r2score_orig, r2score_online))
kenluck2001 commented 3 years ago

I will update the PR later

ddbourgin commented 3 years ago

This is closed via #72, but I think it would be great to generalize this to multiple examples. New issue here: #73