HIPS / autograd

Efficiently computes derivatives of numpy code.

Doesn't work with Pandas datatypes? #469

Open tabidots opened 5 years ago

tabidots commented 5 years ago

I am trying to implement linear regression from scratch (both in terms of code and math knowledge) for learning purposes.

I went from loops to NumPy, then from computing the gradient by hand to learning about dual numbers and now using Autograd. I am trying to convert my code to use Pandas instead of NumPy, but I am having trouble getting Autograd to work with DataFrames or Series, because I keep getting this error:

TypeError: Can't differentiate w.r.t. type <class 'pandas.core.series.Series'>
# or DataFrame
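
For what it's worth, the error doesn't seem to be specific to my cost function. Here's a minimal repro with a toy function (np.square(w).sum() is just a stand-in for the real cost):

import autograd.numpy as np
import pandas as pd
from autograd import grad

f = lambda w: np.square(w).sum()

grad(f)(np.array([1.0, 2.0]))   # works: array([2., 4.])
grad(f)(pd.Series([1.0, 2.0]))  # TypeError: Can't differentiate w.r.t. type <class 'pandas.core.series.Series'>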

Here are the relevant parts of the code:

import autograd.numpy as np
import pandas as pd
from autograd import grad

raw_data = pd.read_csv('FiveCitiePMData/BeijingPM20100101_20151231.csv')
df = raw_data.filter(['HUMI','TEMP', 'PM_US Post'], axis=1).dropna()

class LinRegModel:
    def __init__(self, dataset):
        self.X = dataset.iloc[:,:-1].copy()  # all but last column
        self.X.insert(0, 'dummy', 1)         # padding (w_0)
        self.y = dataset.iloc[:,-1].copy()   # last column

    def __cost__(self, weights):
        errors = self.X @ weights - self.y
        return np.square(errors).mean() # no need to div by 2 because autodiff

    def train(self, epsilon=0.001, learning_rate=0.01):
        self.weights = pd.Series(0.0, index=list(self.X)) # np.zeros((self.X.shape[1], 1))
        grad_cost = grad(self.__cost__)
        last_cost = 0
        while True:
            self.weights -= learning_rate * grad_cost(self.weights)
            this_cost = self.__cost__(self.weights)
            if abs(this_cost - last_cost) < epsilon:
                break
            last_cost = this_cost

lrm = LinRegModel(df)

The __cost__(weights) function works fine whether weights is a NumPy array, a Pandas Series, or a Pandas DataFrame. But this call gives me the error above:

grad_cost(self.weights)

If I do this:

grad_cost(np.array(self.weights))

Then it hangs and eventually I get this:

TypeError: Could not convert Autograd ArrayBox with value 523236993.0 to numeric

The only way I could get it working was by doing this:

def __cost__(self, weights):
    errors = np.array(self.X) @ np.array(weights) - np.array(self.y)
    return np.square(errors).mean()

grad_cost(np.array(self.weights))

But that's not a very elegant or readable solution.

What am I doing wrong?

neonwatty commented 5 years ago

First off - this looks great! Keep going!

In terms of your errors - I think your issues are:

  1. Using Pandas DataFrames/Series for the data and/or weights - in my experience everything needs to be either a list or an autograd.numpy array. Autograd differentiates by tracing your cost function with ArrayBox wrappers, and Pandas operations try to coerce those boxes back to plain numeric values, which is where the "Could not convert Autograd ArrayBox" error comes from. I would suggest converting data loaded in / processed in Pandas into an autograd.numpy array before passing it into LinRegModel (see the sketch after this list).

  2. A shape issue - making y two-dimensional instead of one-dimensional, so that errors = self.X @ weights - self.y broadcasts correctly.
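
For the first point, a minimal sketch of the conversion - assuming the df from your snippet (DataFrame.to_numpy() just pulls out the underlying array):

# pull a plain float array out of the DataFrame before autograd ever sees it
data = df.to_numpy(dtype=float)
lrm = LinRegModel(data)  # LinRegModel as rewritten below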

Below is a version of your code with some small changes addressing the issues above (my changes are marked with ### comments ###). The dataset input to __init__ is an autograd.numpy array.

import autograd.numpy as np
import pandas as pd
from autograd import grad

class LinRegModel:
    def __init__(self, dataset):
        self.X = dataset[:,:-1]

        ### padding (w_0) - copying entire array, not efficient but works ###
        o = np.ones((np.shape(self.X)[0],1))
        self.X = np.hstack((o,self.X))
        self.y = dataset[:,-1][:,np.newaxis]  # added second dimension

        ### initialize gradient ###
        self.init_grad()

    ### a new func to initialize gradient ###
    def init_grad(self):
        ### gave ownership of grad func to class ###
        self.grad_cost = grad(self.__cost__)

        ### moved weight init to here ###
        self.weights = np.zeros((self.X.shape[1], 1))

    def __cost__(self, weights):
        errors = self.X @ weights - self.y
        return np.square(errors).mean() # no need to div by 2 because autodiff

    def train(self, epsilon=0.001, learning_rate=0.01):
        ### added a container to store cost at each step ###
        last_cost = 0
        cost_history = [last_cost]
        while True:
            self.weights -= learning_rate * self.grad_cost(self.weights)
            this_cost = self.__cost__(self.weights)
            if abs(this_cost - last_cost) < epsilon:
                break
            last_cost = this_cost

            ### update history ###
            cost_history.append(last_cost)

        ### return cost history for analysis ###
        return cost_history

I tested this with the following toy dataset and everything works!

X = np.random.randn(100,2)
w_0 = 0
w_1 = 1
w_2 = 1
y = w_0 + w_1*X[:,0] + w_2*X[:,1]
y = y[:,np.newaxis]
dataset = np.hstack((X,y))
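
# construct the model from the toy dataset (assumed step, so that lrm exists below)
lrm = LinRegModel(dataset)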

That is, the gradient call works:

lrm.grad_cost(lrm.weights)

and training via gradient descent works too:

cost_history = lrm.train()

Hope that helps!