HIPS / autograd

Efficiently computes derivatives of numpy code.

Doesn't work with Pandas datatypes? #469

Open tabidots opened 5 years ago

tabidots commented 5 years ago

I am trying to implement linear regression from scratch (both in terms of code and math knowledge) for learning purposes.

I went from loops to NumPy, then from computing the gradient by hand to learning about dual numbers and now using Autograd. I am trying to convert my code to use Pandas instead of NumPy, but I am having trouble getting Autograd to work with DataFrames or Series, because I keep getting this error:

TypeError: Can't differentiate w.r.t. type <class 'pandas.core.series.Series'>
# or DataFrame
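
For what it's worth, the error doesn't seem to be specific to my cost function. Here's a minimal repro with a toy function (np.square(w).sum() is just a stand-in for the real cost):

import autograd.numpy as np
import pandas as pd
from autograd import grad

f = lambda w: np.square(w).sum()

grad(f)(np.array([1.0, 2.0]))   # works: array([2., 4.])
grad(f)(pd.Series([1.0, 2.0]))  # TypeError: Can't differentiate w.r.t. type <class 'pandas.core.series.Series'>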

Here are the relevant parts of the code:

import autograd.numpy as np
import pandas as pd
from autograd import grad

raw_data = pd.read_csv('FiveCitiePMData/BeijingPM20100101_20151231.csv')
df = raw_data.filter(['HUMI','TEMP', 'PM_US Post'], axis=1).dropna()

class LinRegModel:
    def __init__(self, dataset):
        self.X = dataset.iloc[:,:-1].copy()  # all but last column
        self.X.insert(0, 'dummy', 1)         # padding (w_0)
        self.y = dataset.iloc[:,-1].copy()   # last column

    def __cost__(self, weights):
        errors = self.X @ weights - self.y
        return np.square(errors).mean() # no need to div by 2 because autodiff

    def train(self, epsilon=0.001, learning_rate=0.01):
        self.weights = pd.Series(0.0, index=list(self.X)) # np.zeros((self.X.shape[1], 1))
        grad_cost = grad(self.__cost__)
        last_cost = 0
        while True:
            self.weights -= learning_rate * grad_cost(self.weights)
            this_cost = self.__cost__(self.weights)
            if abs(this_cost - last_cost) < epsilon:
                break
            last_cost = this_cost

lrm = LinRegModel(df)

The __cost__(weights) function works fine whether weights is a NumPy array, a Pandas Series, or a Pandas DataFrame. But this call gives me the error above:

grad_cost(self.weights)

If I do this:

grad_cost(np.array(self.weights))

Then it hangs and eventually I get this:

TypeError: Could not convert Autograd ArrayBox with value 523236993.0 to numeric

The only way I could get it working was by doing this:

def __cost__(self, weights):
    errors = np.array(self.X) @ np.array(weights) - np.array(self.y)
    return np.square(errors).mean()

grad_cost(np.array(self.weights))

But that's not a very elegant or readable solution.

What am I doing wrong?

neonwatty commented 5 years ago

First off - this looks great! Keep going!

In terms of your errors - I think your issues are:

  1. Using Pandas DataFrames/Series for the data and/or weights - in my experience everything needs to be either a list or an autograd.numpy array. Autograd differentiates by tracing your cost function with ArrayBox wrappers, and Pandas operations try to coerce those boxes back to plain numeric values, which is where the "Could not convert Autograd ArrayBox" error comes from. I would suggest converting data loaded in / processed in Pandas into an autograd.numpy array before passing it into LinRegModel (see the sketch after this list).

  2. A shape issue - making y two-dimensional instead of one-dimensional, so that errors = self.X @ weights - self.y broadcasts correctly.
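
For the first point, a minimal sketch of the conversion - assuming the df from your snippet (DataFrame.to_numpy() just pulls out the underlying array):

# pull a plain float array out of the DataFrame before autograd ever sees it
data = df.to_numpy(dtype=float)
lrm = LinRegModel(data)  # LinRegModel as rewritten below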

Below is a version of your code with some small changes addressing the issues above (my changes are marked with ### comments ###). The dataset input to __init__ is an autograd.numpy array.

import autograd.numpy as np
import pandas as pd
from autograd import grad

class LinRegModel:
    def __init__(self, dataset):
        self.X = dataset[:,:-1]

        ### padding (w_0) - copying entire array, not efficient but works ###
        o = np.ones((np.shape(self.X)[0],1))
        self.X = np.hstack((o,self.X))
        self.y = dataset[:,-1][:,np.newaxis]  # added second dimension

        ### initialize gradient ###
        self.init_grad()

    ### a new func to initialize gradient ###
    def init_grad(self):
        ### gave ownership of grad func to class ###
        self.grad_cost = grad(self.__cost__)

        ### moved weight init to here ###
        self.weights = np.zeros((self.X.shape[1], 1))

    def __cost__(self, weights):
        errors = self.X @ weights - self.y
        return np.square(errors).mean() # no need to div by 2 because autodiff

    def train(self, epsilon=0.001, learning_rate=0.01):
        ### added a container to store cost at each step ###
        last_cost = 0
        cost_history = [last_cost]
        while True:
            self.weights -= learning_rate * self.grad_cost(self.weights)
            this_cost = self.__cost__(self.weights)
            if abs(this_cost - last_cost) < epsilon:
                break
            last_cost = this_cost

            ### update history ###
            cost_history.append(last_cost)

        ### return cost history for analysis ###
        return cost_history

I tested this with the following toy dataset and everything works!

X = np.random.randn(100,2)
w_0 = 0
w_1 = 1
w_2 = 1
y = w_0 + w_1*X[:,0] + w_2*X[:,1]
y = y[:,np.newaxis]
dataset = np.hstack((X,y))
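
# construct the model from the toy dataset (assumed step, so that lrm exists below)
lrm = LinRegModel(dataset)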

That is, the gradient call works:

lrm.grad_cost(lrm.weights)

and training via gradient descent works too:

cost_history = lrm.train()

Hope that helps!