dmsul / econtools

Econometrics and data manipulation functions.
http://www.danielmsullivan.com/econtools
BSD 3-Clause "New" or "Revised" License

Feature Request: Automatic Co-linear Variable Dropping? #2

Closed jphutch closed 4 years ago

jphutch commented 5 years ago

Hi Dan,

Wanted to see if you have any feedback on this as a potential feature for the model functions. When including discrete variables as factors, both R and Stata have a nice way to handle it within the formula: for some discrete variable X, in R it's factor(X) and in Stata it's i.X. The nice thing about this is that 1) it saves time, since with the current "reg" function in econtools I have to consciously leave one level out if the model has a constant term, and 2) for certain subsets of the data, different variables will be collinear, and R and Stata will automatically drop one. This has caused me some grief when using the current function, since it's then hard to identify which of the levels is causing the problem.

What do you think, is this worth the time to try to implement? If so, what in your opinion is a good way of programming this safeguard in? I will start futzing around with my clone, but I'm interested in where in the architecture you think it fits best.

Thanks for all you do!

dmsul commented 5 years ago

The automatic detection (and optional dropping) of co-linear variables (like Stata does) is definitely on the to-do list, though I haven't really looked into how exactly it would be implemented. For example, a simple rank test is not enough if you want to identify which variables are co-linear.

If you have any ideas about how to implement it, I would love to hear them. My only stipulation is that auto-dropping should not be the default; the default should be a warning or an error. Including co-linear variables is in essence an error and should be treated as such. The behavior of just plowing through and estimating the model anyway is convenient, but it is bad programming practice. It should also be possible to turn off the check, since such a check is going to be relatively expensive.

Regarding the formula, Statsmodels does this through patsy, and if I were to do it here, it would be done the same way. However, you can also just use patsy directly, so I am not very keen on incorporating it into Econtools.
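For illustration, a minimal sketch of using patsy directly (a standalone toy example, not econtools code; the column names here are made up). patsy expands a categorical term into dummies and drops a reference level whenever the model has an intercept, which is the R/Stata behavior described above:

```python
# Hypothetical standalone example: patsy builds the design matrix from an
# R-style formula, dropping one dummy level when an intercept is present.
import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({
    'y':  [1.0, 2.0, 3.0, 4.0],
    'x1': [0.1, 0.2, 0.3, 0.4],
    'x2': [0, 1, 0, 1],
})

# C(x2) marks x2 as categorical, like factor(x2) in R or i.x2 in Stata
y, X = dmatrices('y ~ x1 + C(x2)', data=df, return_type='dataframe')
print(list(X.columns))
```

The resulting X has an intercept, x1, and a single dummy for x2 (the other level is absorbed by the intercept), so the matrix stays full rank.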

jphutch commented 5 years ago

The syntax is not as important, really; I don't mind specifying the factors. But perhaps it should at least tell you which column is giving you grief, so you can decide what to do with that information (detection, as you put it). That is at least weakly more helpful than just getting a singular matrix error. I agree with your philosophy that the default behavior should not be dropping columns, but in some cases it might be appropriate, and it could be included as an option as long as you are aware it is happening.

This example works, and basically just builds up the rank of the matrix little by little, alerting you to the columns that don't change the rank.

import numpy as np
import pandas as pd
from numpy.linalg import matrix_rank  # was missing from the original snippet
from scipy.stats import uniform, norm, bernoulli
from econtools.metrics import reg

N = 100
_cons = np.ones(N)
x1 = uniform().rvs(N)
x2 = bernoulli(.5).rvs(N)
e = norm().rvs(N)

beta = np.array([10, 5, 5])
X = np.stack([_cons, x1, x2], axis=1)
y = X @ beta + e

# This is fine
data = pd.DataFrame(X, columns=['cons', 'x1', 'x2'])
data['y'] = y
reg(data, 'y', ['cons', 'x1', 'x2']).summary

# This throws an error: 'cons' = 'x20' + 'x21', so the matrix is singular
dummies = pd.get_dummies(data['x2'])
dummies.columns = ['x20', 'x21']
data = pd.concat([data, dummies], axis=1)
reg(data, 'y', ['cons', 'x20', 'x1', 'x21'])

# This tells you that the 4th column does not add to the rank and is
# therefore linearly dependent on the columns before it
X = data[['cons', 'x21', 'x1', 'x20']].values
XtX = X.T @ X
K = XtX.shape[0]
rank_changes = [1]
for j in range(2, K + 1):
    rank_changes += [matrix_rank(XtX[:j, :j]) - matrix_rank(XtX[:j - 1, :j - 1])]

print(rank_changes)  # a 0 marks a column that adds nothing to the rank

BUT it uses matrix_rank from numpy, and calling it in a loop like this might be very expensive. I do like that it returns a list, though, since then you could easily have it return all the column names that caused the issue.
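If the repeated matrix_rank calls prove too slow, one cheaper possibility (a sketch, not tested against econtools; the function name and tolerance are made up) is a single pivoted QR decomposition, which flags the redundant columns in one pass:

```python
# Sketch: detect linearly dependent columns with one pivoted QR decomposition
# instead of K calls to matrix_rank.
import numpy as np
from scipy.linalg import qr

def collinear_columns(X, names, tol=1e-10):
    # QR with column pivoting processes columns in order of decreasing
    # remaining norm; near-zero diagonal entries of R flag columns that are
    # (numerically) linear combinations of earlier ones.
    _, R, piv = qr(X, mode='economic', pivoting=True)
    diag = np.abs(np.diag(R))
    redundant = diag < tol * diag.max()
    # piv[i] is the original index of the column in pivoted position i
    return [names[j] for j, bad in zip(piv, redundant) if bad]

# Reproduce the dummy-trap example: cons = x20 + x21
rng = np.random.default_rng(0)
x1 = rng.random(100)
x20 = rng.integers(0, 2, 100)
X = np.column_stack([np.ones(100), x1, x20, 1 - x20])
print(collinear_columns(X, ['cons', 'x1', 'x20', 'x21']))
```

This reports exactly one of the three mutually dependent columns (which one depends on the pivoting order), which is enough to tell the user where the problem is.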

jphutch commented 5 years ago

For such a check, would it go in "fitguts" within core.py or somewhere else in the RegBase class?

dmsul commented 5 years ago

I would add a method to RegBase and have it called in main after set_sample.

dmsul commented 5 years ago

Hi @jhutchinswisc, just wanted to follow up on this. If you don't have the bandwidth in the short term to tackle this, I might take what you've done so far and finish it off. Is that cool?

jphutch commented 5 years ago

Hi, thanks for following up. I did a little work on this a few months ago and then it just slipped my mind. The function I wrote checks the rank one column at a time and then just prints out the variable names that don't add to the rank. Please finish it off if you're up to it; I think I just needed a way to test its robustness against random cases but couldn't think of one at the time.

dmsul commented 5 years ago

No worries, I really appreciate the work you've done. I'll go ahead with it.