benmiroglio / pymatch

MIT License
272 stars 128 forks

Error: Perfect separation detected, results not available #29

Open wiekern opened 4 years ago

wiekern commented 4 years ago

Hi, I hit the error described in the title when calling fit_scores(). My data structure is shown in the attached image.

I drew 2,000 samples for test and 20,000 for control to fit the matcher, but I have no clue why this error occurs, even after looking into the source code. The example code for loan.csv ran successfully, so I wonder whether the fields of the data should be integers rather than strings. That said, the loan example data contains string fields as well (see the attached image).

Hope anyone can help, thanks!

mark-mediware commented 4 years ago

@wiekern Not sure if it helps you, but I had similar errors and was pretty stuck. After some basic data analysis, I realized I had a few input variables whose values were distributed very unevenly across groups (e.g., a binary age bin with 10,000 rows equal to 0 and 5 rows equal to 1). After removing these variables/features, I had no errors.

Again, not sure if that applies to you, but that was my (embarrassing) issue.
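
For anyone who wants to run the same kind of check, here is a minimal sketch of it in pandas. It assumes the data is in a DataFrame with a binary group column; the function name `sparse_levels`, the column names, and the thresholds are all made up for illustration and are not part of pymatch.

```python
import pandas as pd

def sparse_levels(df, group_col, min_count=10, max_levels=20):
    """Flag levels of low-cardinality columns that have fewer than
    min_count rows in at least one group (hypothetical helper)."""
    flagged = []
    for col in df.columns:
        # Skip the group label itself and high-cardinality columns.
        if col == group_col or df[col].nunique() > max_levels:
            continue
        # Rows: levels of the column; columns: groups; cells: row counts.
        table = pd.crosstab(df[col], df[group_col])
        for level, row in table.iterrows():
            for group, count in row.items():
                if count < min_count:
                    flagged.append((col, level, group, int(count)))
    return pd.DataFrame(flagged, columns=["column", "level", "group", "count"])

# Toy data: age_bin is almost always 0, and is never 1 in the control group.
toy = pd.DataFrame({
    "treated": [0] * 100 + [1] * 100,
    "age_bin": [0] * 195 + [1] * 5,
    "balanced": [0, 1] * 100,
})
report = sparse_levels(toy, "treated")
print(report)
```

Columns that show up in the report are candidates for dropping or merging with a neighbouring bin before calling fit_scores(). Whether that is the right fix depends on the data.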

wiekern commented 4 years ago

Thanks for your answer! I don't think the distribution is the problem in my case. I am wondering whether the regression model supports string input, such as the "text" column in my data. I think I may have to convert the text into numeric values or word embeddings (vectors).
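
A note on string columns, with a small sketch. Formula-based models (patsy, which the statsmodels formula API uses) treat a string column as categorical and expand it into one indicator column per distinct value. A short categorical field is fine, but a free-text column where nearly every row is unique yields about one indicator per row, which can by itself let the model separate the groups perfectly. The example below uses only pandas to show the difference; the DataFrame and its column names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "treated": [1, 0, 0, 1],
    "grade": ["A", "B", "A", "B"],  # low-cardinality string: fine to encode
    "text": ["first note", "second note", "third note", "fourth note"],
})

# One-hot encode the categorical column and leave the free-text column out.
features = pd.get_dummies(
    df.drop(columns=["treated", "text"]), drop_first=True, dtype=int
)

# For comparison: encoding the free-text column gives one column per row.
n_text_dummies = pd.get_dummies(df["text"]).shape[1]

print(features.columns.tolist())
print(n_text_dummies, "indicator columns for", len(df), "rows")
```

If the text carries real signal, it would need to be turned into a small number of numeric features first (counts, categories, or embeddings, as suggested above) rather than passed in raw.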

umangdadhaniya commented 3 years ago

model = sm.logit('Result ~ Year + Amount_Spent + Popularity_Rank', data = train_data).fit()
Traceback (most recent call last):

  File "", line 1, in
    model = sm.logit('Result ~ Year + Amount_Spent + Popularity_Rank', data = train_data).fit()

  File "C:\Users\UMANG\anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py", line 1963, in fit
    bnryfit = super().fit(start_params=start_params,

  File "C:\Users\UMANG\anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py", line 227, in fit
    mlefit = super().fit(start_params=start_params,

  File "C:\Users\UMANG\anaconda3\lib\site-packages\statsmodels\base\model.py", line 519, in fit
    xopt, retvals, optim_settings = optimizer._fit(f, score, start_params,

  File "C:\Users\UMANG\anaconda3\lib\site-packages\statsmodels\base\optimizer.py", line 215, in _fit
    xopt, retvals = func(objective, gradient, start_params, fargs, kwargs,

  File "C:\Users\UMANG\anaconda3\lib\site-packages\statsmodels\base\optimizer.py", line 327, in _fit_newton
    callback(newparams)

  File "C:\Users\UMANG\anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py", line 211, in _check_perfect_pred
    raise PerfectSeparationError(msg)

PerfectSeparationError: Perfect separation detected, results not available
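
One quick way to narrow down a traceback like this is to check whether any single predictor already splits the outcome cleanly, since that is enough to cause perfect separation in a logistic regression. The sketch below does this with pandas. The helper name `separating_columns` and the toy values are hypothetical; only the column names are taken from the formula above. A combination of predictors can also separate the classes with no single column doing so, so an empty result does not rule the problem out.

```python
import pandas as pd

def separating_columns(df, outcome, predictors):
    """Return numeric predictors whose value ranges in the two outcome
    classes (0 and 1) do not overlap at all (hypothetical helper)."""
    pos = df[df[outcome] == 1]
    neg = df[df[outcome] == 0]
    found = []
    for col in predictors:
        # Non-overlapping ranges mean a single threshold on this column
        # predicts the outcome perfectly.
        if pos[col].min() > neg[col].max() or pos[col].max() < neg[col].min():
            found.append(col)
    return found

# Made-up data: Popularity_Rank alone splits Result cleanly.
toy = pd.DataFrame({
    "Result": [1, 1, 1, 0, 0, 0],
    "Year": [2001, 2005, 2003, 2002, 2004, 2006],
    "Amount_Spent": [10, 30, 20, 15, 25, 35],
    "Popularity_Rank": [1, 2, 3, 7, 8, 9],
})
culprits = separating_columns(
    toy, "Result", ["Year", "Amount_Spent", "Popularity_Rank"]
)
print(culprits)
```

Any column returned here is worth inspecting, for example to see whether it is derived from the outcome, before deciding whether to drop it, bin it, or collect more data.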
