chendaniely / pandas_for_everyone

Repository to accompany "Pandas for Everyone"
http://a.co/d/c270uul
MIT License

Logistic regression with sklearn fails in section 13.2.2 #18

Open gsacavdm opened 2 years ago

gsacavdm commented 2 years ago

Hi! Following the example in section 13.2.2 to perform logistic regression using sklearn on the acs_ny.csv dataset results in a ConvergenceWarning, and the resulting intercept and coefficients don't match those in the book:

import pandas as pd
acs = pd.read_csv('../data/acs_ny.csv')

acs['ge150k'] = pd.cut(acs['FamilyIncome'], [0,150000,acs['FamilyIncome'].max()], labels=[0,1])
acs['ge150k_i'] = acs['ge150k'].astype(int)

predictors = pd.get_dummies(acs[['HouseCosts', 'NumWorkers', 'OwnRent', 'NumBedrooms','FamilyType']], drop_first=True)

from sklearn import linear_model
lr = linear_model.LogisticRegression()

results = lr.fit(X=predictors, y=acs['ge150k_i'])

ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

To address this warning I increased the max number of iterations as follows:

lr.max_iter = 1000 # my default value was 100
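The warning names two remedies: raising max_iter and scaling the data. Below is a minimal sketch of both, using synthetic stand-in data rather than acs_ny.csv (the feature ranges and the true coefficients are made up for illustration). Note that max_iter only takes effect on the next call to fit(), so the model has to be refit after changing it. Coefficients from the scaled model are expressed per standard deviation of each predictor, so they are not directly comparable to the book's values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, NOT acs_ny.csv): one large-scale
# feature and one small-scale feature, loosely like HouseCosts and NumWorkers.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([rng.uniform(0, 5000, n), rng.integers(0, 4, n)])
logit = -3 + 0.001 * X[:, 0] + 0.5 * X[:, 1]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Remedy 1: pass max_iter to the constructor so it is in effect when fit() runs.
lr = LogisticRegression(max_iter=1000)
lr.fit(X, y)

# Remedy 2: standardize the predictors before fitting, as the warning suggests.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X, y)
scaled_model = scaled_lr.named_steps['logisticregression']

print(lr.intercept_, lr.coef_)
print(scaled_model.intercept_, scaled_model.coef_)
```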

But my intercept and coefficients still don't match those in the book.

Per the book's guidance, I ran the following commands to get the intercept and coefficients:

import numpy as np
values = np.append(results.intercept_, results.coef_)
names = np.append('intercept', predictors.columns)
coefs = pd.DataFrame(values, index = names, columns=['coefs'])
coefs
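The commands above do not create the `or` column that appears in the results. Its values are consistent with exponentiating each coefficient (an odds ratio), so a minimal sketch of that step, assuming this is how the column was computed, is shown here with four of the reported coefficients hard-coded so it runs without the dataset:

```python
import numpy as np
import pandas as pd

# Four of the coefficients reported in this issue, hard-coded for illustration.
names = ['intercept', 'HouseCosts', 'NumWorkers', 'NumBedrooms']
values = [-5.632904, 0.000726, 0.581870, 0.238619]
coefs = pd.DataFrame(values, index=names, columns=['coefs'])

# Assumption: the odds ratio column is the exponentiated coefficient.
coefs['or'] = np.exp(coefs['coefs'])
print(coefs)
```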

And these are the results I get:

                          coefs        or
intercept             -5.632904  0.003578
HouseCosts             0.000726  1.000726
NumWorkers             0.581870  1.789382
NumBedrooms            0.238619  1.269495
OwnRent_Outright       0.570278  1.768759
OwnRent_Rented        -0.692253  0.500447
FamilyType_Male Head  -0.330524  0.718547
FamilyType_Married     1.224612  3.402845

Very different from those in the book:

                           coef        or
intercept             -5.492705  0.004117
HouseCosts             0.000710  1.000710
NumWorkers             0.559836  1.750385
NumBedrooms            0.222619  1.249345
OwnRent_Outright       1.180146  3.254851
OwnRent_Rented        -0.730046  0.481887
FamilyType_Male Head   0.318643  1.375260
FamilyType_Married     1.213134  3.364012

I wouldn't be as thrown off if the differences were a few decimal places or so, but my results assign substantially less weight to OwnRent_Outright and give FamilyType_Male Head a coefficient of the opposite sign, and I have no idea why.
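To make the size of the discrepancy concrete, here is a small sketch that puts the two sets of coefficients quoted in this issue side by side. The column names `reported`, `book`, `diff`, `abs_diff`, and `sign_flip` are made up for illustration; the numbers are copied from the two tables.

```python
import pandas as pd

names = ['intercept', 'HouseCosts', 'NumWorkers', 'NumBedrooms',
         'OwnRent_Outright', 'OwnRent_Rented',
         'FamilyType_Male Head', 'FamilyType_Married']
# Coefficients obtained in this issue and those printed in the book.
reported = [-5.632904, 0.000726, 0.581870, 0.238619,
            0.570278, -0.692253, -0.330524, 1.224612]
book = [-5.492705, 0.000710, 0.559836, 0.222619,
        1.180146, -0.730046, 0.318643, 1.213134]

comparison = pd.DataFrame({'reported': reported, 'book': book}, index=names)
comparison['diff'] = comparison['reported'] - comparison['book']
comparison['abs_diff'] = comparison['diff'].abs()
# True where the two coefficients have opposite signs.
comparison['sign_flip'] = (comparison['reported'] * comparison['book']) < 0

# Largest discrepancies first.
print(comparison.sort_values('abs_diff', ascending=False))
```

Sorting by the absolute difference puts FamilyType_Male Head and OwnRent_Outright at the top, with the remaining terms differing by much smaller amounts.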

PS - I have VERY little experience with statistics and data science; my background is computer science.