AllenDowney / ThinkStats2

Text and supporting code for Think Stats, 2nd Edition
http://allendowney.github.io/ThinkStats2/
GNU General Public License v3.0

Issue with code in Section 11.4 #31

Status: Closed (LucianU closed this issue 2 years ago)

LucianU commented 8 years ago

I'm getting an error when running the code in this section. Here's the shell session:

In[5]: live, firsts, others = first.MakeFrames()
In[6]: live = live[live.prglngth > 30]
In[7]: import chap01soln
In[8]: resp = chap01soln.ReadFemResp()
In[9]: resp.index = resp.caseid
In[10]: join = live.join(resp, on='caseid', rsuffix='_r')
...
In[15]: def find_vars(data):
...         t = []
...         for name in join.columns:
...             try:
...                 if join[name].var() < 1e-7:
...                     continue
...                 formula = 'totalwgt_lb ~ agepreg + ' + name
...                 model = smf.ols(formula, data=join)
...                 if model.nobs < len(join) / 2:
...                     continue
...                 results = model.fit()
...             except (ValueError, TypeError):
...                 continue
...             t.append((results.rsquared, name))
...         return t
In[16]: t = find_vars(join)
Traceback (most recent call last):
  File "/Users/lucian/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 3035, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-16-d2e6ddbced30>", line 1, in <module>
    t = find_vars(join)
  File "<ipython-input-15-7173795bc6ef>", line 8, in find_vars
    model = smf.ols(formula, data=join)
  File "/Users/lucian/anaconda/lib/python2.7/site-packages/statsmodels/base/model.py", line 147, in from_formula
    missing=missing)
  File "/Users/lucian/anaconda/lib/python2.7/site-packages/statsmodels/formula/formulatools.py", line 65, in handle_formula_data
    NA_action=na_action)
  File "/Users/lucian/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 299, in dmatrices
    raise PatsyError("model is missing required outcome variables")
PatsyError: model is missing required outcome variables

AllenDowney commented 8 years ago

It looks like Issue #15 is back. It's a problem with Patsy, so I don't have an easy way to fix it. Encoding the formula as ascii seemed like it solved the problem, but apparently not.

Since I can't fix it, I added an error message: https://github.com/AllenDowney/ThinkStats2/commit/ca7e911a1aa103b6560661ebf8bd2cc3e6ec76d7

Workarounds: 1) Use Python 2 for this example. 2) Skip this example.

Sorry!

pglezen commented 8 years ago

The same problem seems to happen with Python 2 as well. I get the same stack trace as @LucianU. The relevant output for my environment from pd.show_versions(as_json=False) is:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
pandas: 0.17.0

Cython: 0.22
numpy: 1.10.1
scipy: 0.15.1
statsmodels: 0.6.1
patsy: 0.3.0

I was able to get the sample to work if I encoded the formula as suggested in #15.

 formula = ('totalwgt_lb ~ agepreg + ' + name).encode('ascii')

LucianU commented 8 years ago

I can confirm that I'm using Python 2.

AllenDowney commented 8 years ago

Right, it looks like we need to encode the formula for both Python 2 and 3.

But in 3 it looks like it doesn't work even with the encode.

So the code in regression.py is the best I can do for now.

The example in the book doesn't include the encode step. I can add it, but I am not sure whether it will decrease the net level of confusion. Thinking...

AllenDowney commented 8 years ago

And does the encoding suggested by Paul Glezen work for you, too?

LucianU commented 8 years ago

@AllenDowney, yes it does. I think it's worth adding it to the book and specifying that it's needed because of an issue in patsy.

AllenDowney commented 8 years ago

If I understand the issues:

1) In Python 2, the code in regression.py works because it encodes the patsy formula as ascii. But the code in the book omits this line, so if someone tries to run the code directly from the book, they're going to get a confusing message. I am not sure whether adding this to the book will increase or decrease the total amount of confusion.

2) In Python 3, it seems, the code in regression.py doesn't work despite the fact that it encodes the formula in ascii. It doesn't look like I can fix this.
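
A minimal sketch of why the same encode call can behave differently under the two interpreters. The return type of str.encode is standard Python behavior; whether that type difference is what trips patsy is an assumption this thread does not confirm. The column name birthord is used only for illustration.

```python
import sys

# Illustrative formula; 'birthord' stands in for any column name.
formula = 'totalwgt_lb ~ agepreg + birthord'
encoded = formula.encode('ascii')

# Python 2: encode() returns the native str type (a byte string).
# Python 3: encode() returns bytes, which is not a str.
is_native_str = isinstance(encoded, str)
print('Python %d: encode() returned %s, native str: %s'
      % (sys.version_info[0], type(encoded).__name__, is_native_str))
```

If patsy only recognizes a native str as a formula (again, an assumption), this would be consistent with the encoded formula helping in Python 2 and not in Python 3.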

FlorianGD commented 6 years ago

Hi, for what it's worth, I managed to run the code in Python 3 by commenting out the line that sets the encoding.

ijmiller2 commented 6 years ago

Thanks @FlorianGD, the commenting worked for me too (in Python 3).
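
Taken together, the reports in this thread suggest encoding the formula under Python 2 only. Here is a minimal sketch under that assumption. make_formula is a hypothetical helper, not part of regression.py, and it has not been checked against every patsy version.

```python
import sys


def make_formula(name, outcome='totalwgt_lb', control='agepreg'):
    """Build the Section 11.4 regression formula for one column.

    Hypothetical helper for illustration; not part of the book's code.
    """
    formula = '%s ~ %s + %s' % (outcome, control, name)
    if sys.version_info[0] < 3:
        # Python 2: encode to ASCII, the workaround reported in this thread.
        formula = formula.encode('ascii')
    # Python 3: leave the formula unencoded, as reported in this thread.
    return formula


print(make_formula('birthord'))
```

Inside find_vars, the result would be passed along as smf.ols(make_formula(name), data=join).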