AllenDowney / ThinkStats2

Text and supporting code for Think Stats, 2nd Edition
http://allendowney.github.io/ThinkStats2/
GNU General Public License v3.0
4.03k stars 11.31k forks source link

Chapter 11 Exercises - PatsyError bug in pre-made code #101

Closed jmauritzDRG closed 2 years ago

jmauritzDRG commented 6 years ago

I've been loving ThinkStats2 so far for learning statistics and expanding my knowledge of Python simultaneously.

In chapter 11's exercises, I'm seeing something that has never happened before: an error in the pre-made code that comes with the exercises. After defining GoMining, when the notebook tries to call it, I get an error: "NameError: name 'patsy' is not defined"

After importing patsy in an earlier block, I try to run the code again, and then I get the error: "NameError: name 'MESSAGE' is not defined"

At that point, I tried defining MESSAGE and removing it from raise ValueError() entirely, but both of those attempts produced more errors. I'm not clued in on how this module needs to be talked to in order to function, thus I cannot work with this portion of the exercises, and I imagine nobody else can either.

Thanks again for making this fantastic book!!

EDIT: When I replace raise ValueError(MESSAGE) with print('ERROR'), it seems like I get a step further, but then I get another error -- UnboundLocalError: local variable 'results' referenced before assignment -- that I am having a lot of trouble troubleshooting on my own. I would be grateful for any insight, as I am really excited about this concept!

AllenDowney commented 6 years ago

Please see this issue https://github.com/AllenDowney/ThinkStats2/issues/31

And let me know if you find a solution there.

violet4 commented 6 years ago

MESSAGE is not defined anywhere, so even if a patsy.PatsyError occurs and you catch it, the following line will fail. maybe the intention was to print the error, in which case (at least in python 3) the fix would be adding "as MESSAGE" so it says "except patsy.PatsyError as MESSAGE:".

if an exception occurs before "results" variable is defined on line 21, then results will not be defined, and trying to use "results" on line 27 will fail. if defining "results" fails then you probably don't want to use it on line 27, especially because if the for loop succeeds once and then fails for every following iteration afterwards, then the same result will be used n times (where n is the number of iterations). I recommend placing "variables.append((results.rsquared, name))" right after line 21, so if defining results fails then it won't attempt to use it.

I look forward to reading the book when I have time! I purchased a physical copy!

AllenDowney commented 6 years ago

Ok, I will check it out as soon as I can.

On Tue, Mar 20, 2018 at 8:01 PM, mica notifications@github.com wrote:

MESSAGE is not defined anywhere, so even if a patsy.PatsyError occurs and you catch it, the following line will fail. maybe the intention was to print the error, in which case (at least in python 3) the fix would be adding "as MESSAGE" so it says "except patsy.PatsyError as MESSAGE:".

if an exception occurs before "results" variable is defined on line 21, then results will not be defined, and trying to use "results" on line 27 will fail. if defining "results" fails then you probably don't want to use it on line 27, especially because if the for loop succeeds once and then fails for every following iteration afterwards, then the same result will be used n times (where n is the number of iterations). I recommend placing "variables.append((results.rsquared, name))" right after line 21, so if defining results fails then it won't attempt to use it.

I look forward to reading the book when I have time! I purchased a physical copy!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/AllenDowney/ThinkStats2/issues/101#issuecomment-374798054, or mute the thread https://github.com/notifications/unsubscribe-auth/ABy37VUR1rTM015bQZ3QOL8fYDIq7ygnks5tgZhxgaJpZM4SysR1 .

jmauritzDRG commented 6 years ago

Thanks for the quick replies!

I checked #31, and ended up in the scenario that mica5 mentioned. When I run this modified version of GoMining, my output is flooded with "MESSAGE." This is the closest I've been able to get to it working -- putting it right after line 21 with the same indentation it had on line 27 results in a syntax error. Here's the code that throws the flood of "MESSAGE"s:

`def GoMining(df): """Searches for variables that predict birth weight.

df: DataFrame of pregnancy records

returns: list of (rsquared, variable name) pairs
"""
variables = []
for name in df.columns:
    try:
        if df[name].var() < 1e-7:
            continue

        formula = 'totalwgt_lb ~ agepreg + ' + name
        formula = formula.encode('ascii')

        model = smf.ols(formula, data=df)
        if model.nobs < len(df)/2:
            continue

        results = model.fit()
        variables.append((results.rsquared, name))
    except (ValueError, TypeError):
        continue
    except patsy.PatsyError:
        print('MESSAGE')

return variables`

Thanks again!

violet4 commented 6 years ago

I did some actual digging around and found an error (there's a possibility that the error that's occurring for you is different than the one that's occurring for me, but hopefully not..) this line:

model = smf.ols(formula, data=df)

is throwing this error:

PatsyError: model is missing required outcome variables

I dug into patsy and found patsy/highlevel.py function _try_incr_builders that is looking at how to parse the formula. I tried different forms of passing in the formula - as a tuple, as a string.. but each one threw an error.

as a string (I just removed the line that encodes the string):

.../statsmodels/base/data.py in _handle_constant(self, hasconst)
    129             # detect where the constant is
    130             check_implicit = False
--> 131             const_idx = np.where(self.exog.ptp(axis=0) == 0)[0].squeeze()
    132             self.k_constant = const_idx.size
    133 

ValueError: zero-size array to reduction operation maximum which has no identity

as a tuple: formula = ('totalwgt_lb', 'agepreg + ' + name)

.../statsmodels/base/data.py in _convert_endog_exog(self, endog, exog)
    469         exog = exog if exog is None else np.asarray(exog)
    470         if endog.dtype == object or exog is not None and exog.dtype == object:
--> 471             raise ValueError("Pandas data cast to numpy dtype of object. "
    472                              "Check input data with np.asarray(data).")
    473         return super(PandasData, self)._convert_endog_exog(endog, exog)

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).

but when I was digging around I found it correctly parsed one form and resulted in this: .../patsy/highlevel.py

 61     if isinstance(formula_like, str):
 62         formula_like = ModelDesc.from_formula(formula_like)

so I took the result of that, put it directly in chap11ex.ipynb:

formula = patsy.desc.ModelDesc(lhs_termlist=[patsy.desc.Term([patsy.eval.EvalFactor('totalwgt_lb')])],
      rhs_termlist=[patsy.desc.Term([]),
                    patsy.desc.Term([patsy.eval.EvalFactor('agepreg')]),
                    patsy.desc.Term([patsy.eval.EvalFactor(name)])])

so when I manually passed that in, it didn't crash, and it ran for a while and seemed to really be doing something.

I'm guessing the underlying behavior of patsy changed since this code was written. It also seems possible that patsy has a bug, because although it properly turned the input into a ModelDesc using "formula_like = ModelDesc.from_formula(formula_like)", it failed to handle the result properly, but when giving that same result as the input, it is handled properly. without looking too hard at patsy, I'm guessing that the block that expects a ModelDesc should come below the blocks that turn formula_like into a ModelDesc.

finally, this is the code I ended up with that appears to work properly:

def GoMining(df):
    """Searches for variables that predict birth weight.

    df: DataFrame of pregnancy records

    returns: list of (rsquared, variable name) pairs
    """
    variables = []
    for name in df.columns:
        try:
            if df[name].var() < 1e-7:
                continue
        except TypeError:
            continue

        formula = patsy.desc.ModelDesc(lhs_termlist=[patsy.desc.Term([patsy.eval.EvalFactor('totalwgt_lb')])],
              rhs_termlist=[patsy.desc.Term([]),
                            patsy.desc.Term([patsy.eval.EvalFactor('agepreg')]),
                            patsy.desc.Term([patsy.eval.EvalFactor(name)])])

        try:
            model = smf.ols(formula, data=df)
        except ValueError:
            continue

        if model.nobs < len(df)/2:
            continue

        try:
            results = model.fit()
        except patsy.PatsyError as MESSAGE:
            raise ValueError(MESSAGE)

        variables.append((results.rsquared, name))

    return variables

then:

sorted(variables, key=lambda x: x[0], reverse=True)
[(1.0, 'totalwgt_lb'),
 (0.94981273059780091, 'birthwgt_lb'),
 (0.30082407844707715, 'lbw1'),
 (0.13012519488625085, 'prglngth'),
 (0.12340041363361054, 'wksgest'),
...
till0r commented 6 years ago

I can confirm that @mica5 's solution fixed the issue after adding import patsy.

aoshomoji commented 6 years ago

The code appears to have an issue with the line:

formula = formula.encode('ascii')

This converted the type from string to bytes. Removing this line and leaving the formula as a string type fixed the issue for me.

AllenDowney commented 2 years ago

This should be resolved with 5f17cbe