PSLmodels / taxdata

The TaxData project prepares microdata for use with the Tax-Calculator microsimulation project.
http://pslmodels.github.io/taxdata/

Statsmodels' MNLogit encountering issues predicting sign of P22250 #244

Closed. Abraham-Leventhal closed this issue 6 years ago.

Abraham-Leventhal commented 6 years ago

@martinholmer @MaxGhenis @andersonfrailey

Below is my code which is intended to predict P22250's sign using statsmodels' MNLogit:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

#Remove aggregate rows, replace NaN's with 0.

puf = pd.read_csv('puf2011.csv')
puf = puf[(puf['RECID'] != 999996) &
          (puf['RECID'] != 999997) &
          (puf['RECID'] != 999998) &
          (puf['RECID'] != 999999)
         ]

puf = puf.fillna(0)

#Create categorical sign column, {-1: neg, 0: zero, 1: pos}

puf['sign'] = np.where(puf['P22250'] == 0, 0, np.where(puf['P22250'] > 0, 1, -1))

#Prune puf to include only: RECID, P22250, categorical and predictor columns.

predictors =  ['DSI', 'EIC', 'MARS', 'XTOT', 'E00200', 'E00300', 'E00400','E00600', 
               'E00650', 'E00800', 'E00900', 'E01100', 'E01400', 'E01500', 'E01700',
               'E02100', 'E02300', 'E02400', 'E03150', 'E03210', 'E03240', 'E03270',
               'E03300', 'E17500', 'E18400', 'E18500', 'E19200', 'E19800', 'E20100',
               'E20400', 'E32800', 'F2441', 'N24']

keep = ['RECID', 'sign'] + predictors

puf = puf[keep]

#Formula = sign ~ DSI + EIC + ... + F2441 + N24

formula = 'sign ~ ' + ' + '.join(predictors)
model = smf.mnlogit(formula = formula, data= puf).fit()
model.summary()

Below are the errors this produces for me. model.summary() outputs a summary in which all of the parameter values are NaN.

RuntimeWarning: overflow encountered in exp
  eXB = np.column_stack((np.ones(len(X)), np.exp(X)))
/anaconda3/lib/python3.6/site-packages/statsmodels/discrete/discrete_model.py:2128: RuntimeWarning: invalid value encountered in true_divide
  return eXB/eXB.sum(1)[:,None]
/anaconda3/lib/python3.6/site-packages/statsmodels/base/optimizer.py:271: RuntimeWarning: invalid value encountered in greater
  oldparams) > tol)):
Optimization terminated successfully.
         Current function value: nan
         Iterations 7

/anaconda3/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
/anaconda3/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
/anaconda3/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)

Considering the possibility that perfectly collinear predictors were my problem, I removed E00650 (contained in E00600) as well as F2441 and N24 (which might be collinear with EIC, as they all report on qualifying children). This did not change the results.

I am using Python 3.6.4, statsmodels 0.9.0, and SciPy 1.1.

MaxGhenis commented 6 years ago

Has mnlogit ever run successfully, or have you only been using sklearn for Python multinomial modeling?

A couple things you could try:

  1. Just use one predictor, e.g. sign ~ E00200.
  2. Use the non-formula version, i.e. sm.MNLogit(puf.sign, puf[predictors]). Personally I prefer this anyway, since you can avoid the formula hackery.

mnlogit and MNLogit worked for me in a very basic example (notebook).
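A minimal sketch of the non-formula route, assuming the puf DataFrame, sign column, and predictors list from the script above (the explicit constant mirrors the intercept the formula interface adds):

import statsmodels.api as sm

X = sm.add_constant(puf[predictors])   # formula interface adds an intercept; add it explicitly here
y = puf['sign']
res = sm.MNLogit(y, X).fit()
print(res.summary())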

Abraham-Leventhal commented 6 years ago

@MaxGhenis Yeah mnlogit runs successfully in other cases, just not in this one when using all of the predictors that I'd like.

  1. Using just one predictor, and random small batches of them, MNLogit seems to work fine.
  2. The non-formula version does not work either; it encounters the same errors when I use all of the predictors.

I will do some trial and error and see which predictors are causing problems.
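One way to automate that trial and error is a leave-one-out loop over the predictors; a rough sketch, assuming the puf DataFrame, sign column, and predictors list from my first comment:

import numpy as np
import statsmodels.formula.api as smf

for dropped in predictors:
    kept = [p for p in predictors if p != dropped]
    formula = 'sign ~ ' + ' + '.join(kept)
    try:
        res = smf.mnlogit(formula=formula, data=puf).fit(disp=0)
        # flag fits whose coefficients come back as NaN
        ok = not np.isnan(res.params.values).any()
    except Exception:
        ok = False
    print(dropped, 'fit is clean when this predictor is dropped:', ok)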

Abraham-Leventhal commented 6 years ago

@MaxGhenis After trial and error I have found that removing E00650 and E01100 allows smf.mnlogit to run:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

#Read, drop aggregate rows, fill NaN with 0

puf = pd.read_csv('puf2011.csv')
puf = puf[(puf['RECID'] != 999996) &
          (puf['RECID'] != 999997) &
          (puf['RECID'] != 999998) &
          (puf['RECID'] != 999999)
         ]

puf = puf.fillna(0)

#Create categorical sign column

puf['sign'] = np.where(puf['P22250'] == 0, 0, np.where(puf['P22250'] > 0, 1, -1))

#Prune puf to include only: RECID, P22250, categorical and predictor columns.
#E00650 and E01100 have been removed to allow mnlogit to work.  

predictors =  ['DSI', 'EIC', 'MARS', 'XTOT', 'E00200', 'E00300', 'E00400','E00600',
               'E00800', 'E00900', 'E01400', 'E01500', 'E01700','E02100', 'E02300', 
               'E02400', 'E03150', 'E03210', 'E03240', 'E03270','E03300', 'E17500', 
               'E18400', 'E18500', 'E19200', 'E19800', 'E20100','E20400', 'E32800', 
               'F2441', 'N24']

keep = ['RECID', 'sign'] + predictors

puf = puf[keep]

#Formula = sign ~ DSI + EIC + ... + F2441 + N24

formula = 'sign ~ ' + ' + '.join(predictors)
smf.mnlogit(formula = formula, data = puf).fit().summary()

While E00650 is perfectly collinear with E00600, I don't know what's special about E01100. It's described as "Capital gain distributions not reported on Sch D." A full description of each of the predictors that I am using can be found below; I do not see why E01100 is such a problem.

Click to expand

{'DSI': '1 if claimed as dependent on another return; otherwise 0',
 'EIC': 'number of EIC qualifying children (range: 0 to 3)',
 'FLPDYR': 'Calendar year for which taxes are calculated',
 'MARS': 'Filing (marital) status: line number of the checked box [1=single, 2=joint, 3=separate, 4=household-head, 5=widow(er)]',
 'XTOT': 'Total number of exemptions for filing unit',
 'e00200': 'Wages, salaries, and tips for filing unit',
 'e00300': 'Taxable interest income',
 'e00400': 'Tax-exempt interest income',
 'e00600': 'Ordinary dividends included in AGI',
 'e00650': 'Qualified dividends included in ordinary dividends',
 'e00800': 'Alimony received',
 'e00900': 'Sch C business net profit/loss for filing unit',
 'e01100': 'Capital gain distributions not reported on Sch D',
 'e01400': 'Taxable IRA distributions',
 'e01500': 'Total pensions and annuities',
 'e01700': 'Taxable pensions and annuities',
 'e02100': 'Farm net income/loss for filing unit from Sch F',
 'e02300': 'Unemployment insurance benefits',
 'e02400': 'Total social security (OASDI) benefits',
 'e03150': 'Total deductible IRA contributions',
 'e03210': 'Student loan interest',
 'e03240': 'Domestic production activities from Form 8903',
 'e03270': 'Self-employed health insurance deduction',
 'e03300': 'Contributions to SEP, SIMPLE and qualified plans',
 'e17500': 'Sch A: Medical and dental expenses',
 'e18400': 'Sch A: State and local income/sales taxes',
 'e18500': 'Sch A: Real-estate taxes paid',
 'e19200': 'Sch A: Interest paid',
 'e19800': 'Sch A: Gifts to charity: cash/check contributions',
 'e20100': 'Sch A: Gifts to charity: other than cash/check contributions',
 'e20400': 'Sch A: Miscellaneous deductions subject to 2% AGI limitation',
 'e32800': 'Child/dependent-care expenses for qualifying persons from Form 2441',
 'f2441': 'number of child/dependent-care qualifying persons',
 'n24': 'Number of children who are Child-Tax-Credit eligible, one condition for which is being under age 17'}

MaxGhenis commented 6 years ago

I dug into the CPS data here and confirmed that e00650 is pretty collinear with e00600, with a correlation coefficient exceeding 0.999, including when limiting to the ~quarter of records with nonzero values of either.

I don't think e01100 should be problematic though; its maximum absolute correlation with the other features is 0.12, and removing it didn't change the minimum eigenvalue (a way to test collinearity). Instead, separately removing each predictor from your list using a loop shows that e19800 and e20100 could be the problem. These are defined respectively as

Itemizable charitable giving: cash/check contributions.

and

Itemizable charitable giving: other than cash/check contributions.

Both include this in their description:

WARNING: this variable is already capped in PUF data.

Their correlation exceeds 0.9999999. Could you try removing one of these and keeping e01100? Or, if the PUF data is different, you may want to check this there.
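For reference, a sketch of those two checks (pairwise correlation and the minimum eigenvalue of the correlation matrix), assuming a DataFrame df restricted to the candidate predictor columns; the name df and the lowercase CPS-style column names are illustrative:

import numpy as np

corr = df.corr()                           # pairwise Pearson correlations
print(corr.loc['e19800', 'e20100'])        # values near 1 indicate near-collinearity
eigvals = np.linalg.eigvalsh(corr.values)  # eigenvalues of the correlation matrix
print(eigvals.min())                       # values near 0 also indicate near-collinearity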

@andersonfrailey it seems like these should be uncorrelated; is this a taxdata issue?


MaxGhenis commented 6 years ago

Ah, it looks like the CPS only has a single total for charitable contributions, which is split for all records using the ratio in the PUF: https://github.com/open-source-economics/taxdata/blob/4f86528bc264f11fe70583cbf35a2a4a5b8b7983/cps_data/finalprep.py#L149

Given this, I'd recommend adding e19800 and e20100 together into a new feature and using that for imputation across the two datasets.

It seems like this might not be the issue if you're hitting it in the PUF, so I'd also re-run my notebook using PUF data to identify the collinearity.

martinholmer commented 6 years ago

@Abraham-Leventhal said he was executing this Python script:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

#Read, drop aggregate rows, fill NaN with 0

puf = pd.read_csv('puf2011.csv')
puf = puf[(puf['RECID'] != 999996) &
          (puf['RECID'] != 999997) &
          (puf['RECID'] != 999998) &
          (puf['RECID'] != 999999)
         ]

puf = puf.fillna(0)

#Create categorical sign column

puf['sign'] = np.where(puf['P22250'] == 0, 0, np.where(puf['P22250'] > 0, 1, -1))

#Prune puf to include only: RECID, P22250, categorical and predictor columns.
#E00650 and E01100 have been removed to allow mnlogit to work.  

predictors =  ['DSI', 'EIC', 'MARS', 'XTOT', 'E00200', 'E00300', 'E00400','E00600',
               'E00800', 'E00900', 'E01400', 'E01500', 'E01700','E02100', 'E02300', 
               'E02400', 'E03150', 'E03210', 'E03240', 'E03270','E03300', 'E17500', 
               'E18400', 'E18500', 'E19200', 'E19800', 'E20100','E20400', 'E32800', 
               'F2441', 'N24']

keep = ['RECID', 'sign'] + predictors

puf = puf[keep]

#Formula = sign ~ DSI + EIC + ... + F2441 + N24

formula = 'sign ~ ' + ' + '.join(predictors)
smf.mnlogit(formula = formula, data = puf).fit().summary()

There are several things about this script that I don't understand.

  1. Why are you doing this:

    puf = pd.read_csv('puf2011.csv')

    when there appears to be no such file in the taxdata repository? For example, here is what I get in my taxdata repository:

    $ pwd
    /Users/mrh/work/OSPC/taxdata
    $ find . -name puf2011.csv
    $ 

    What taxdata subdirectory is your script in?

  2. Why are you doing this:

    puf['sign'] = np.where(puf['P22250'] == 0, 0, np.where(puf['P22250'] > 0, 1, -1))

    This sets sign to -1 when gains are negative. Are you sure that a variable with values -1, 0, 1 will have the omitted category be -1? Remember we want to estimate logit coefficients for the zero and positive categories. Maybe what you're doing is OK, but I can't tell because you don't show your results.

  3. Why are you doing this:

    
    #Formula = sign ~ DSI + EIC + ... + F2441 + N24

    formula = 'sign ~ ' + ' + '.join(predictors)

This approach is not appropriate because you are treating all the "predictors" as quantitative variables (like `e00200` earnings) when some are categorical variables (like `MARS`). What you're doing is assuming that a couple (`MARS==2`) has twice the effect of a single individual (`MARS==1`), but the numbers assigned to `MARS` are completely arbitrary. You need each multiple-category categorical variable (`MARS` and `EIC` at a minimum) to be a set of dummy variables. Read [this R tutorial](https://stats.idre.ucla.edu/r/modules/coding-for-categorical-variables-in-regression-models/) about how to use the `C` function in an R formula to automatically generate a set of dummy variables; a sketch of the same idea in a statsmodels formula appears after this list.

4. Finally, statistical models are not built to be used like machine learning models.  In particular, they often don't deal well when a laundry list of "predictors" is thrown at them.  You have already experienced that.  I suggest you start with the simplest model, one with no "predictors" (just a constant term in the R formula), just to see if the `sign` variable is specified correctly (see my question 2).  Then add more variables that have a plausible reason to be included in a prediction model of short-term capital gains.
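As referenced in point 3, here is a minimal sketch of the dummy-variable approach using C() in a statsmodels (patsy) formula; it assumes the puf DataFrame and predictors list from the script above and treats MARS and EIC as categorical:

import statsmodels.formula.api as smf

quantitative = [p for p in predictors if p not in ('MARS', 'EIC')]
formula = 'sign ~ C(MARS) + C(EIC) + ' + ' + '.join(quantitative)
res = smf.mnlogit(formula=formula, data=puf).fit()
print(res.summary())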
MaxGhenis commented 6 years ago

You need each multiple-category categorical variable (MARS and EIC at a minimum) to be a set of dummy variables.

Like R, statsmodels creates k-1 dummies for you if the predictor is a string (example). So you can transform these to strings to avoid the need for dummies.

NB: sklearn requires dummies, which you can generate with the pandas get_dummies function. NB2: nonlinear methods like random forests will be OK with these columns as-is, though it wouldn't hurt to use dummies there too.
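A small sketch of both routes, assuming the puf DataFrame and predictors list from earlier in the thread:

import pandas as pd

# statsmodels/patsy: a string column is expanded into k-1 dummies automatically
puf['MARS'] = puf['MARS'].astype(str)

# sklearn: build the dummies yourself, dropping one level as the baseline
mars_dummies = pd.get_dummies(puf['MARS'], prefix='MARS', drop_first=True)
X = pd.concat([puf[predictors].drop(columns='MARS'), mars_dummies], axis=1)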

Finally, statistical models are not built to be used like machine learning models. In particular, they often don't deal well when a laundry list of "predictors" is thrown at them.

I agree that when you care about inference on the coefficients, you may want to be thoughtful about which predictors to include, especially ones correlated with the variables whose coefficients you care most about. You also hit model-stability issues when the number of predictors approaches the number of observations. But I don't think either of these matches our use case: we're most interested in accurate predictions. Removing features (except very highly collinear ones) could reduce model accuracy (as was seen earlier regarding N24), so I'd recommend checking the log-loss after changing the set of predictors.

Abraham-Leventhal commented 6 years ago

@martinholmer

  1. What taxdata subdirectory is your script in?

My script's directory is outside my local copy of the repo while these scripts are preliminary; this will be changed for any final PR.

  2. Are you sure that a variable with values -1, 0, 1 will have the omitted category be -1? Remember we want to estimate logit coefficients for the zero and positive categories. Maybe what you're doing is OK, but I can't tell because you don't show your results.

Yes, -1 is the omitted category and the output reports the parameters for sign = 0, 1. I did not include this output because almost every value is NaN. Here is a notebook where you can see my output (link). Does this indicate the sign variable is specified correctly?

  3. You need each multiple-category categorical variable (MARS and EIC at a minimum) to be a set of dummy variables.

OK, I will implement this.

  4. In particular, they often don't deal well when a laundry list of "predictors" is thrown at them. You have already experienced that.

Aside from this particular case where my model breaks entirely, in my experience this summer I have found that increasing the number of predictors only increased the accuracy of the model I was using (for both OLS and multinomial logit, with R^2 and log-loss as scoring metrics).

I suggest you start with the simplest model, one with no "predictors" (just a constant term in the R formula), just to see if the sign variable is specified correctly (see my question 2). Then add more variables that have a plausible reason to be included in a prediction model of short-term capital gains.

If by specified correctly you mean the model is omitting one category and reporting parameters for the other two, then it seems to be specified correctly. I will start small and move up in the number of variables used.

Question: Is there a standard minimum marginal predictive power below which one should not include a variable as a predictor in one's model? By that I mean, how small an increase in predictive power (in pseudo-R-squared, log-loss, or % accuracy of categorical predictions) justifies the inclusion of a variable in a model? This would be informed by the intuitive connection between the predictor and the response, but sometimes my intuitive sense that variables will be correlated is inaccurate.

Thanks very much for your input

martinholmer commented 6 years ago

@Abraham-Leventhal asked this question:

Question: Is there a standard minimum marginal predictive power below which one should not include a variable as a predictor in one's model? By that I mean, how small an increase in predictive power (in pseudo-R-squared, log-loss, or % accuracy of categorical predictions) justifies the inclusion of a variable in a model? This would be informed by the intuitive connection between the predictor and the response, but sometimes my intuitive sense that variables will be correlated is inaccurate.

Good question. There is a rough rule of thumb and an exact test of whether dropping a variable significantly reduces the model's explanatory performance. The rule of thumb: drop variables whose t-statistic (in absolute value) is less than two. The exact test: the likelihood ratio test (as I mentioned weeks ago).
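A sketch of that likelihood ratio test with statsmodels, assuming the puf DataFrame from earlier and a predictors list that fits cleanly; dropping E00300 is just an illustration:

import scipy.stats as st
import statsmodels.formula.api as smf

full = smf.mnlogit('sign ~ ' + ' + '.join(predictors), data=puf).fit(disp=0)
reduced_vars = [p for p in predictors if p != 'E00300']
reduced = smf.mnlogit('sign ~ ' + ' + '.join(reduced_vars), data=puf).fit(disp=0)

lr_stat = 2 * (full.llf - reduced.llf)        # likelihood ratio statistic
df_diff = full.df_model - reduced.df_model    # difference in estimated parameters
p_value = st.chi2.sf(lr_stat, df_diff)        # small p-value => dropping the variable hurts fit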

MaxGhenis commented 6 years ago

@martinholmer what's the value of dropping variables based on those rules of thumb if including them produces more accurate predictions? As a user I'm only interested in accurate data, not parsimony or inferentially useful coefficients.

Abraham-Leventhal commented 6 years ago

@MaxGhenis

Like R, statsmodels creates k-1 dummies for you if the predictor is a string (example). So you can transform these to strings to avoid the need for dummies.

Thanks for the heads up, this is much simpler than making dummies.

I made a puf version of your collinearity notebook. I'm getting markedly different minimum eigenvalues, all around 0.15, much larger than yours (~5e-8). puf2011's E19800 & E20100 appear not to be collinear, and, if I am interpreting it correctly, my notebook implies E01100 is not collinear with any of the other variables.

MaxGhenis commented 6 years ago

Thanks @Abraham-Leventhal for re-running that notebook, which also shows that e00650 and e00600 are considerably less correlated in PUF vs. CPS. I see this makes sense as the CPS e00650 is often set as a constant fraction of e00600. I'd suggest using e00600 only.

I'm getting markedly different minimum eigenvalues, all around 0.15, much larger than yours (~5e-8).

The minimum eigenvalue was 5e-8 when including the nearly-collinear e19800 and e20100. Removing either of these brought it up to 0.03, which is still potentially low enough to warrant another round.

Even though these aren't collinear in the PUF, you should add them into a single feature since they don't mean the same thing in the CPS data. You want to train models on data as similar as possible to the data you'll predict on, and the only quantity that is consistent between the two datasets is the total.

MaxGhenis commented 6 years ago

I re-ran the collinearity notebook after replacing e19800 and e20100 with their sum, e19800_e20100. This shows that the next-most-collinear pair is EIC and n24, with a correlation of 0.97. Could you check the correlation of these in PUF? Since adding n24 improved the model performance before, even if it's a similar correlation I don't think this indicates it needs to be dropped (pending the larger discussion around what this project is optimizing for, and how that influences whether to drop any variables that improve the model).
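A small sketch of the combined feature and the requested PUF check, assuming the puf DataFrame with the upper-case column names used in the earlier scripts:

puf['e19800_e20100'] = puf['E19800'] + puf['E20100']   # single charitable-giving total
print(puf['EIC'].corr(puf['N24']))                     # EIC vs. N24 correlation in the PUF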

Abraham-Leventhal commented 6 years ago

@MaxGhenis Ah, I understand now: the CPS only has one charitable-contribution variable because we split it into e19800/e20100 using the ratio of E19800/E20100 in the PUF. Only using their sum as a predictor makes sense.

From your comment to Martin:

As a user I'm only interested in accurate data, not parsimony or inferentially useful coefficients.

As I understand it, our primary goal is an accurate imputation. If including nearly-irrelevant or mostly-collinear predictors in our model allows it to predict CPS values more accurately, then I think we should do it - assuming those variables don't make the model fail for other reasons.

@martinholmer Do you agree that:

  1. We should use the sum of E19800 and E20100 as a predictor instead of the two separately? They cover schedule A charitable contributions that are cash/check and non-cash/check respectively. The CPS only reports one charitable contribution value, and we split it into e19800 and e20100 using the ratio of E19800/E20100 from PUF.
  2. The primary goal of the imputation is accuracy; thus, should we use as many predictors as possible (even nearly-irrelevant or mostly-collinear ones) as long as they improve the model's log-loss score in predicting the PUF data?
MaxGhenis commented 6 years ago

Variable selection can also be useful if you're concerned about overfitting, but given that we're comparing against an out-of-sample set, predictors that improve that log-loss are likely to improve the end model.

Abraham-Leventhal commented 6 years ago

@martinholmer said in this comment:

You need each multiple-category categorical variable (MARS and EIC at a minimum) to be a set of dummy variables.

EIC does not appear to be categorical; it records the number of children qualifying for the EIC (between 0 and 3), according to the TaxData documentation. Here's a screenshot of the PUF documentation.

[screenshot of the PUF documentation entry for EIC]

Here's a link to my previous comment, which contains at the bottom (in "Click to expand") a description of each variable shared between the CPS and PUF, most of which are used in this prediction task. Link

It appears that MARS is the only categorical variable used.

Abraham-Leventhal commented 6 years ago

@MaxGhenis

Here is the puf collinearity notebook that addresses the correlation between EIC and N24 in the puf, as well as an analysis of the highest correlations among all of the predictors of interest in the puf. It's all at the bottom under "Check general correlations." Link

Just above that section it also analyzes E19800 and E20100, which are not highly correlated in the puf.

Abraham-Leventhal commented 6 years ago

@MaxGhenis @martinholmer

The goal of this issue is to understand what's going wrong with my MNLogit code in the task of predicting the sign of P22250.

Based on your input I have changed the MARS column to a string to ensure it is interpreted as a categorical variable. I also performed a correlation analysis of the predictors to check whether collinearity was causing the problem; you can look at that analysis here: Link 1. E00650 was found to be highly correlated with E00600 (around 0.94), as expected. The other highly correlated predictor pairs are:

E32800 and F2441 (with correlation of 0.84)

N24 and XTOT (with correlation of 0.77)

Based on these results I tried running the model without the correlated predictors, first one at a time and then in groups. I ran the model using the same code as in the first comment on this issue, except that a line was added to change MARS to a column of strings and the correlated predictors were removed. The model encountered the same errors in every trial, whether or not all of the correlated variables were removed; that is, it hit the same error as it does in this notebook: Link 2.

Thus, I ran more trials, starting with a one-predictor model and adding more predictors until the model broke. From this process I found that the model partially works if all predictors except E01100 are included, as can be seen in this notebook: Link 3. In that notebook the parameters are not NaN, but they're mostly very small, and other values like the pseudo-R-squared and log-likelihood are NaN.

Finally, I found that removing both E01100 and E00650 allows the model to run smoothly, as can be seen in this notebook: Link 4. Again, the code is identical to the code from my first comment, except that MARS is made into a string column and E01100 and E00650 are removed as predictors.
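For concreteness, the working specification described above, as a sketch (same puf DataFrame and predictors list as the first comment, with MARS cast to string and E01100/E00650 excluded):

import statsmodels.formula.api as smf

puf['MARS'] = puf['MARS'].astype(str)
working = [p for p in predictors if p not in ('E01100', 'E00650')]
res = smf.mnlogit('sign ~ ' + ' + '.join(working), data=puf).fit()
print(res.summary())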

At this point I have the following outstanding questions regarding this particular issue:

  1. Why is E01100 breaking the model?
  2. Should I continue this project whether or not we know the answer?

And the following questions regarding the project that were brought up in this issue:

  1. Should we use the sum of E19800 and E20100 as one predictor instead of the two separately?
  2. Is the primary goal of this imputation accuracy, and does that mean we should use as many predictors as possible if they improve the model's log-loss?
martinholmer commented 6 years ago

@Abraham-Leventhal said:

Based on your input I have changed the MARS column to a string to ensure it is interpreted as a categorical variable.

Let me stop you right there. I never suggested any such thing. In fact I suggested a totally different approach.

MaxGhenis commented 6 years ago

@martinholmer see from https://github.com/open-source-economics/taxdata/issues/244#issuecomment-403872275:

Like R, statsmodels creates k-1 dummies for you if the predictor is a string (example). So you can transform these to strings to avoid the need for dummies.

Here's the example shown, where Region also has a value of C which is set as the baseline:

[screenshot of the linked statsmodels example, in which Region is expanded into dummies with C as the baseline]

@Abraham-Leventhal Given the E01100 mystery, my 2c would be to try sklearn with a high C parameter (i.e., weak regularization). You could try a toy example to confirm similar results to statsmodels, but at this point I don't see any reason why you'd be hitting the issues you're seeing. Ideally there'd be a minimal reproducible example to share with the statsmodels team, but it sounds like you've given that a good shot to no avail, and the inability to share the full raw data with them hinders debugging.
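A sketch of that sklearn route with weak regularization (high C), assuming the puf DataFrame and predictors list from earlier; the dummy handling and the C value are illustrative:

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.get_dummies(puf[predictors], columns=['MARS'], drop_first=True)
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=1e6, max_iter=1000)
clf.fit(X, puf['sign'])
probs = clf.predict_proba(X)   # per-class probabilities for sign = -1, 0, 1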

The other features you listed aren't so highly correlated as to be problematic IMO; e.g., this SO question suggests 0.95 as an example threshold for collinearity.

MaxGhenis commented 6 years ago

Should I continue this project whether or not we know the answer?

One option would be to proceed with the best model that runs (i.e. without E01100 for now), since you still need to do the linear regression conditional on sign. It's still possible that a random forest (or some other model, if you wanted to explore one) outperforms the trinomial+lm approach on the final outcome metric, which would make this error moot.

andersonfrailey commented 6 years ago

Catching up on the discussion in this issue. I agree with @Abraham-Leventhal and @MaxGhenis that for this specific project we're going for accurate data more than anything else, and if including additional variables in the model achieves that, we should use them. @martinholmer I understand your point about statistical models not performing well when a ton of variables are thrown at them indiscriminately, but my understanding is that @Abraham-Leventhal has found which ones were causing problems and removed them from the analysis. Are there other concerns you have with what he's still including?


I'm in favor of using the sum of e19800 and e20100 as a predictor rather than using each individually because the CPS does not distinguish between cash and non-cash charitable giving. We make that split during the final prep stages for the CPS.


Based on your input I have changed the MARS column to a string to ensure it is interpreted as a categorical variable.

Let me stop you right there. I never suggested any such thing. In fact I suggested a totally different approach.

I may be off here, but I think y'all are talking about roughly the same thing. My interpretation of what @martinholmer suggested is that we should create three new variables based on MARS, something like mars_1, mars_2, mars_3, each with a 1/0 value based on MARS for that unit, while @MaxGhenis is saying that statsmodels will automatically interpret MARS as a categorical variable if it is a string, removing the need to create those new MARS-based variables. From the linked example:

In [10]: res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()

In [11]: print(res.params)
Intercept         38.651655
C(Region)[T.E]   -15.427785
C(Region)[T.N]   -10.016961
C(Region)[T.S]    -4.548257
C(Region)[T.W]   -10.091276
Literacy          -0.185819
Wealth             0.451475
dtype: float64

In this example, Region is a categorical variable with possible values of E, N, S, W. statsmodels recognized that and created the needed categorical variables. Is that an accurate summary? If so, wouldn't both approaches have the same results?

MaxGhenis commented 6 years ago

@andersonfrailey that's right, with one clarification:

In this example, Region is a categorical variable with possible values of E, N, S, W. statsmodels recognized that and created the needed categorical variables. Is that an accurate summary? If so, wouldn't both approaches have the same results?

Region in this case also has a value of C, which is used as the baseline to avoid collinearity (you always want k-1 dummies). It picks the first alphabetically as baseline, so for MARS the baseline value would be 1 and it would create dummies for 2/3/4/5.

Abraham-Leventhal commented 6 years ago

@martinholmer I wanted to clarify that the goal with MNLogit is to stochastically impute its predictions for the sign of P22250 (and the other variables) using the same technique as I did with nnet: a random uniform draw in [0,1] compared against the vector of predicted probabilities for each sign from the model.

Thanks!
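A sketch of that stochastic-imputation step, assuming a fitted MNLogit result res and the puf DataFrame used to fit it; the probability columns are assumed to be ordered by category value (-1, 0, 1):

import numpy as np

probs = np.asarray(res.predict(puf))          # predicted probabilities, one column per sign
cum = probs.cumsum(axis=1)                    # cumulative probabilities per record
u = np.random.uniform(size=len(probs))        # one uniform draw per record
choice = (u[:, None] > cum).sum(axis=1)       # index of the sampled category
imputed_sign = np.array([-1, 0, 1])[choice]   # map back to -1 / 0 / 1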

martinholmer commented 6 years ago

@Abraham-Leventhal said:

I wanted to clarify that the goal with MNLogit is to stochastically impute its predictions for the sign of P22250 (and the other variables) using the same technique as I did with nnet: a random uniform draw in [0,1] compared against the vector of predicted probabilities for each sign from the model.

Yes, it seems to me that that's one of the steps in the imputation process.

Abraham-Leventhal commented 6 years ago

It turns out that in the puf, E01100 is a column of zeros. While I am not well versed in multinomial logit models, I am guessing this is why E01100 breaks mnlogit. So that question has perhaps been answered.

martinholmer commented 6 years ago

@Abraham-Leventhal said:

It turns out that in the puf, E01100 is a column of zeros.

Maybe, but it doesn't look that way to me. Here is what I get:

Matching mrh$ ls -l puf2011.csv 
-rw-r--r--@ 1 mrh  staff  91305290 Jul 23 17:44 puf2011.csv

Matching mrh$ ~/work/OSPC/tax-calculator/csv_vars.sh puf2011.csv | grep E01100
47 E01100

Matching mrh$ awk -F, 'BEGIN{mn=9e9;mx=0}NR==1{next}{t++}$47<mn{mn=$47}$47>mx{mx=$47}END{print t,mn,mx}' puf2011.csv 
163790 0 282700

So among the 163,790 filing units in the raw PUF file, I see a minimum value for E01100 of zero and a maximum value of over one-quarter of a million dollars.

Did I do something wrong?
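For anyone wanting to check in pandas instead, a sketch against the same raw file:

import pandas as pd

puf_raw = pd.read_csv('puf2011.csv')
print(len(puf_raw), puf_raw['E01100'].min(), puf_raw['E01100'].max())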

Abraham-Leventhal commented 6 years ago

@martinholmer No, I did, my mistake! The mystery remains unsolved.

Abraham-Leventhal commented 6 years ago

As I have found that removing E01100 allows mnlogit to perform perfectly I am closing this issue.

martinholmer commented 6 years ago

@Abraham-Leventhal said:

As I have found that removing E01100 allows mnlogit to perform perfectly I am closing this issue.

Given that the definition (in the user documentation) of e01100 is as below, why was it ever considered as an explanatory variable in the CPS imputation equations (since it doesn't exist in any meaningful way in the CPS data)?

[screenshot of the e01100 definition in the user documentation]

The CPS data know nothing about IRS Form 1040 or Schedule D.

Abraham-Leventhal commented 6 years ago

@martinholmer I considered any variable that was in puf and was noted in the documentation as available in taxdata_cps as a potential predictor.