PSLmodels / taxdata

The TaxData project prepares microdata for use with the Tax-Calculator microsimulation project.
http://pslmodels.github.io/taxdata/

Analyzing business income & capital gains variables in PUF2011 for imputation to CPS #221

Closed Abraham-Leventhal closed 6 years ago

Abraham-Leventhal commented 6 years ago

As mentioned in taxdata issue #119, the CPS is missing the following variables:

  - e02000: Total Sch E income or loss (income or loss from rental real estate, royalties, partnerships, S corporations, estates, trusts, and residual interests in REMICs)
  - e26270: Income or loss from a combined partnership or S corporation
  - p22250: Short-term capital gain or loss
  - p23250: Long-term capital gain or loss

I have just concluded an attempt to impute these variables from PUF2011 via an OLS regression using a subset of variables contained in both PUF2011 and the CPS. Unfortunately, the models did not perform well (unrealistic predictions, mean R^2 < 0.2), perhaps due to the large proportion of zeros in the dependent variable columns in PUF2011, or an underlying non-linear relationship between the dependent and independent variables.

I will continue this task by further investigating the variables of interest - e02000, e26270, p22250, p23250 - in the following ways, and I welcome any input into this process:

  1. Explore possible correlations between these and other variables in PUF2011.
  2. Determine the proportion of observations having non-zero values for these variables. Determine the qualities that distinguish these non-zero observations from the rest of the dataset.
  3. Create summary statistics and summary charts for these variables, especially for the non-zero cases, analyze outliers.
  4. Read through this Census Bureau paper for possible insight. Thanks @MattHJensen

@martinholmer @andersonfrailey @codykallen

codykallen commented 6 years ago

@Abraham-Leventhal, thanks for working on this. As you noted, the models did not work well. I believe your diagnosis of the effect of many zeros is likely correct. I would recommend using something besides OLS for the model. A tobit model would presumably be better at dealing with the many zeros as a corner solution. However, I would personally recommend experimenting with a two-stage approach (a rough sketch follows the list below):

  1. Use a binary model (probit or logit) to predict whether the person has any of the relevant income type, regressing on a handful of variables in the PUF and the CPS.
  2. For those with nonzero income (of the relevant type), regress that income on your preferred set of variables in the PUF and the CPS. You may want to experiment with the functional form of the dependent variable to find a good fit.
  3. Use the results of the binary model to assign probabilities of having any of the relevant income to individuals in the CPS. Then use those probabilities to randomly assign whether they have the income or not.
  4. For those assigned as having the relevant income, use the second stage of the model to assign them an amount of this income.
  5. Check how the resulting distributions compare between the CPS and the PUF.
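Here is a rough sketch of that two-stage idea in Python with statsmodels (puf and cps are assumed to be pandas DataFrames already loaded; the target column e02000 and the feature list are placeholders, not a recommended predictor set):

import numpy as np
import statsmodels.api as sm

features = ['e00200', 'e00300']   # illustrative predictors only

# Stage 1: binary model for having any (nonzero) income of this type.
y_any = (puf['e02000'] != 0).astype(int)
X_puf = sm.add_constant(puf[features])
logit = sm.Logit(y_any, X_puf).fit()

# Stage 2: amount model on the nonzero subsample (here, log of positive amounts).
pos = puf['e02000'] > 0
ols = sm.OLS(np.log(puf.loc[pos, 'e02000']),
             sm.add_constant(puf.loc[pos, features])).fit()

# Apply to the CPS: draw nonzero status from the predicted probabilities,
# then assign an amount to the records drawn as having the income.
X_cps = sm.add_constant(cps[features])
rng = np.random.default_rng(0)
has_income = np.asarray(rng.uniform(size=len(cps)) < logit.predict(X_cps))
imputed = np.zeros(len(cps))
imputed[has_income] = np.exp(ols.predict(X_cps[has_income]))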

@martinholmer, @Abraham-Leventhal, does that seem like a reasonable approach to you?

Abraham-Leventhal commented 6 years ago

@codykallen

Yes, that makes sense! Considering that my models were based on observations with non-zero values in PUF2011, it makes more sense to apply their predictions only to those CPS observations that we have reason to believe would also have non-zero values.

In 2 you mention experimenting with the functional form of the dependent variable. Do you mean testing models that imply non-linear relationships? I am entirely new to modeling, so I want to be sure I understand your meaning.

codykallen commented 6 years ago

@Abraham-Leventhal asked:

In 2 you mention experimenting with the functional form of the dependent variable. Do you mean testing models that imply non-linear relationships?

The simplest approach would regress the income types on the independent variables. However, I recommend producing a histogram to see the distribution of the income. If it is very skewed (which is probably the case, as such distributions often have a long right tail), then you may want to regress the log of the income type on the control variables. Of course, this doesn't work if the income type can be negative. But another approach could simply add an extra step in the modeling process: predict whether the filer has nonzero income of this type; predict if the filer has positive or negative income conditional on having nonzero income; model the amount of income conditional on the amount being positive or negative.

This will obviously require plenty of experimentation and trying different approaches. But the best way to start is to visualize what you're trying to model, preferably by histograms, and then figure out what type of model you want to build.
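For example, a quick look at the skew (assuming the PUF is already loaded as a pandas DataFrame named puf; the column shown is just one of the four of interest):

import numpy as np
import matplotlib.pyplot as plt

nonzero = puf.loc[puf['p23250'] != 0, 'p23250']
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(nonzero, bins=100)                       # long tails in levels
axes[0].set_title('non-zero p23250, levels')
axes[1].hist(np.log(nonzero[nonzero > 0]), bins=100)  # positive values on a log scale
axes[1].set_title('positive p23250, log')
plt.show()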

MaxGhenis commented 6 years ago

I might suggest keeping a holdout set to evaluate the model (30% is common), especially if you're trying many models. Then you can calculate something like mean squared error on the holdout set to evaluate each model. Selecting models on in-sample R-squared is likely to overfit.

You could also consider something like random forests to avoid the multiple steps involved in separating out the zero or negative cases. Depending on your software, this is less complicated than it sounds: in R you can basically replace lm with randomForest when fitting your model, after installing the randomForest package. You also don't have to worry about log-transforming when using random forests.
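A minimal sklearn version of the same idea (a sketch; puf is assumed to be a pandas DataFrame and features a list of predictor column names):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    puf[features], puf['p22250'], test_size=0.3, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print('holdout MSE:', mean_squared_error(y_test, rf.predict(X_test)))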

martinholmer commented 6 years ago

Cody has the right ideas. First plot the PUF data to see the distribution you are trying to impute.

The plots will probably suggest starting by estimating a trinomial logit model with the categories being positive income, negative income, and (the omitted category) zero income.

Then for those with positive income estimate an OLS regression. There are likely to be extreme values, so a regression with log income as the dependent variable might fit the data better.

Then convert the negative incomes into losses and estimate an OLS regression for that group, again probably using log loss as the dependent variable.

There’s a good chance that Python will not offer the trinomial logit model. If so, consider using R, the gold standard among open source statistical packages.

Abraham-Leventhal commented 6 years ago

Thank you everyone for your clear explanations of your ideas. I will get to work on implementing them.

martinholmer commented 6 years ago

@MaxGhenis, how would you use a “fitted” random forest to do the stochastic imputation?

MaxGhenis commented 6 years ago

how would you use a “fitted” random forest to do the stochastic imputation?

You'd want to save the random forest model object after fitting it from the PUF, then call its predict function on the CPS data. Cell 9 here shows the methods for sklearn. You can save sklearn models using pickle.

If you prefer a simpler linear model where the code just stores the coefficients, I'd recommend some sort of regularization to avoid overfitting. LASSO is a good approach since it will favor zeroing out less-informative features, so will require less code to make predictions. It's in both sklearn and statsmodels.
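Roughly something like this (a sketch; rf, cps, features, X_train, and y_train are placeholder names for whatever model and data get settled on):

import pickle

from sklearn.linear_model import Lasso

# Persist the fitted model, reload it later, and predict on the CPS.
with open('p22250_model.pkl', 'wb') as f:
    pickle.dump(rf, f)
with open('p22250_model.pkl', 'rb') as f:
    model = pickle.load(f)
cps_pred = model.predict(cps[features])

# Or, for a sparse linear alternative, a LASSO that zeroes out weak features.
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
kept = [name for name, coef in zip(features, lasso.coef_) if coef != 0]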

There’s a good chance that Python will not offer the trinomial logit model.

See https://pythonhosted.org/mord/ for ordinal logistic regression in Python.

martinholmer commented 6 years ago

@MaxGhenis said:

There’s a good chance that Python will not offer the trinomial logit model.

See https://pythonhosted.org/mord/ for ordinal logistic regression in Python.

This is not the kind of statistical model I'm suggesting because almost all the methods of this class are for ordered categories, which is not what @Abraham-Leventhal wants to be using. There is one method that seems to be relevant:

class mord.MulticlassLogistic(alpha=1.0, verbose=0, maxiter=10000)

But there is zero documentation.

I'd be wary of using something that is undocumented. The advantage of R is that it is the premier open-source statistical package because every aspiring statistics professor is contributing to it.

MaxGhenis commented 6 years ago

Good point, I'd thought of negative/zero/positive as an order, but on second thought it's really not.

MaxGhenis commented 6 years ago

Actually sklearn has a multinomial logit: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

martinholmer commented 6 years ago

@MaxGhenis said:

how would you use a “fitted” random forest to do the stochastic imputation?

You'd want to save the random forest model object after fitting it from the PUF, then call its predict function on the CPS data. Cell 9 here shows the methods for sklearn.

But this doesn't answer my question. @Abraham-Leventhal needs to do stochastic imputation with his fitted statistical model. That means not only calculating the fitted value using the model's estimated parameters (b), but also realizing the error term, e, in the regression (Y=Xb+e). How do you do that second part with a fitted randomForest model? Cell 9 in the notebook looks like it is just generating the fitted value (the Xb in a regression). What's the equivalent in a randomForest model of the OLS regression's standard error of e?

MaxGhenis commented 6 years ago

Here are a couple materials on prediction intervals for random forests in Python:

Out of curiosity, how will the interval be used? I thought existing imputations were only point estimates.

martinholmer commented 6 years ago

@MaxGhenis said:

Good point, I'd thought of negative/zero/positive as an order, but on second thought it's really not.

Actually sklearn has a multinomial logit

But when I look at the documentation for that I don't see any mathematical specification of the probabilities and I don't see anything about maximum likelihood estimation of the model's parameters (which makes me suspicious). And I see lots of language about what they're doing that would make me nervous about using that package.

Let me try to be clear about what I (and any well-trained economist) mean by a multinomial logit model whose parameters are estimated with maximum likelihood methods. Look at these Princeton University class notes, in particular section 6.2.3 for the mathematical structure of the model and section 6.2.4 on estimating the parameters of the model using maximum likelihood.

Any package (in Python or any other system) that does not explicitly state it is estimating a model like this is highly suspect.

MaxGhenis commented 6 years ago

Rather, I thought they were point estimates for regressions. I recall some random number generation to allocate zeros in multi-stage imputation, is that what would happen here? That is, based on a record's probability of being zero, assign it zero with some probability, otherwise take the estimate from the OLS regression (and something similar for negatives)? Or is it also randomly chosen within the OLS prediction interval?

MaxGhenis commented 6 years ago

Any package (in Python or any other system) that does not explicitly state it is estimating a model like this is highly suspect.

Agreed, but I think sklearn--one of Python's most-used libraries--has this covered: its full documentation says

The “lbfgs”, “sag” and “newton-cg” solvers only support L2 penalization and are found to converge faster for some high dimensional data. Setting multi_class to “multinomial” with these solvers learns a true multinomial logistic regression model [5], which means that its probability estimates should be better calibrated than the default “one-vs-rest” setting.

Where [5] points to

Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 4.3.4

That textbook's Chapter 4.3.4 goes through the MLE derivations and formulas.

R is excellent too of course, but seems like staying within Tax-Calculator's core language when possible has its advantages. It also seems like the nnet package is the main one for multinomial logit, and its documentation doesn't clearly specify its underlying models.

martinholmer commented 6 years ago

@MaxGhenis said:

R is excellent too of course, but seems like staying within Tax-Calculator's core language when possible has its advantages. It also seems like the [R]nnet package is the main one for multinomial logit, and its documentation doesn't clearly specify its underlying models.

The R nnet package would never be used for what @Abraham-Leventhal wants to do because, as its documentation clearly states, it is: "Software for feed-forward neural networks with a single hidden layer".

[screenshot: R nnet package documentation]
martinholmer commented 6 years ago

@MaxGhenis said:

Any package (in Python or any other system) that does not explicitly state it is estimating a model like this is highly suspect.

Agreed, but I think sklearn--one of Python's most-used libraries--has this covered: its full documentation says

The “lbfgs”, “sag” and “newton-cg” solvers only support L2 penalization and are found to converge faster for some high dimensional data. Setting multi_class to “multinomial” with these solvers learns a true multinomial logistic regression model [5], which means that its probability estimates should be better calibrated than the default “one-vs-rest” setting.

Where [5] points to

Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 4.3.4

That textbook's Chapter 4.3.4 goes through the MLE derivations and formulas.

@MaxGhenis, thanks for the link to the 758-page Bishop book on "Pattern Recognition and Machine Learning." But the task facing @Abraham-Leventhal is not a pattern-recognition or a machine-learning task. He wants to estimate a well-known statistical/econometric model using maximum likelihood. He does not want to rely on software that uses solvers that must "learn" the correct functional form of the statistical model.

@codykallen @andersonfrailey @MattHJensen

martinholmer commented 6 years ago

@MaxGhenis said:

Rather, I thought they were point estimates for regressions. ... Or is it also randomly chosen within the OLS prediction interval?

I would not use the term "randomly chosen", but any sensible imputation process will add a random realization of the regression's estimated error term to the fitted value of the Xb part of the regression, Y = Xb + e.

For imputation tasks like that facing @Abraham-Leventhal there is little danger of overfitting; getting the distribution of imputed values to look like the distribution of the original data requires stochastic imputation using both parts of the estimated regression, the Xb part and the error term part, e.
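Concretely, that second part is just a draw from the estimated error distribution (a sketch; ols_results is a fitted statsmodels OLS results object and X_cps a CPS design matrix, both placeholder names):

import numpy as np

rng = np.random.default_rng(0)
xb = ols_results.predict(X_cps)                   # the fitted Xb part
sigma = np.sqrt(ols_results.scale)                # estimated standard deviation of e
imputed = xb + rng.normal(0.0, sigma, len(xb))    # Y = Xb + e, with e realized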

MaxGhenis commented 6 years ago

thanks for the link to the 758 page Bishop book on "Pattern Recognition and Machine Learning." But the task facing @Abraham-Leventhal is not a pattern-recognition or a machine-learning task.

Machine learning relies on classical statistics, and this book has plenty of it. Here's the relevant section, which estimates the multinomial logit using what I understand to be the standard MLE approach. Do the formulas differ from your understanding of the model?

[screenshot: Bishop, Pattern Recognition and Machine Learning, section 4.3.4]

Most tutorials I found for R multinomial logit use the nnet package, but it looks like the mlogit package uses MLE (per its vignette). Is this what you had in mind?

martinholmer commented 6 years ago

@MaxGhenis said:

Most tutorials I found for R multinomial logit use the nnet package, but it looks like the mlogit package uses MLE (per its vignette). Is this what you had in mind?

Actually, it looks like the R mlogit package estimates what is called a conditional logit model, which is what gave McFadden his Nobel prize in economics. The conditional logit model is a model of individual choice, which is not really what @Abraham-Leventhal wants for this task. A clear explanation of the differences in the two types of models (both of which are often confusingly called multinomial logit models) is in section 6.2 and section 6.3 of these Princeton University class notes.

Looking at the extensive online statistical estimation website maintained at UCLA, there is a worked R example of the trinomial model @Abraham-Leventhal needs to estimate and that example uses the nnet package. There are other ways to do the estimation in R, but this looks to be the simplest.

MaxGhenis commented 6 years ago

For imputation tasks like that facing @Abraham-Leventhal there is little danger of overfitting; getting the distribution of imputed values to look like the distribution of the original data requires stochastic imputation using both parts of the estimated regression, the Xb part and the error term part, e.

I don't follow why overfitting isn't a concern. If the CPS records represent different households than the PUF records, it sounds like a classic out-of-sample prediction problem. Optimizing for in-sample R-squared will overfit predictions, and also produce confidence intervals that are too narrow (so the realizations won't be as dispersed as they should be). For example, if one PUF record happens to have a high Schedule E and also some other specific combination of variables, and multiple CPS records happen to have that specific combination of variables, an overfit imputation could produce excessive Schedule E income by associating them with that specific PUF record's value.

I'd also expect the range of possible values for each record to be asymmetric, non-normal, and nonlinear in its predictors (even after log-transformation). A random forest would let you select a random leaf node as a realization, which would capture asymmetric ranges of predictions with less overfitting than OLS.

This blog post shows how to use model.estimators_ to construct prediction intervals; the same approach can also be used to select random realizations across the predicted CDF. Some of it is outdated, so I updated it to work with new Python 3 libraries here. It also shows how the prediction interval is generally well-calibrated; e.g., in the example, 92.5% of predicted values fell within the predicted 90% confidence interval.
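The core of it is just collecting each tree's prediction (a sketch; rf is a fitted RandomForestRegressor and X_cps an array of CPS predictors):

import numpy as np

# One column of predictions per tree in the forest.
tree_preds = np.column_stack([t.predict(np.asarray(X_cps)) for t in rf.estimators_])

# An empirical 90% prediction interval for each CPS record ...
lower, upper = np.percentile(tree_preds, [5, 95], axis=1)

# ... or a stochastic realization: pick one tree at random for each record.
rng = np.random.default_rng(0)
pick = rng.integers(tree_preds.shape[1], size=tree_preds.shape[0])
realized = tree_preds[np.arange(tree_preds.shape[0]), pick]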

MaxGhenis commented 6 years ago

Here's a random forests prediction of CPS Schedule C (e00900). Training a 100-tree model on 75% of records--using all features except e00900p and e00900s--and testing on the remaining 25% shows these results:

Using the random forest point estimate, the -/0/+ sign of the prediction isn't great at 59.6%, but that's because zeroes get averaged out. When choosing a random single tree, accuracy rises to 94.3%. A trinomial model using sklearn's multinomial LogisticRegression and newton-cg solver is only 87.8% accurate. It's also extremely biased toward zero, predicting only two negatives (vs. 395 actual and 402 predicted from the random forests) and 6.1k positives (vs. 16.9k actual and 17.4k from the random forests).

All models were default (including sklearn's L2 penalty), and while the logistic model could improve with data preprocessing, it would be hard to beat the random forest IMO. I'd expect the regression models within positive and negative to have even worse performance relative to random forests, since the outcome has such a wide range.

Unless there's a strong reason to get linear coefficients from the model for interpretability, this suggests that a random forest model would produce good results--almost certainly better than a series of tuned linear models--and save a lot of analyst time since it removes the need to preprocess and could do well basically out of the box.

MaxGhenis commented 6 years ago

Actually for stochastic imputation, log-loss is a better metric than accuracy of the majority-predicted class. Here random forests outperforms the multinomial regression 0.17 to 0.35.
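For reference, that comparison is only a couple of lines with sklearn (a sketch; y_test holds the actual -1/0/1 signs and the two fitted classifier names are placeholders):

from sklearn.metrics import log_loss

print('random forest log-loss:', log_loss(y_test, rf_clf.predict_proba(X_test)))
print('multinomial logit log-loss:', log_loss(y_test, mn_clf.predict_proba(X_test)))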

Abraham-Leventhal commented 6 years ago

I tried out the trinomial logistic model using R's nnet, here is my code. I used nearly all of the variables that are shared between cps and puf2011 as features, 31 features in total. The model predicts -1, 0 or 1 for each variable of interest corresponding to a negative, 0 or positive value.

I wanted to visualize the data in Python (as I am currently most comfortable with matplotlib) which is why the last command writes the validation dataset to a csv. I created difference columns in the validation dataframe which report prediction minus observation for each variable of interest. Here are histograms visualizing these difference columns for all 4 variables; first for the non-zero observations and second for the zero observations.

The meaning of the x-values in "Prediction - Nonzero Observations," for convenience:

  - 2 = predicted 1 (positive), observed -1 (negative)
  - 1 = predicted 0, observed -1
  - 0 = correct prediction
  - -1 = predicted 0, observed 1
  - -2 = predicted -1, observed 1

[histograms: prediction minus non-zero observations]

As can be seen, for those observations that have non-zero values the model is flawed, especially for E26270 and P22250. Interestingly, it isn't only over-predicting 0 (represented in the -1 and 1 bars), despite the data being in some cases ~70% zero-valued for these variables; it is also quite often predicting the opposite sign.

For the zero observations the model is generally correct.

[histograms: prediction minus zero observations]

Comments on my use of nnet and this validation technique are welcome. I also conducted an analysis of the variables of interest in puf2011, confirming that most of the data are 0 and the non-zero data have long tails. I can post that analysis, which includes histograms of the non-zero data, if anyone is interested.

MaxGhenis commented 6 years ago

@Abraham-Leventhal If this is to be used for stochastic imputation, you'll want to look at the predicted probability of each class (+/0/-) for each record. You should be able to do this by adding type="probs" to your predict function. At that point you can use metrics like log-loss (here's intuition behind that) or Brier score to evaluate the fit.

martinholmer commented 6 years ago

@Abraham-Leventhal, could you provide more information about the estimation results? Can you show us the estimated parameters (and their standard errors) for the trinomial logit model?

Also, as @MaxGhenis suggests, what you're doing with the estimated model is not how it will be used. You want to use the fitted model to estimate the three probabilities for each filing unit and then stochastically impute the "predicted value" (zero, positive or negative) using those probabilities. To do this, convert the three probabilities (which together are called the probability density function, or PDF, in statistics) into a cumulative distribution function (or CDF, as it's called in statistics). Then impute the "predicted value" by generating a uniformly distributed random number in the [0,1] range and using that random number along with the filing unit's CDF to determine the "predicted value". Then show us the actual CPS distribution of zero, positive, and negative values, and show us the distribution of the "predicted values" in the whole sample.
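Concretely, once the fitted model has produced the three probabilities for every filing unit, the imputation step can be vectorized along these lines (a sketch; probs is an (n, 3) array with columns ordered zero, positive, negative):

import numpy as np

rng = np.random.default_rng(0)
cdf = np.cumsum(probs, axis=1)        # convert each unit's PDF into a CDF
u = rng.uniform(size=len(probs))      # one uniform random number per filing unit
# category: 0 = zero, 1 = positive, 2 = negative (clipped to guard against round-off)
prdcat = np.minimum((u[:, None] >= cdf).sum(axis=1), 2)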

This sort of approach is standard, so if you have questions look on the internet first for details.

@andersonfrailey @codykallen

Abraham-Leventhal commented 6 years ago

@martinholmer Thanks, I will get to work on that.

Abraham-Leventhal commented 6 years ago

Trinomial logit model estimated parameters & std errors: display link github link

martinholmer commented 6 years ago

@Abraham-Leventhal provided:

Trinomial logit model estimated parameters & std errors

Avi, there's a lot of information here, which is good, but makes it difficult to focus on methodological issues. My view is that once we have the trinomial logit with positive/negative regressions approach working, we should compare those imputation results with imputation results generated by the random forests methodology.

So, while the comprehensiveness of your work so far is admirable, I think it might be better in the initial phase to focus on just one variable that we want to impute. Probably best to pick out the one variable that you think (based on results so far) is the most difficult to impute. I'll leave that choice up to you.

A second thing. Can you explain the labeling of the three categories (zero, positive, negative)? In the results which category is labeled 0 and which is labeled 1?

And a third thing. Does the R package you're using display the ratio of the estimated parameter and its standard error? If so, please ask it to display that statistic (which might be called a t statistic, but packages vary in their terminology) along with the estimated parameter value and its standard error. If not, can you compute that ratio for each explanatory variable? That ratio is very useful in giving an idea if an explanatory variable is not contributing significantly in explaining the distribution across the three categories. Thanks.

Abraham-Leventhal commented 6 years ago

OK, in that case I would focus on P22250 (short-term capital gains)!

In the results which category is labeled 0 and which is labeled 1?

0 = zero, 1 = positive.

Dataframe containing coefficients, std error, t-stats for P22250 model

martinholmer commented 6 years ago

@Abraham-Leventhal said in taxdata issue #221:

OK, in that case I would focus on P22250 (short-term capital gains)!

In the results which category is labeled 0 and which is labeled 1?

0 = zero, 1 = positive.

Dataframe containing coefficients, std error, t-stats for P22250 model

Thanks so much, this is much more readable. Does the package you're using output the value of the model's log-likelihood function? We will need that later (to do likelihood ratio tests) if we settle on the trinomial logit approach rather than the random forests approach.

I think your next step should be to develop one or more measures of how well the fitted trinomial model does at predicting the (zero, positive, negative) categories. Perhaps others who are following this issue will suggest other ways to measure the goodness of fit, but we can start with something basic: what fraction of the PUF observations (239002 by my count) have their category predicted correctly.

For each observation, let P0 denote the fitted trinomial model's probability of the observation being in the zero category and let P1 denote the probability of being in the positive category. Then if u is a uniform random number in the [0,1] range, the following logic determines the predicted category:

if u < P0:
    prdcat = 0
elif u < P0 + P1:
    prdcat = 1
else:
    prdcat = 2

If you have an actual category variable coded the same way, then just count the number of observations whose predicted category value equals its actual category value.

Notice that when you do this the result depends on the stream of uniform random numbers that you use to impute the predicted category. In order to always get the same results, you need to set the random number generator's seed before generating the uniform random numbers. Check the R help about how to do that.

Then tabulate the percent of the observations whose predicted and actual categories are the same using several different seeds. How much does the percent of correct imputations vary when you change the seed?

Once you finish this, you'll be able to try out the random forests technique on short-term capital gains and we'll see how it works and what kind of correct-imputation percents it generates. Does this make sense as a way to settle on the best methodology to use in your project?

@codykallen @andersonfrailey @MaxGhenis

Abraham-Leventhal commented 6 years ago

@martinholmer

Does the package you're using output the value of the model's log-likelihood function?

Do you mean the predicted likelihoods for each possible outcome (-1, 0, 1) for each observation? If so then predict(model, data, type = 'prob') does the job.

This would also allow me to implement your suggested method of stochastic imputation - using a uniform random number in [0,1] and the model's CDF for each observation. I will go ahead and do this and wait on your response regarding the model's output.

martinholmer commented 6 years ago

@Abraham-Leventhal said:

Does the package you're using output the value of the model's log-likelihood function?

Do you mean the predicted likelihoods for each possible outcome (-1, 0, 1) for each observation? If so then predict(model, data, type = 'prob') does the job.

No, No. Sorry for the confusion. The log-likelihood is a single value (one number) for the whole estimated (or "trained") model. It is the value of the function that is being maximized during the estimation iterations, which is why the estimation method is called "maximum likelihood".

Looking at the documentation for the nnet R package I see this:

[screenshot: nnet package documentation describing the returned object's deviance component ("minus twice log-likelihood")]

So, it seems as if the returned nnet object contains what you want in the deviance attribute. Why is the package returning "minus twice log-likelihood"? The answer will be obvious when you read about how to conduct a likelihood ratio test. But we will be doing that kind of test only if you decide to use the trinomial logit model in your project.

@Abraham-Leventhal continued:

This would also allow me to implement your suggested method of stochastic imputation - using a uniform random number in [0,1] and the model's CDF for each observation. I will go ahead and do this and wait on your response regarding the model's output.

Maybe I'm confused but I don't think so.
These statistical packages are sometimes difficult to use correctly. I suggest you confirm that you are using this package correctly by estimating a model that is so simple that you know beforehand what the answers should be. Do this trial run by specifying a model without any explanatory variables (that is, only the intercept terms get estimated). The model fitted probabilities for every PUF filing unit should be the same: P[zero] = 0.7981, P[pos] = 0.0933, and P[neg] = 0.1086.

I'll leave it as an exercise for you to figure out from these probabilities the values of the two intercept terms that will be estimated by the R package.

I get these expected probabilities by simple tabulation of the puf.csv file:

iMac:Tax-Calculator mrh$ ./csv_vars.sh puf.csv | grep p22250
71 p22250

iMac:Tax-Calculator mrh$ awk -F, 'NR==1{next}$71==0{n++}END{print n}' puf.csv
190757

iMac:Tax-Calculator mrh$ awk -F, 'NR==1{next}$71>0{n++}END{print n}' puf.csv
22288

iMac:Tax-Calculator mrh$ awk -F, 'NR==1{next}$71<0{n++}END{print n}' puf.csv
25957

iMac:Tax-Calculator mrh$ awk -F, 'NR==1{next}{n++}END{print n}' puf.csv
239002
MaxGhenis commented 6 years ago

Accuracy over the stochastically-imputed labels can hide favoritism of the majority class; for example, if it's zero 90% of the time, a model that assigns 100% probability to the zero class will be 90% accurate. This model will look OK on paper, but fail to produce the appropriate distribution of other classes.

Log-loss--defined as "negative log-likelihood of the true labels given a probabilistic classifier’s predictions" and therefore what the model is optimizing for within-sample--doesn't have this problem. I'd encourage also looking at this in the test set to compare models.

Abraham-Leventhal commented 6 years ago

@martinholmer The residual deviance for my initial P22250_model that's trained on 80% of the data is 150597.1

Do this trial run by specifying a model without any explanatory variables (that is, only the intercept terms get estimated). The model fitted probabilities for every PUF filing unit should be the same: P[zero] = 0.7981, P[pos] = 0.0933, and P[neg] = 0.1086.

Intuitively I expect running the model without explanatory variables would mean inputting 'y ~ 1' as the regression equation, and some research supports this. Here is how I made the model and retrieved the probabilities for each outcome. (P22250_dum is a column containing the sign -1/0/1 for each observation of P22250, stored as factors for the logistic regression.)

P22250_model <- multinom(P22250_dum ~ 1 , data = puf2011)
predict(P22250_model, puf2011, type = 'prob')

The model has a residual deviance of 248060.4. The intercepts are 0: 1.6431718 and 1: -0.1696247. The predicted probabilities are P[zero] = 0.7372, P[pos] = 0.1203, P[neg] = 0.1425.

It would appear that I am not using the package correctly given that my predicted probabilities are not the same as yours.

Regarding using a CDF and a random number in [0,1] to predict the categorical values. Based on your code block I created this function to implement this stochastic imputation method, where 'runif' = random uniform in [0,1], 'probs' = a vector containing the predicted probability for negative / zero / positive outcomes for P22250, and probs[[1]] = P(neg), probs[[2]] = P(zero), probs[[3]] = P(pos):

stoch_imp <- function(runif, probs){
    if (runif < probs[[1]]) {
        return (-1)
    }
    else if (runif < (probs[[2]] + probs[[1]])) {
        return (0)
    }
    else {
        return (1)
    }
}

I hope what I was trying to do makes sense here. I can also link the notebook.

Abraham-Leventhal commented 6 years ago

I used this method to stochastically impute predicted values into the validation subset of puf. I performed the imputation 8 times using different seed values each time to create different sets of random uniform numbers.

I ran these imputations only on rows with non-zero P22250 values. I found the prediction accuracy was stable, ranging from 26.45% to 27.71%, with an average of 27.2%

The model's accuracy is more stable for the validation data-set as a whole (both zero and non-zero observations), with an average of 68.5% accuracy.

martinholmer commented 6 years ago

@Abraham-Leventhal said:

The model ... intercepts are 0: 1.6431718 and 1: -0.1696247. The predicted probabilities are P[zero] = 0.7372, P[pos] = 0.1203, P[neg] = 0.1425.

It would appear that I am not using the package correctly given that my predicted probabilities are not the same as yours. ... I can also link the notebook.

Showing the notebook would be a good idea. Otherwise, we don't know exactly what you've done.

martinholmer commented 6 years ago

@Abraham-Leventhal said:

I used this method to stochastically impute predicted values into the validation subset of puf. I performed the imputation 8 times using different seed values each time to create different sets of random uniform numbers.

I ran these imputations only on rows with non-zero P22250 values.

Why did you do that?

I found the prediction accuracy was stable, ranging from 26.45% to 27.71%, with an average of 27.2%

The model's accuracy is more stable for the validation data-set as a whole (both zero and non-zero observations), with an average of 68.5% accuracy.

What do you mean by "more stable"?

Again, people would be better able to help you if you would show your work.

martinholmer commented 6 years ago

@Abraham-Leventhal, Here is a worked example of doing what we've been talking about. First the Python script and then the results from python mnlogit.py:

The script:

"""
Using statsmodels 0.9.0 to estimate trinomial logit model for p22250 (stcg).
"""
from __future__ import print_function
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# read puf.csv data, eliminate most, create p22250 categories in cat variable
# Note that: when stcg == 0, cat = 1
#            when stcg  > 0, cat = 2
#            when stcg  < 0, cat = 0 (the omitted or normalized category)
puf = pd.read_csv('puf.csv')
puf.drop(columns=[c for c in puf.columns if c != 'p22250'], inplace=True)
puf['cat'] = np.where(puf['p22250'] == 0, 1, np.where(puf['p22250'] > 0, 2, 0))
print(puf.describe())

# estimate multinomial logit model and show fitted model results
model = smf.mnlogit(formula='cat ~ 1', data=puf).fit()
print(model.summary())
print('Fitted probabilities (which are the same for each PUF filing unit):')
print(model.predict(puf.iloc[[0]]))

The results:

             p22250            cat
count  2.390020e+05  239002.000000
mean  -2.497154e+03       0.984649
std    4.059647e+05       0.449027
min   -1.249000e+08       0.000000
25%    0.000000e+00       1.000000
50%    0.000000e+00       1.000000
75%    0.000000e+00       1.000000
max    3.941000e+07       2.000000
Optimization terminated successfully.
         Current function value: 0.642305
         Iterations 5
                          MNLogit Regression Results                          
==============================================================================
Dep. Variable:                    cat   No. Observations:               239002
Model:                        MNLogit   Df Residuals:                   239000
Method:                           MLE   Df Model:                            0
Date:                Tue, 19 Jun 2018   Pseudo R-squ.:               3.167e-11
Time:                        10:46:17   Log-Likelihood:            -1.5351e+05
converged:                       True   LL-Null:                   -1.5351e+05
                                        LLR p-value:                       nan
==============================================================================
     cat=1       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.9946      0.007    301.489      0.000       1.982       2.008
------------------------------------------------------------------------------
     cat=2       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.1524      0.009    -16.688      0.000      -0.170      -0.134
==============================================================================
Fitted probabilities (which are the same for each PUF filing unit):
          0        1         2
0  0.108606  0.79814  0.093254
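
(For reference, these estimated intercepts are just the log-odds of each category relative to the omitted negative-value category: log(0.79814/0.108606) ≈ 1.9946 and log(0.093254/0.108606) ≈ -0.1524, matching the coefficient table above.)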
Abraham-Leventhal commented 6 years ago

@martinholmer

I used this method to stochastically impute predicted values into the validation subset of puf. I performed the imputation 8 times using different seed values each time to create different sets of random uniform numbers. I ran these imputations only on rows with non-zero P22250 values.

Why did you do that?

I varied the seed values based on your suggestion from a few days ago.

How much does the percent of correct imputations vary when you change the seed?

Regarding testing the model on non-zero rows, I did that in addition to testing the model on the full validation dataset because I was curious whether the "majority favoritism" that @MaxGhenis mentioned might exaggerate the model's predictive power. If the model is significantly less accurate in predicting the non-majority classes, I thought that could be of interest.

What do you mean by "more stable"?

The results varied less seed-to-seed when testing the model on the full validation dataset, compared to testing it on the non-zero observations, I assume because the former dataset is larger.

Here is a link to my work: P22250 model & testing. It includes the intercept-only model and the 8 imputation trials for both the full and non-zero validation datasets. The results are:

This is slightly different from the last results because I had previously failed to use the set.seed() functionality in the cell in which I used sample() to create the training/validation sub-sets of the data. Thus, when running it today, the results were slightly different. I have corrected that, and now the notebook returns the same results on each run.

MaxGhenis commented 6 years ago

You can calculate stable predicted accuracy without a random seed as 1 - MeanAbsoluteDeviation, e.g.:

| prob(1) | outcome | absolute deviation |
|---------|---------|--------------------|
| 0.6     | 1       | 0.4                |
| 0.3     | 0       | 0.3                |
| 0.5     | 1       | 0.5                |
This would have mean absolute deviation of 0.4, and therefore predicted accuracy of 0.6. But this is not a proper scoring rule because of the majority-class issue discussed earlier.


In case you'd like to compare log-loss between the multinomial logit and random forests, you can adapt my notebook without much effort, just:

Then you should be able to run the whole thing as-is. Happy to hop on a call if either of you would like; you can reach me at mghenis@gmail.com.

Abraham-Leventhal commented 6 years ago

Update:

Here is a cleaner nnet imputation notebook. It contains a summary of both the full and intercept-only nnet models of P22250's sign, and 1 imputation followed by an accuracy calculation.

nnet accuracy = 68.5%

Here is that nnet model's log-loss computation (exported to Python so the same function is used to score both nnet & random forests; the header for the actual P22250 signs didn't display, sorry).

nnet log-loss = 0.584

This is a random forest imputation notebook, containing mostly @MaxGhenis ' work. It uses the same training/testing datasets as the nnet model (puf80%training and puf20%testing, in the Data cells). This notebook predicts P22250 both as a sign, and as a continuous value, calculating the MAE and RMSE for the continuous value, and the % accuracy, confusion matrix, and log-loss for the predicted sign. It also contains a multinomial imputation for comparison, using sklearn's multinomial logistic regression model.

All-trees random forest accuracy = 73.4%; all-trees log-loss = 0.615

Multinomial accuracy = 77%; multinomial log-loss = 0.81

@martinholmer @andersonfrailey Do you think it would be wise to include E00650, F2441, and N24 as predictors despite their overlap with other current predictor variables? E00650 is currently excluded as it reports qualified dividends which are reported in E00600, but tax units' composition of dividends might be a meaningful signal. F2441 and N24 respectively report 'Number of child/dependent-care qualifying persons' and 'Number of children who are Child-Tax-Credit eligible.' I hadn't included them as they overlap with EIC, which reports the number of earned income tax credit qualifying children.

However, based on what I read about these variables (EIC, Child Tax Credit, and claiming dependents), it may be useful to include all of them, as they differ in the following ways. F2441 includes older dependent parents, who aren't reported in EIC or N24. N24 only reports children under 17 (they can be up to 24 for EIC and F2441 if not disabled, older if disabled). Lastly, a non-zero EIC implies the tax unit filed for the earned income tax credit, which is not otherwise reported in our predictors.

I went ahead and ran the nnet, random forest notebooks with E00650, F2441 and N24 included as predictors, these are the results:

nnet (full notebook): accuracy = 69%; log-loss (log loss computation) = 0.569

random forest (100 estimators): all-trees accuracy = 73.2%; all-trees log-loss = 0.607

Multinomial accuracy = 77.6%; multinomial log-loss = 0.871

At this stage it appears that:

I will follow up with a confusion matrix for the nnet model so we can compare it side-by-side with random forests', and a run of the random forests model using 1000 estimators - on Max's suggestion.

codykallen commented 6 years ago

@Abraham-Leventhal asked:

Do you think it would be wise to include E00650, F2441, and N24 as predictors despite their overlap with other current predictor variables? E00650 is currently excluded as it reports qualified dividends which are reported in E00600, but tax units' composition of dividends might be a meaningful signal.

Tax units' composition of dividends is probably not a meaningful signal. Qualified dividends (e00650) are those paid by a publicly traded corporation and received by a person who has owned the stock for at least 60 days over a 121-day period. Nonqualified dividends (e00600 - e00650) are those paid to someone who has not met the holding period requirement or are paid by a tax exempt organization or a closely held C corporation. Qualified dividends are eligible for the preferential investment tax schedule (top statutory rate of 20%), but nonqualified dividends are taxed as ordinary income (top statutory rate of 37%), although both are also subject to the net investment income tax (additional 3.8% levy on "unearned" income above a limit).

Short-term capital gain or loss (p22250) is the capital gain or loss on an asset owned for less than one year. If the asset is owned for longer, then it is eligible for the preferential investment tax rates. Otherwise, it is taxed like nonqualified dividends.

Fundamentally, there are 2 roles e00600 could play. A higher value of e00600 may imply that the tax unit owns more stocks, and thus receives more dividends and has higher capital gains (short-term and long-term). However, the coefficient estimate is negative; this likely results because the payment of dividends is subtracted from net capital gain (in that an additional dollar of dividends paid reduces the market capitalization by a dollar, ignoring signaling effects).

Personally, I would recommend including a new variable, defined as e00600 - e00650, as well as e00600. This new variable should be more relevant for estimating p22250, as the receipt of nonqualified dividends is mostly due to not meeting the holding period requirement, which may be particularly relevant to short-term capital gains realizations.

Unfortunately, there is a great deal of complexity involved in actual short-term gain or loss realizations. In general tax advice terms, if an asset has provided a positive short-term gain, one should continue to hold it until the one-year vesting requirement is met, to ensure that it is eligible for the preferential rate. If an asset has provided a short-term loss, one should sell it and use the short-term loss to offset any short-term gains one has already realized or expects to realize, as the tax shield from this loss is more valuable when applied to offset short-term gains than when used to offset long-term gains.

Hope this background helps.

Abraham-Leventhal commented 6 years ago

@codykallen How does including both e00650 and e00600 - e00650 sound to you?

Below is Max's code to create confusion matrices, applied to both nnet and random forests' (with 100 estimators) predictions: (count is a column of 1's)

df_actual_and_predicted_signs.pivot_table(index='actual', columns='predicted', 
                             values='count', aggfunc=sum, margins=True)

nnet:

[screenshot: nnet confusion matrix]

random forests:

[screenshot: random forests confusion matrix]

Both distributions are quite good, but nnet's is nearly perfect.

codykallen commented 6 years ago

@Abraham-Leventhal asked:

How does including both e00650 and e00600 - e00650 sound to you?

That seems fine to me.

MaxGhenis commented 6 years ago

@codykallen suggested

including a new variable, defined as e00600 - e00650, as well as e00600

Predictive models are indifferent between features and linear combinations of features. For example, if the current coefficients on e00600 and e00650 are 2 and 1.5, respectively, then replacing e00650 with a new feature defined as e00600 - e00650 yields a coefficient of -1.5 on the new feature and 3.5 on e00600, and the predictions are identical. Nonparametric methods like random forests will also produce identical predictions (though they may differ slightly because the change could affect the random number generation).
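A quick way to see this with a toy example (random data and statsmodels; nothing here comes from the PUF):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)                  # stand-in for e00600
x2 = rng.normal(size=1000)                  # stand-in for e00650
y = 2 * x1 + 1.5 * x2 + rng.normal(size=1000)

X_a = sm.add_constant(np.column_stack([x1, x2]))        # original features
X_b = sm.add_constant(np.column_stack([x1, x1 - x2]))   # e00650 replaced by e00600 - e00650
pred_a = sm.OLS(y, X_a).fit().predict(X_a)
pred_b = sm.OLS(y, X_b).fit().predict(X_b)
print(np.allclose(pred_a, pred_b))          # True: identical predictions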

That including the new features improves model accuracy for nnet and rf suggests they're valuable. It's likely worse with the sklearn logit because it's overpenalizing, which can be addressed using cross-validation.

MaxGhenis commented 6 years ago

@Abraham-Leventhal asked

How does including both e00650 and e00600 - e00650 sound to you?

Including e00600, e00650, and e00600 - e00650 will cause the model to break due to perfect collinearity. I'd recommend sticking with the raw e00600 and e00650 predictors to make the model clean, but if you'd like you can verify that predictions are identical (or virtually identical because of random number generation) when replacing e00600 with e00600 - e00650.

Abraham-Leventhal commented 6 years ago

@codykallen @MaxGhenis I added an e00600 - e00650 column; results were identical for nnet, and virtually identical for random forest (accuracy increased from 73.4% to 73.7%, log-loss decreased from 0.607 to 0.606).

Abraham-Leventhal commented 6 years ago

I ran the random forests notebook with 1000 estimators and, oddly, the model performed poorly compared to 100. Link

Accuracy = 73.1% Log loss = 0.656 Confusion matrix: screen shot 2018-06-26 at 3 34 06 pm