PSLmodels / taxdata

The TaxData project prepares microdata for use with the Tax-Calculator microsimulation project.
http://pslmodels.github.io/taxdata/

Imputation of e02000, e26270, p23250, p22250 from PUF to CPS #267

Open Abraham-Leventhal opened 5 years ago

Abraham-Leventhal commented 5 years ago

Hi, I am creating this as a general-purpose issue for this imputation task. As a preliminary step, here is my understanding of our proposed methodology for this task (using P22250 as a test case).

  1. Create a model to predict the sign of P22250 in cps (this has been done)
  2. For observations in cps predicted to have a zero, impute a zero.
  3. For observations in cps predicted to have negative or positive P22250 values, create separate models trained on the negative or positive puf P22250 data, then use those models to impute continuous values to their respective cps observations.

I am comparing models for step 3 of this task. Taking the advice of @martinholmer, given the long tails of the response variable, I am testing an OLS regression on log(P22250) (log(-P22250) for negative data). Furthermore, I will be using the same 80/20 training and testing split method used to compare models in step 1. That is, for this portion of the task I will be working with 4 subsamples of the puf dataset created by splitting the data into negative/positive, then splitting each subsample into training/testing. I will train the negative P22250 model on the negative training subsample and test it on the negative testing subsample, and work analogously with the positive model.

Finally, I plan on scoring the models' predictions for the testing datasets using MSE on the suggestion of @MaxGhenis, but I am of course open to other suggestions.

MaxGhenis commented 5 years ago

I'd recommend splitting into training/testing first, then splitting based on predicting sign. This will more easily enable you to compare performance between this branched model and the random forests.

MSE is a reasonable first step, but since we care about the distribution and the accuracy of the sign, I recommend the quantile loss function. I shared more detail on this at https://github.com/open-source-economics/taxdata/issues/221#issuecomment-401876770.

Abraham-Leventhal commented 5 years ago

@MaxGhenis OK thanks, I will implement that. I am going to post an initial question regarding my OLS model soon as well.

Abraham-Leventhal commented 5 years ago

The first issue I've run into using an OLS model that regresses on log(P22250) is that the model has so far been producing absurd predictions (in the billions) for P22250, and I wanted to ask for input on my code regarding that issue.

You can see my code at the bottom of this comment (under "Click to expand"). The predictors I use in that code are those that I thought would be good predictors and had p-values < 0.10. I also investigated more predictor groups in this notebook: OLS overestimation with various predictor groups. There I found that including any of the continuous predictors is what creates the massive overestimations, but lacking those predictors the model greatly underestimates P22250. Also, the problem persists (often getting worse) after altering the seed value to create new training/testing samples.

Note: pos_train and neg_train are the training datasets sampled from the positive P22250 and negative P22250 puf subsamples. pos_test and neg_test are the positive and negative testing datasets from the corresponding subsamples.

Some ideas as to what's causing the problem:

  1. The problem might be collinearity between our predictors, as noted in the warning about the model's large condition number (1.14e+07). However, an analysis of the predictors' correlation coefficients revealed that most have correlation coefficients of less than 0.45, even with their most-correlated co-predictor, and the predictors I use in this model are all below that line, as can be seen at the bottom of this notebook under "Check general correlations": puf correlations analysis. Perhaps the presence of some correlation among all of the predictors is still too great an issue.

  2. I also considered that the divide-by-zero error I'm getting from the code that creates the log_P22250 column might be causing the issue. However, after working around that error by using a loop to create the log_P22250 column (rather than np.where()), I found the model produces the same results, even though the error no longer appears.

  3. Perhaps the issue is that I am doing the second log-transformation incorrectly, that is, the transformation of the model's predictions from log(P22250) back to a real P22250 value. The correct method seems simple - take the predicted log-values and transform them using np.exp() as I do in this code:

    log_predictions_pos = fit_pos.predict(pos_test[pred_pos])
    predictions_pos = np.exp(log_predictions_pos)

    where log_predictions_pos are the model's predictions for log(P22250) for the positive testing data, and predictions_pos should be the correctly re-transformed predictions - but researching this specific question was surprisingly fruitless, so I am not sure if this is the correct method.

  4. Lastly, I tested an OLS model in which I also log-transformed the predictors themselves, based on some research which suggested this might be useful - especially if such a transformation makes those predictors normally distributed. This has had some promising results which I will soon include in another comment, unless the responses to this comment indicate a different path entirely.

Click to expand code

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn import metrics
from sklearn.model_selection import train_test_split

# Remove aggregate rows, replace NaN with 0
puf = pd.read_csv('puf2011.csv')
puf = puf[(puf['RECID'] != 999996) &
          (puf['RECID'] != 999997) &
          (puf['RECID'] != 999998) &
          (puf['RECID'] != 999999)]
puf = puf.fillna(0)

# MARS to k-1 dummies
puf[['MARS2', 'MARS3', 'MARS4']] = pd.get_dummies(puf['MARS'], drop_first=True)

# These vars are combined in cps
puf['E19800_E20100'] = puf['E19800'] + puf['E20100']

# All vars shared between puf and cps except E00650 (colinear w/E00600)
# and E01100 (crashing mnlogit)
potential_pred = [
    'DSI', 'EIC', 'MARS2', 'MARS3', 'MARS4', 'XTOT', 'E00200', 'E00300',
    'E00400', 'E00600', 'E00800', 'E00900', 'E01400', 'E01500', 'E01700',
    'E02100', 'E02300', 'E02400', 'E03150', 'E03210', 'E03240', 'E03270',
    'E03300', 'E17500', 'E18400', 'E18500', 'E19200', 'E19800_E20100',
    'E20400', 'E32800', 'F2441', 'N24'
]

# Log of response
puf['log_P22250'] = np.where(puf['P22250'] == 0, 0,
                             np.sign(puf['P22250'])*np.log(abs(puf['P22250'])))

keep = ['P22250', 'AGIR1', 'log_P22250'] + potential_pred
puf = puf[keep]

np.random.seed(100)
train, test = train_test_split(puf, test_size=0.2)

# Subsamples of train/test where P22250 > 0 or < 0 (pos or neg) for 2nd stage imputation
pos_train = train.copy()[train.copy()['P22250'] > 0]
neg_train = train.copy()[train.copy()['P22250'] < 0]
pos_test = test.copy()[test.copy()['P22250'] > 0]
neg_test = test.copy()[test.copy()['P22250'] < 0]

pred_pos = [
    'DSI', 'EIC', 'MARS2', 'MARS3', 'MARS4', 'XTOT', 'E00200', 'E00300',
    'E00400', 'E00600', 'E03240', 'E18400', 'E18500'
]

endog_pos = pos_train['log_P22250']
exog_pos = pos_train[pred_pos]
fit_pos = sm.OLS(endog_pos, exog_pos).fit()
summary_pos = fit_pos.summary()

# Predictions of log(P22250), transforming back to P22250
log_predictions_pos = fit_pos.predict(pos_test[pred_pos])
predictions_pos = np.exp(log_predictions_pos)
max_predictions_pos = round(predictions_pos.max(), 2)

# actual_pos was not defined in the original snippet; it is the observed
# P22250 in the positive testing subsample
actual_pos = pos_test['P22250']
MSE_pos = round(metrics.mean_squared_error(actual_pos, predictions_pos), 2)

print('Largest predicted P22250: $' + '{:,}'.format(max_predictions_pos), '\n',
      'MSE is: ' + str(MSE_pos), '\n',
      summary_pos)
```

Abraham-Leventhal commented 5 years ago

Based on further reading it seems that to justify log-transforming our predictors I should be looking at how that transformation changes the model's residuals - either making them symmetrical or not changed systematically by the value of the response variable. I will follow with this residual analysis of the above model, and the log-transformed-predictor model soon.

In the mean time I have two methodological questions:

  1. As the 2nd stage models are intended to predict the non-zero data alone, they will be trained and tested on relatively small subsamples of puf. Specifically, pos_train and pos_test have ~15,800 and 3,907 observations respectively, and neg_train and neg_test have 18,600 and 4,744 observations respectively. Is this small enough to motivate evaluating the models not on their performance on one training/testing subsample, but on their average performance across multiple, so that more puf data have the chance to be training/testing data?

  2. E02300 (unemployment insurance benefits) and E03210 (student loan interest deduction) seem like potentially useful signals, but only to indicate whether someone is unemployed or still paying off student loans. I would imagine that those who are either unemployed or paying off student loans would tend to have less income from capital gains or corporations/partnerships. That being said, do we really gain more information by knowing whether they have larger or smaller student loans or unemployment benefits? I don't think we do, except insofar as having greater unemployment benefits indicates that one's wages are higher, but that information should be captured in E00200. I would propose making these variables into dummy variables: 0 if 0, and 1 if > 0 (both are only 0 or positive); see the sketch just below.
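
For concreteness, a minimal sketch of the 0/1 recoding proposed in item 2, using a hypothetical stand-in DataFrame rather than the actual puf data:

```
import pandas as pd

# Hypothetical values standing in for the puf columns.
puf = pd.DataFrame({'E02300': [0, 1200, 0], 'E03210': [0, 0, 2500]})

# Recode each column to a 0/1 indicator for "any positive amount".
for col in ['E02300', 'E03210']:
    puf[col + '_any'] = (puf[col] > 0).astype(int)

print(puf)
```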

Abraham-Leventhal commented 5 years ago

Looking at the residuals it doesn't appear that log-transforming the predictors is having much of a beneficial effect, despite the predictions being somewhat more realistic.

Regressing on log(P22250) without log-transforming predictors

The largest predictions are so outrageous that most of the residual plot is incomprehensible.


Removing all predictions greater than $100,000,000 (8 in total) tells a similar story: most large residuals are concentrated in observations with low P22250 values.


And removing all predictions larger than $10,000,000 (14 in total) again shows that the smaller P22250 values are generally overestimated.


Regressing on log(P22250) having log-transformed the continuous predictors.

Here we use the same predictors; the only difference in the code is that I log-transformed the continuous predictors. The residuals, while much smaller, show the same overall pattern - lots of overestimation clustered around the smaller actual P22250 values.


I would not think this supports log-transforming the predictors, and unless there's something wrong in how I am using the OLS model or the log transformations, I don't think this analysis supports using an OLS model to predict P22250's value at all.

hdoupe commented 5 years ago

@Abraham-Leventhal asked:

  1. As the 2nd stage models are intended to predict the non-zero data alone, they will be trained and testing on relatively small subsamples of puf. Specifically, pos_train and post_test have ~ 15,800 and 3,907 observations respectively, and neg_train and neg_test have 18,600 and 4,744 observations respectively. Is this small enough to motivate evaluating the models not on their performance on one training/testing subsample, but on their average performance across multiple, so that more puf data have the chance to be training/testing data?

By "average performance across multple", do you mean Cross-Validation?

If so, I think this is plenty of data to do that. One way to do CV is:

  1. Choose how many folds you want. It is common to use K=10, but this is somewhat arbitrary.
  2. For each fold k, train your model on the remaining samples. *
  3. Evaluate predictions on the held-out data (fold k). Save the MSE (or whichever metric you are using).
  4. Choose the model with the best average metric score across the K CV iterations.

* Some sources recommend that you do predictor selection in this step. If you do predictor selection outside of this step, then theoretically your model is biased, because you evaluated your predictors on the basis of the entire sample, so some "knowledge" of the fold of samples that is excluded in this step is already baked into the process. However, this particular problem requires substantial human evaluation to select the predictors, so I think it's unreasonable to do so in step 2 for this particular problem. For reference, see section 7.10.2 in The Elements of Statistical Learning.

However, my only experience with this is from an ML class I took a couple of years ago. @MaxGhenis, does this jibe with your understanding of CV?

Abraham-Leventhal commented 5 years ago

@hdoupe

By "average performance across multiple", do you mean Cross-Validation?

Cross-validation requires that you run K training/testing iterations for each model you are evaluating, such that every observation is used once and only once in a testing dataset across the iterations. Maybe we should do that, but I was leaving open the possibility of doing multiple training/testing iterations without satisfying the strict requirements of K-fold cross-validation - perhaps by just altering the seed value before re-partitioning puf and running the tests again, and doing that 3 or 4 times. Perhaps puf is large enough that this would work. Or maybe the K-fold method is a better idea.

  • Some sources recommend that you do predictor selection in this step...

I don't entirely follow what you're saying here. I do see one way that having to choose both predictors and models adds a layer of complexity (for instance, if some models perform better on some predictor sets, aren't we implicitly favoring one model by choosing our predictors for cross-validation based on their performance in that model in a preliminary prediction task?), but my knowledge of model bias is a little too shallow to grasp the rest of your point.

hdoupe commented 5 years ago

I don't think the requirements are too strict, and I'm not sure I'd choose an ad hoc method over it. The steps seem straightforward:

  1. shuffle data
  2. divide the data set into K partitions (observations 0 through N/K, N/K + 1 through 2N/K, 2N/K + 1 through 3N/K, ...)
  3. train on all but the k-th data partition. Evaluate on the k-th partition
  4. average the results
  5. select model with best average score

Depending on how your model is set up, you may be able to use sklearn's CV functionality.
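
For example, a minimal sketch of 10-fold CV with sklearn's cross_val_score on made-up data (X and y here are stand-ins for the predictor columns and the outcome, not the actual puf variables):

```
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data.
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] * 2 + rng.normal(size=500)

model = RandomForestRegressor(n_estimators=50, random_state=0)
# 10-fold CV scored with negative MSE (sklearn maximizes scores).
scores = cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')
print('Average MSE across folds:', -scores.mean())
```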

Does that make more sense?

RE the second part: I guess the whole idea is to build a model so that it performs well even on data that it hasn't seen yet. You simulate that by removing a partition of your data set and evaluating your model on it. But, if you choose your predictors based off of the entire data set, including the partitioned-out part, then it's like your model has had a "glimpse" of the partitioned-out data. This biases your test error rate down (makes your model appear slightly better than it actually is), since the model already has some information on that held-out partition of the data via the predictor selection process.

In cross-validation, the idea is to repeat this train/test process K times and average the results in hopes of mitigating the effects of fortunate or unfortunate train/test splits. So, if you have already selected your predictors beforehand, then your model has some information on the holdout fold, even though it appears to only be trained on the other K-1 folds.

However, if the predictor selection process is resource intensive (i.e. you examining lots of correlations between predictors and transformations of predictors), it's probably not worth going through the trouble of screening predictors for each iteration of cross-validation.

Does that make any more sense?

Abraham-Leventhal commented 5 years ago

@hdoupe

Sure, the method makes sense and if you think K-fold cross validation would be best I would go along with that.

But, if you choose your predictors based off of the entire data set, including the partitioned out part, then it's like your model has had a "glimpse" of the partitioned out data.

Ah this makes a lot of sense.

However, if the predictor selection process is resource intensive (i.e. you examining lots of correlations between predictors and transformations of predictors), it's probably not worth going through the trouble of screening predictors for each iteration of cross-validation.

This is probably true as well. Of course, I'd have to get past these issues with OLS before getting to that point.

Given the distribution of the non-zero P22250 data (and this is true for the other imputation variables as well), I was thinking that an exponential model might be of use; what do you think of this? Here is a screenshot of the distribution of 95% of the P22250 data that I made earlier in my internship: [histogram of P22250]

These tails continue left and right quite far, and we have two major negative outliers with P22250 values of about -124,000,000 and -60,000,000 with the next-lowest being about -30,000,000 at which point the gaps between data points shrink quite a bit.

MaxGhenis commented 5 years ago

CV is helpful for variable selection and other tuning parameters; for example, random forests do something like CV as part of the algorithm, and there are prebuilt CV methods for hyperparameter tuning in models like the Lasso (see LassoCV in sklearn). In general, sklearn is built much more for predictive modeling than statsmodels, so it has more resampling and evaluation techniques built in. Since you're using a standard OLS, though, there's no hyperparameter tuning going on, so I don't think CV is necessary (you should expect better predictions from regularized models like lasso or ridge).
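
For reference, a minimal LassoCV sketch on made-up data (X and y are hypothetical stand-ins for the training predictors and outcome):

```
import numpy as np
from sklearn.linear_model import LassoCV

# Hypothetical data.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=1000)

# LassoCV picks the regularization strength alpha by internal cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
print('Chosen alpha:', lasso.alpha_)
print('Coefficients:', lasso.coef_)
```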

That said, I'd still recommend assessing the data on a holdout set for a final comparison of models. This lets you experiment with the training set in multiple ways (CV, bootstrap, etc.) while having a clear way to make the final judgment. Modeling competitions like those from Kaggle operate in this way.

Abraham-Leventhal commented 5 years ago

@MaxGhenis Thanks for the suggestions, I'm working on comparing LassoCV and your earlier random forests model for this stage of the project (continuous predictions).

Question: As both lasso and random forests contain prebuilt CV (or CV-like) methods would it be valid to train them on the entire (positive or negative portions of the) puf data set then test them on the (positive or negative portions of the) holdout test dataset? If the models are doing their own CV does that mitigate the danger of allowing them to get a glimpse of the testing data during the training stage?

The other option would be to do what I am doing with the OLS models: train the positive-data models on the positive portion of the training subset (pos_train), test those models on the positive portion of the testing subset (pos_test), and so on.

Thanks again for your feedback

Abraham-Leventhal commented 5 years ago

Assuming that we'd go with the latter option (training both models only on the training dataset), I tried out both random forests and LassoCV in this notebook and it appears LassoCV outperforms random forests (especially for the negative data) though the error is quite large for both: rf / LassoCV notebook

I'm transforming MARS to k-1 dummy variables because it doesn't appear that scikit-learn will do so automatically if MARS is a string, and based on my research one is supposed to use only k-1 dummies when there are k categories. Please correct me if I am wrong in this regard.

MaxGhenis commented 5 years ago

You're comparing two models:

  1. Random forests (just a single model)
  2. Trinomial logit + two linear models for positive and negative logit predictions

In each case you're evaluating on the test set, like this: [diagram]

If you want to segment your evaluation into positive/negative/zero, I'd recommend doing so on consistent segments of the testing data, e.g. by true positive/negative/zero. In the most straightforward case you're comparing records for which both model 1 and model 2 correctly predict the sign, and comparing the error on those. There will also be cases where the models disagree with each other or with the actual sign, so segmenting on true sign will be fairest.

I'm transforming MARS to k-1 dummy variables because it doesn't appear that scikit-learn will do so automatically if MARS is a string, and based on my research one is supposed to use only k-1 dummies when there are k categories. Please correct me if I am wrong in this regard.

Right, sklearn doesn't automatically transform as statsmodels does. pd.get_dummies is helpful. k-1 dummies avoids collinearity.
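
For example, a tiny sketch of the k-1 dummy encoding with pandas (hypothetical MARS values):

```
import pandas as pd

# drop_first=True keeps k-1 dummies, avoiding the dummy-variable trap (collinearity).
df = pd.DataFrame({'MARS': [1, 2, 3, 4, 2]})
dummies = pd.get_dummies(df['MARS'], prefix='MARS', drop_first=True)
print(dummies)
```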

Abraham-Leventhal commented 5 years ago

OK, I had thought that since mnlogit produced better categorical predictions for P22250 than random forests (using log-loss as our metric), we'd only consider mnlogit as the sign-prediction model (unless we tested more categorical prediction models). Thus, if random forest were to be used at all, it would be only for the second stage of the task - predicting P22250's continuous value for those observations predicted by mnlogit to be non-zero.

It seems that you're suggesting that random forests should only be considered as a model if it is used for both stages, is that correct?

In the most straightforward case you're comparing a record of which both model 1 and model 2 correctly predict the sign, and you're comparing the error. There will also be cases where models disagree with each other or the actual sign, so segmenting on true sign will be fairest.

This is an interesting issue that I had not considered. What do you mean by "segmenting on true sign will be fairest"? Is that to say we should only compare MSEs for those observations where both methods get the sign correct?

Thanks

MaxGhenis commented 5 years ago

Logit + 2 RFs could be a third model, but RF alone is worth testing and I'd personally start with a single RF vs. logit+LM. The single RF will perform better with more data.

The most important thing IMO is deciding how to decide between models. Whatever the metric is, it should apply to the entire test set. If it's an average of quantile loss functions, it's possible that the single RF would outperform the logit+2RFs, even if the single RF is worse at the -/0/+ classification; this could also be true of MSE.

My "segmenting on true sign will be fairest" regarded your comparison of MSE between RFs based on predicted sign. I was suggesting that if you're interested in MSE on the test set, you can also look at particular slices of the test set to see where models do better/worse, but a set of metrics without a predefined weighting function won't give a clear answer on which model to choose.

Abraham-Leventhal commented 5 years ago

@MaxGhenis

Logit + 2 RFs could be a third model, but RF alone is worth testing and I'd personally start with a single RF vs. logit+LM. The single RF will perform better with more data.

This makes sense.

Here is an idea! If we are to try logit + RF, instead of training two RFs on the pos/neg data, train just one RF. Then run the logit model and impute 0 for all observations predicted to be zero. Then, for the logit's negative predictions, impute the mean of the negative RF estimators for that observation (or a random RF estimator), and for the positive predictions, impute the mean of the positive RF estimators for that observation. This would probably lower the MSE but would avoid the "zero-fuzzing" that occurs when using the average of the entire row for all observations.

MaxGhenis commented 5 years ago

This would probably lower the MSE but would avoid the "zero-fuzzing" that occurs when using the average of the entire row for all observations.

You should select a random tree's prediction from the random forest, not the average prediction. This makes the "stochastic" part of stochastic imputation built-in and avoids zero-fuzzing.

For a fair comparison, you should compare that against a random draw from the prediction interval of the linear models. This simulates how the real data would be imputed.
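
As a rough sketch of how those two stochastic draws might sit side by side (all data and names here are hypothetical stand-ins, and the residual-standard-error draw is a simple stand-in for sampling from the OLS prediction interval):

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data standing in for the training/testing subsamples.
rng = np.random.RandomState(0)
X_train = pd.DataFrame(rng.normal(size=(500, 3)), columns=['a', 'b', 'c'])
y_train = X_train['a'] * 3 + rng.normal(size=500)
X_test = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])

# Stochastic RF imputation: one randomly chosen tree's prediction per test record.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
tree_preds = np.column_stack([t.predict(X_test.values) for t in rf.estimators_])
pick = rng.randint(tree_preds.shape[1], size=tree_preds.shape[0])
rf_draw = tree_preds[np.arange(tree_preds.shape[0]), pick]

# Stochastic OLS imputation: point estimate plus a normal draw scaled by the
# residual standard error (a crude stand-in for a prediction-interval draw).
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
mean_pred = ols.predict(sm.add_constant(X_test))
ols_draw = mean_pred + rng.normal(scale=np.sqrt(ols.mse_resid), size=len(mean_pred))
```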

I'd hold off on adding complexity to the RF model until there's a clear case that random forest by itself is sufficiently problematic. As you mentioned in an email, log-loss isn't actually much worse when selecting different seeds, and adding complexity adds room for overfitting (and just modeling errors).

Deep quantile regression is another model which, on a theoretical basis, should produce the best predictions for this use case.

Abraham-Leventhal commented 5 years ago

You should select a random tree's prediction from the random forest, not the average prediction. This makes the "stochastic" part of stochastic imputation built-in and avoids zero-fuzzing.

For a fair comparison, you should compare that against a random draw from the prediction interval of the linear models. This simulates how the real data would be imputed.

I am still a bit unclear on the motivation behind inserting randomness, which underlies some of my incorrect assumptions going into this process (directly imputing predicted categories in stage 1, and doing the same for continuous values in stage 2, rather than using a stochastic process).

Martin mentioned in the prior issue that it would make the distribution of data more realistic, which makes sense. However, you've also mentioned that it would help deal with overfitting, which is not so clear to me.

Either way, in the short term, I will go about the following:

Thanks again max!

MaxGhenis commented 5 years ago

I am still a bit unclear on the motivation behind inserting randomness, which underlies some of my incorrect assumptions going in to this process (directly imputing predicted categories in stage 1, doing the same for continuous values in stage 2, rather than using a stochastic process).

Stage 1 also has randomness: rather than selecting the class with the highest probability, you've been selecting a random number and selecting based on how that number falls with respect to the probabilities of each predicted class. Selecting from the CDF in stage 2 is the same idea, just from the continuous angle.
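
For reference, a minimal sketch of that CDF-style class draw, with made-up class probabilities in the (neg, zer, pos) order used elsewhere in the thread:

```
import numpy as np

# Hypothetical predicted class probabilities for three records.
probs = np.array([[0.2, 0.5, 0.3],
                  [0.1, 0.8, 0.1],
                  [0.4, 0.3, 0.3]])
classes = np.array(['neg', 'zer', 'pos'])

rng = np.random.RandomState(0)
u = rng.uniform(size=probs.shape[0])      # one uniform draw per record
cdf = probs.cumsum(axis=1)                # per-record cumulative probabilities
# Index of the first CDF value exceeding the draw (clamped for float round-off).
idx = np.minimum((u[:, None] > cdf).sum(axis=1), len(classes) - 1)
print(classes[idx])
```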

The advantage of randomness is getting the appropriate spread of predictions. Without randomness, you'd predict something like the mean (point estimate), which would be relatively close to zero for all observations. While that's best for each individual estimate, it produces a set of predictions that's too narrow: for example, if you had a reform that only taxed high values of P22250, or had progressive brackets of it, you'd underestimate the impact if not also pushing some tax units toward the high range of their prediction interval.

However, you've also mentioned that it would help deal with overfitting, which is not so clear to me.

Did I imply that somewhere? This was not my intent. It's just about realistic data.

Unfortunately I may have led you astray with sklearn, for which I'm having no luck finding prediction intervals. They've explicitly said it's out of scope. So you may need to return to statsmodels after all, which has a get_prediction function that includes prediction intervals.
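
For example, a small sketch of statsmodels' get_prediction on toy data (the data and names below are hypothetical); summary_frame exposes the obs_ci_lower/obs_ci_upper prediction-interval bounds:

```
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data; in practice the fit would be one of the pos/neg OLS models.
rng = np.random.RandomState(0)
X = pd.DataFrame({'x': rng.normal(size=200)})
y = 2 * X['x'] + rng.normal(size=200)
fit = sm.OLS(y, sm.add_constant(X)).fit()

# get_prediction returns both confidence and prediction (obs) intervals.
pred = fit.get_prediction(sm.add_constant(X.iloc[:5]))
print(pred.summary_frame(alpha=0.05)[['mean', 'obs_ci_lower', 'obs_ci_upper']])
```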

That said, if the OLS point estimate is significantly less accurate than the RF, it's unlikely that the quantiles would do better (if anything I'd expect the RF quantiles to be better than OLS, even if they had the same point estimate MSE). So you might be able to save yourself that work.

Here's a sample implementation of the quantile loss function.

MaxGhenis commented 5 years ago

Here's an example using sklearn for prediction intervals. It's not clear whether it works for regularized models like Lasso.

Abraham-Leventhal commented 5 years ago

Stage 1 also has randomness

Right, I was at least aware that I was inserting randomness into stage 1 using a runif + CDF to impute P22250's sign, but I've only now realized that this applies to both stages, and why: it's about getting a realistic distribution.

That said, if the OLS point estimate is significantly less accurate than the RF, it's unlikely that the quantiles would do better (if anything I'd expect the RF quantiles to be better than OLS, even if they had the same point estimate MSE).

Yes, so far my non-stochastic imputations of P22250's continuous value using OLS and LassoCV (regressing on P22250, or log(P22250), or log(P22250) + log(continuous predictors)) have been orders of magnitude worse than RF in MSE. It's impressive that RF is able to do that when it's trained on data that are (in P22250's case) 65% zero, without simply predicting almost entirely near-zero values. Check out RF's predictions of P22250 over AGI vs. the actual puf data (this is for the testing dataset); it's quite good, though I should mention I cut out the 3 most-negative values in the real testing data, which RF missed entirely (with around 20, 35, and 125 million in losses respectively):

[scatter plot: RF predictions vs. actual P22250 over AGI]

Also thanks for the quantile loss / prediction interval links

martinholmer commented 5 years ago

@Abraham-Leventhal, I see that you're having problems in your regression with log(gain) for the p22250 variable. Most of the commentary here in issue #267 is impossible to follow because the comments rarely show what you're actually doing. But in this comment you do provide a link to a notebook where you're estimating a log(gain) regression and that notebook looks, in part, like this:

[screenshot of the notebook's log(gain) regression code]

Your selection of explanatory variables (the right-hand-side variables in the regression) is unconventional.

First, it is conventional to include a constant term in the regression. In fact, it is so conventional that the very first thing the statsmodels documentation shows you is how to add a constant term to the list of regression variables. So at this Linear Regression documentation page we have the following example:

[screenshot of the statsmodels Linear Regression example]
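
The documented pattern is roughly the following (a sketch along the lines of the statsmodels example, not the exact screenshot):

```
import numpy as np
import statsmodels.api as sm

# Add a constant column before fitting so the regression has an intercept.
nsample = 100
x = np.linspace(0, 10, nsample)
X = np.column_stack((x, x ** 2))
X = sm.add_constant(X)          # prepends a column of ones
beta = np.array([1, 0.1, 10])
y = np.dot(X, beta) + np.random.normal(size=nsample)
results = sm.OLS(y, X).fit()
print(results.params)
```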

I would be interested in seeing what results you get when you include a constant term in your regression.

Second, the EIC variable is categorical (like the MARS variable), so you should convert EIC into a set of dummy variables (omitting the first category) just like you did with the MARS variable.

MaxGhenis commented 5 years ago

Second, the EIC variable is categorical (like the MARS variable), so you should convert EIC into a set of dummy variables (omitting the first category) just like you did with the MARS variable.

See @Abraham-Leventhal's https://github.com/open-source-economics/taxdata/issues/244#issuecomment-404278599 on EIC being continuous.

Abraham-Leventhal commented 5 years ago

@martinholmer My mistake re: the constant term, thanks for catching that! I will get to that when I am back on my work laptop on Monday and post my results and my code.

Most of the commentary here in issue #267 is impossible to follow because the comments rarely show what you're actually doing.

In the two comments where I discuss scripts directly (OLS comment - see raw code at bottom of comment, which you linked, and this one comparing RF and LassoCV) I provide my code. Otherwise I didn't provide code where we discussed general methodology/stats or where I just wanted to show graphs about which I did not have specific questions. Let me know if you would like to see the code behind anything else I have posted here.

Second, the EIC variable is categorical (like the MARS variable), so you should convert EIC into a set of dummy variables (omitting the first category) just like you did with the MARS variable.

It doesn't appear that EIC is categorical, based on the PUF booklet: [excerpt from the PUF booklet describing EIC]

Abraham-Leventhal commented 5 years ago

Here is a notebook comparing random forests to OLS models using RMSE. My code is also included at the bottom of this comment. I test OLS models regressing on log(P22250), both without transforming the predictors and with log-transforming the continuous predictors. I made sure to include a constant term in the OLS models, and to distinguish between RF's overall RMSE and its RMSE in predicting the positive/negative data. Doing so reveals that OLS outperforms RF on the positive/negative data under RMSE when regressing on log(P22250) with log-transformed continuous predictors.

I will follow up with comparisons using quantile loss, and other analyses of these models such as the distribution of their predictions over AGI which @andersonfrailey thought would be an important metric.

You also might have noticed that RF's log-loss in this notebook is better than the older number we were getting in this notebook. This is important because it was that score we used to determine that nnet outperformed RF as a categorical predictor, and thus that mnlogit should be considered for that task. It turns out that the method of calculating RF's log-loss used here returns different results depending on the random seed used to create the train/test partition, and RF thus sometimes outperforms mnlogit in log-loss. In some trials I found that mnlogit's log-loss varies little across random seeds, hovering around 0.58, while RF's varies between 0.57 and 0.71. I will provide the code demonstrating this in a future comment, or in the internship-end summary that I give to Anderson. You can also see it by changing the seed value in Cell [2].

Click to expand full code (without some of the summary data included in the notebook)

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels import regression
from sklearn import ensemble
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection
from sklearn.model_selection import train_test_split

# Data
# Remove aggregate rows, replace NaN with 0
puf = pd.read_csv('puf2011.csv')
puf = puf[(puf['RECID'] != 999996) &
          (puf['RECID'] != 999997) &
          (puf['RECID'] != 999998) &
          (puf['RECID'] != 999999)]
puf = puf.fillna(0)

# Constant column
puf['constant'] = 1

# MARS to K - 1 dummies
puf[['MARS2', 'MARS3', 'MARS4']] = pd.get_dummies(puf['MARS'], drop_first=True)

# E19800 and E20100 combined in CPS
puf['E19800_E20100'] = puf['E19800'] + puf['E20100']

# Categorical dependent variable for 1st stage
puf['sign'] = np.where(puf['P22250'] == 0, 'zer',
                       np.where(puf['P22250'] > 0, 'pos', 'neg'))

# Log response column. When x < 0, result is -log(-x)
puf['log_P22250'] = np.where(puf['P22250'] == 0, 0,
                             np.sign(puf['P22250'])*np.log(abs(puf['P22250'])))

# All variables shared between puf and cps, except for E00650 (colinear w/E00600)
predictors = [
    'DSI', 'EIC', 'MARS2', 'MARS3', 'MARS4', 'XTOT', 'E00200', 'E00300',
    'E00400', 'E00600', 'E00800', 'E00900', 'E01400', 'E01500', 'E01700',
    'E02100', 'E02300', 'E02400', 'E03150', 'E03210', 'E03240', 'E03270',
    'E03300', 'E17500', 'E18400', 'E18500', 'E19200', 'E19800_E20100',
    'E20400', 'E32800', 'F2441', 'N24', 'E01100'
]

# Log columns for continuous predictors. When x < 0, result is -log(-x)
discretes = ['DSI', 'EIC', 'MARS2', 'MARS3', 'MARS4', 'XTOT', 'F2441', 'N24']
logs = []
for i in predictors:
    if i not in discretes:
        puf['log_' + i] = np.where(puf[i] == 0, 0,
                                   np.sign(puf[i])*np.log(abs(puf[i])))
        logs.append('log_' + i)

keep = ['RECID', 'AGIR1', 'sign', 'P22250', 'log_P22250', 'constant'] + predictors + logs
puf = puf[keep]

np.random.seed(100)
train, test = train_test_split(puf.copy(), test_size=0.2)

# Sub-df's where P22250 != 0, and is pos or neg
pos_train = train.copy()[train.copy()['P22250'] > 0]
neg_train = train.copy()[train.copy()['P22250'] < 0]
pos_test = test.copy()[test.copy()['P22250'] > 0]
neg_test = test.copy()[test.copy()['P22250'] < 0]

# Random Forests
# 1-stage prediction, 100 estimators
N_ESTIMATORS = 100
rf = ensemble.RandomForestRegressor(n_estimators=N_ESTIMATORS,
                                    min_samples_leaf=1,
                                    random_state=3,
                                    verbose=True,
                                    n_jobs=-1)  # Use maximum number of cores.
rf.fit(train[predictors], train['P22250'])

# rf_preds = array of estimators
rf_preds = []
for estimator in rf.estimators_:
    rf_preds.append(estimator.predict(test[predictors]))
rf_preds = np.array(rf_preds).transpose()  # One row per record.

# Validation
# Log-loss
# We can calculate the RF model's predicted probability for each sign (and thus
# its log-loss) using the % of estimators predicting that sign for each observation.
# Note: metrics.log_loss() assumes that the order of the columns of predicted
# probabilities corresponds to their categories' alphabetical order. Our categories
# are 'neg', 'zer' and 'pos', which have the alphabetical order 'neg', 'pos', 'zer',
# thus they appear in that order in rf_pred_proba as [preds_neg, preds_pos, preds_zer]
preds_neg = np.sum(rf_preds < 0, axis=1) / 100
preds_zer = np.sum(rf_preds == 0, axis=1) / 100
preds_pos = np.sum(rf_preds > 0, axis=1) / 100
rf_pred_proba = list(map(list, zip(*[preds_neg, preds_pos, preds_zer])))
metrics.log_loss(test['sign'], rf_pred_proba)

# Continuous prediction
# Random estimator selected so that imputation is stochastic
rand_col = np.random.randint(N_ESTIMATORS, size=rf_preds.shape[0])
random_tree = rf_preds[np.arange(rf_preds.shape[0]), rand_col]
pred_random_tree = pd.DataFrame({'actual': test['P22250'],
                                 'actual_sign': test['sign'],
                                 'pred': random_tree})
pred_random_tree['error'] = pred_random_tree.pred - pred_random_tree.actual
pred_random_tree['pred_sign'] = np.where(pred_random_tree['pred'] == 0, 'zer',
                                         np.where(pred_random_tree['pred'] > 0, 'pos', 'neg'))
pred_random_tree['correct_sign'] = (
    pred_random_tree.actual_sign == pred_random_tree.pred_sign)
pred_random_tree['count'] = 1

# RMSE on whole test set
pred_random_tree.error.pow(2).mean() ** 0.5
# RMSE on positive data
pred_random_tree[pred_random_tree['actual'] > 0]['error'].pow(2).mean()**0.5
# RMSE on negative data
pred_random_tree[pred_random_tree['actual'] < 0]['error'].pow(2).mean()**0.5

# OLS
# Positive data
# Regressing on log(P22250)
# None of the non-zero observations of E01100 made it into training data, so
# pos_train['E01100'] is just a column of zeroes, and is thus excluded.
# Predictors included are those with p values <= 0.1
ols_pos_predictors = [
    'constant', 'DSI', 'EIC', 'XTOT', 'E00200', 'E00300', 'E00400', 'E00600',
    'E01400', 'E18400', 'E18500', 'E19200', 'E19800_E20100', 'E20400', 'E32800'
]
ols_pos_fit = sm.OLS(pos_train['log_P22250'], pos_train[ols_pos_predictors]).fit()

# RMSE is massive
ols_pos_pred = np.exp(ols_pos_fit.predict(pos_test[ols_pos_predictors]))
(pos_test['P22250'] - ols_pos_pred).pow(2).mean()**0.5

# Regressing on log(P22250), using log-transformed continuous predictors
# Predictors included are those with p values <= 0.1
ols_logpos_predictors = [
    'constant', 'DSI', 'MARS3', 'N24', 'log_E00200', 'log_E00300', 'log_E00400',
    'log_E00600', 'log_E01500', 'log_E02400', 'log_E03210', 'log_E03270',
    'log_E03300', 'log_E17500', 'log_E18400', 'log_E19200', 'log_E20400',
    'log_E32800'
]
ols_logpos_fit = sm.OLS(pos_train['log_P22250'], pos_train[ols_logpos_predictors]).fit()

# RMSE is lower than random forest's positive predictions
ols_logpos_pred = np.exp(ols_logpos_fit.predict(pos_test[ols_logpos_predictors]))
(pos_test['P22250'] - ols_logpos_pred).pow(2).mean()**0.5

# Negative data
# Regressing on log(P22250)
# Predictors with p value <= 0.1
ols_neg_predictors = [
    'constant', 'DSI', 'EIC', 'MARS3', 'E00200', 'E00300', 'E00400', 'E00600',
    'E01400', 'E01500', 'E02100', 'E02400', 'E03210', 'E03270', 'E18400',
    'E18500', 'E19200', 'E20400', 'E32800', 'F2441'
]
ols_neg_fit = sm.OLS(neg_train['log_P22250'], neg_train[ols_neg_predictors]).fit()

# RMSE is even larger than the positive model regressing on
# non-log-transformed continuous predictors
ols_neg_pred = -np.exp(-ols_neg_fit.predict(neg_test[ols_neg_predictors]))
(neg_test['P22250'] - ols_neg_pred).pow(2).mean()**0.5

# Regressing on log(P22250), using log-transformed continuous predictors
# Predictors included are those with p values <= 0.1
ols_logneg_predictors = [
    'constant', 'DSI', 'EIC', 'MARS3', 'XTOT', 'F2441', 'log_E00200',
    'log_E00300', 'log_E00400', 'log_E00600', 'log_E01500', 'log_E01700',
    'log_E03270', 'log_E03300', 'log_E17500', 'log_E18400', 'log_E19200',
    'log_E19800_E20100', 'log_E20400'
]
ols_logneg_fit = sm.OLS(neg_train['log_P22250'], neg_train[ols_logneg_predictors]).fit()

# RMSE is again much better when regressing on log(predictors)
ols_logneg_pred = -np.exp(-ols_logneg_fit.predict(neg_test[ols_logneg_predictors]))
(neg_test['P22250'] - ols_logneg_pred).pow(2).mean()**0.5
```

MaxGhenis commented 5 years ago

Thanks Avi, does this chart represent the results correctly?

| Method | RMSE on full test data | RMSE on positive test data | RMSE on negative test data |
| --- | --- | --- | --- |
| RF random tree (not point estimate) | 886,796 | 1,227,639 | 999,927 |
| OLS log outcome, nonlog predictors | 19,746,052,052 | 890,112 | 23,007,880,474,031 |
| OLS log outcome, log predictors | Not produced | Not produced | 602,248 |

Can you use the RF predict function instead of the random tree? This will produce the point estimate, which would be apples-to-apples with the OLS models (it will have much better MSE as well). The OLS analog of the random tree would be a random point from each observation's predicted CDF. This will be useful for quantile loss, but less so for MSE.

I suggested in https://github.com/open-source-economics/taxdata/issues/267#issuecomment-408935003 building a full 2-stage model to compare against RF. I think that would be the most beneficial next step. At this point the only valid comparison is 886,796 to 19,746,052,052; segmenting this by the true label is only helpful if the pos/0/neg classifier is perfect.

I'd also recommend a different transformation to deal with zeros or negative values. The one in the notebook (np.where(x == 0, 0, np.sign(x) * np.log(abs(x)))) maps the same value for, say, 0.5 and -2 (-0.69), and maps zero above this (0). The simplest approach is adding the feature's minimum value plus one to the feature, then log-transform that. A more robust approach could be this.

Abraham-Leventhal commented 5 years ago

Good idea to make a chart. I made some corrections and added cell references:

(these data are from this OLS vs RF notebook)

| Method | RMSE on full test data | RMSE on positive test data | RMSE on negative test data |
| --- | --- | --- | --- |
| RF random tree (not point estimate) | 886,796 (cell 9) | 1,227,639 (cell 10) | 999,927 (cell 11) |
| OLS log outcome, nonlog predictors | Not produced | 19,746,052,052 (cell 13) | 23,007,880,474,031 (cell 17) |
| OLS log outcome, log predictors | Not produced | 890,112 (cell 15) | 602,248 (cell 19) |

Regarding the point estimate vs random tree yes I can do that, that makes sense. Though, long-term I understand the goal would be to use a random tree (if RF) or random point from the CDF (if OLS) so that the imputation is stochastic, correct?

I suggested in #267 (comment) building a full 2-stage model to compare against RF. I think that would be the most beneficial next step.

OK, just to clarify, the best method of testing the branched approach would be to first ensure we're using the same imputation method as we do in RF (either point estimate, or random point/tree), then:

Regarding the logs: thanks for pointing that out! Here's what seems to be the core of the R code from the linked "robust method:"

z <- (x - min(x)) / (max(x) - min(x)) * 2 - 1
z <- z[-min(z)]
z <- z[-max(z)]
t <- atanh(z)

x would be the predictor column to be transformed, t would be the final transformed output. My question is, what are the middle two lines doing? They appear to squash z to a scalar when I test it in R, making t <- atanh(z) throw an exception.

Thank you very much for your detailed input

MaxGhenis commented 5 years ago

long-term I understand the goal would be to use a random tree (if RF) or random point from the CDF (if OLS) so that the imputation is stochastic, correct?

Yes, though since this will be challenging with the two-stage linear method, better to check that the MSEs are comparable first.

Your steps LGTM. I'd think about it as creating a new sort of predict function (you could call it something like predict_2stage), which takes a record and produces a prediction from the two-stage model. This will first assign -/0/+ based on a random number and the CDF, then a value within -/+ based on the OLS models. Unlike RF, even for calculating something like a point estimate, this is stochastic due to the -/0/+ labeling. Once you build this function, you can pass the test set and compare RMSE to RF.

You're right on the SO answer, which is erroneous. A correct approach would be something like atanh(((x - min(x)) / (max(x) - min(x)))) but this yields infinity for the minimum and maximum, so you'd want to add a buffer. For this data it might also make sense to make it symmetric to sit along zero.

While this appears to have good statistical properties, I can't find a real name for it (it's not quite the Fisher transformation as the author suggests), so I'd probably just do the more common log(x + min(x) + 1).

Abraham-Leventhal commented 5 years ago

This will first assign -/0/+ based on a random number and the CDF, then a value within -/+ based on the OLS models. Unlike RF, even for calculating something like a point estimate, this is stochastic due to the -/0/+ labeling.

If I understand you correctly, when I am testing a full 2-stage model using a stochastic sign prediction, the final prediction will be stochastic even if I use the point estimate from the OLS models rather than the get_prediction function?

so I'd probably just do the more common log(x + min(x) + 1).

OK, I will implement this and share the results.

Abraham-Leventhal commented 5 years ago

Also, with regards to log-transforming our predictors, I understand that one indication that log transformation might be useful is if it makes your predictors normally distributed.

Here is a notebook where I compare the distributions of our continuous predictors before and after log-transforming them using log(x + sign(min(x))*min(x) + 1). I multiply min(x) by sign(min(x)) to ensure all the inputs to log() are positive: log predictors notebook.

(please disregard the repeated E02100 non-log/log charts; I don't know why they were duplicated)

This would indicate why our predictions are better when log-transforming some of our predictors, but also that other predictors might not benefit from being log-transformed. I would also be open to adding a residual analysis to this notebook, as I understand it's the residuals that may be more important to this question of log-transforming predictors.

MaxGhenis commented 5 years ago

when I am testing a full 2-stage model using a stochastic sign prediction, the final prediction will be stochastic even if I use the point estimate from the OLS models rather than the get_prediction function?

That's what I was thinking, but there are two other options:

  1. Split up each observation into 3 observations: -/0/+, weighted by the probability of that class.
  2. Take the expected value of the observation using the above splitting: E(x | x < 0) * P(x < 0) + E(x | x > 0) * P(x > 0).

I wouldn't expect wildly different MSEs between the approaches, so unless others have a preference, I'd go with whatever's easiest.

Thanks for the log predictors notebook, really helpful visualization. None of the features look significantly less normal without the log-transformation, so I'd go ahead and log-transform all (except low counts you're already appropriately excluding like EIC).

I miswrote earlier with log(x + min(x) + 1), I meant log(x - min(x) + 1). Compared to using sign, this avoids shifting the distribution further right when min(x) > 0. Here's a quick illustration of the methods.
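
A tiny sketch of that transform and its inverse, on made-up values:

```
import numpy as np

# Hypothetical column including negative values.
x = np.array([-5000.0, 0.0, 250.0, 1000000.0])

shift = x.min()
log_x = np.log(x - shift + 1)        # forward: log(x - min(x) + 1), always defined
x_back = np.exp(log_x) + shift - 1   # inverse transform recovers the original values

assert np.allclose(x, x_back)
```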

Abraham-Leventhal commented 5 years ago

@MaxGhenis

As I am ending this internship next week Aug 14, I will be summarizing my work over the summer in a markdown file that contains things we've learned about the data and the various models (mainly RF / mnlogit+OLS) that I've tested, as well as some recommendations going forward. That file will include your recommendations to look at: quantile loss, deep quantile regression, LassoCV, and the need to decide on a model comparison metric.

That being said, given the short time left, I will probably not be able to investigate all of those options. This post I think shows the issues with just using RMSE.


Here is the notebook which: (I hope) correctly log-transforms the columns (thanks for your illustration!), uses rf.predict() as opposed to a random tree, and compares RF to a full 2-stage mnlogit + OLS imputation, so the comparison should be apples-to-apples: RF.predict() vs mnlogit+OLS using correctly log-transformed predictors.

Below are the results plus cell references:

| Method | RMSE |
| --- | --- |
| RF (point estimate) | 411,117 (cell 6) |
| OLS log outcome, log predictors | 389,051 (cell 16) |
| Just predicting zero | 385,182 (cell 18) |
| Just predicting the mean | 385,168 (cell 19) |

These models were run on the puf data without two low outliers, with -124M and -60M in P22250 respectively. They were causing the negative OLS model to have an R-squared of 0.000 and only 2 significant predictors. Removing them improved RMSE for both models.

As you can see, mnlogit + OLS beats RF in RMSE. Imputing a 0 beats both methods, and imputing the mean (-3,291) performs better than imputing a 0. I would think this indicates that RMSE is not the best metric, as we wouldn't prefer a dataset with all-zero capital gains, or all -3,291, over one with some nuance that captures the fact that some households in the 0 and 20+ AGI ranges will have extreme values.

Cells 20-23 explore the relationship between P22250 and AGI in the actual data and in the models. Each set of graphs uses a different y-scale; the green lines on the mnlogit graphs indicate where the scale will be on the next graph. RF appears to be the best model by these graphs, but I would like to do some more analysis and perhaps figure out how to add density shading to make the graphs more meaningful. What's notable is that mnlogit + OLS isn't predicting any values greater than about 600,000 or less than around -400,000.

Let me know what you think, particularly any analyses you think would reveal why mnlogit + OLS is doing better in RMSE. Is it just that it's predicting smaller values, while RF's mistakes carry a higher cost in RMSE (especially when it gets the sign wrong)?


Also, the log-transformation code I used is below for reference, in case there's an error (X is any given column) (cell 2)

    if puf[X].min() >= 1:
        puf['log_' + X] = np.log(puf[X])
    else:
        puf['log_' + X] = np.log(puf[X] - puf[X].min() + 1)

And the back-transformation on the resulting log predictions (cell 14):

    predictions = np.exp(log_predictions) + puf['P22250'].min() - 1

martinholmer commented 5 years ago

Avi @Abraham-Leventhal, I want to thank you for all your work on taxdata issues #221 and #261. All the descriptive statistics and comparison of different estimation methods will be extremely valuable for whomever continues this imputation work. Best wishes for the next school year.

MaxGhenis commented 5 years ago

I second @martinholmer's comment: you made significant strides in the taxdata imputation space, and your work will be valuable beyond these specific ones into any imputations in the project. I'm glad we got to collaborate, and hope you have a great upcoming school year. You can always continue some of this outside AEI!

Reviewing your notebook, one thing I wasn't sure of is how the zero-predictions were modeled. Cell 14 appears to produce ols_predictions without them, unless I'm misreading?

ols_predictions = pd.concat([neg_predictions, pos_predictions], axis = 0).to_frame()

Let me know what you think, particularly any analyses you think would perhaps reveal why mnlogit + OLS is doing better in RMSE. Is it just because its predicting smaller values, and when RF's mistakes have a higher cost in RMSE (especially if it gets the sign wrong)?

My guess is that RF is less conservative: it takes some more risks, as you can see from the chart below. This, along with the zero and mean imputations outperforming both models, shows that MSE is an insufficient criterion, given how important it is for tax analysis to predict the outliers, as the RF does. I'm not sure more analysis of MSE is needed; better to just move on to quantiles.

[chart comparing RF and mnlogit+OLS predictions]

Abraham-Leventhal commented 5 years ago

@martinholmer Thanks, I appreciate it. And thank you very much for your help throughout this project!

@MaxGhenis

Reviewing your notebook, one thing I wasn't sure of is how the zero-predictions were modeled. Cell 14 appears to produce ols_predictions without them, unless I'm misreading?

My logic wasn't clearly written out here. ols_predictions does not include the rows which were predicted by mnlogit to be zero, so it is shorter than test, but its row labels are consistent with test. Thus, when I assign ols_predictions to a new column in test as I do below:

test['mnlOLS_p'] = ols_predictions

...the continuous predictions will slot into the appropriate rows in test, and all rows not contained in ols_predictions will get a NaN. Since these rows are just those which were predicted to be zero by mnlogit, I assign them a zero in the code just below:

test['mnlOLS_p'] = test['mnlOLS_p'].fillna(0)

Does that answer your question?

I'm not sure more analysis of MSE is needed; better to just move on to quantiles.

Agreed! And thank you as well for your kind words and your help throughout my internship.

MaxGhenis commented 5 years ago

This notebook applies various methods to predicting quantiles of e00900 in the CPS, as an example that could be ported to PUF for CPS imputation. It uses the same features @Abraham-Leventhal used for the most recent analysis, and predicts against a 20% holdout at the 10th, 30th, 50th (median), 70th, and 90th percentiles.

In this case, random forests performs best of all methods, outperforming the trinomial logit + linear model (with the same log-transformations Avi used, labeled LogitOLS) by 57% on overall quantile loss, and the next-best approaches (Keras and TensorFlow, both deep neural networks) by 39%. LogitOLS outperforms a standard OLS and quantile linear regression, neither of which was log-transformed. [chart of overall quantile loss by method]

These patterns were consistent across the five quantiles: [chart of quantile loss per quantile]

As one example, this case had the highest quantile loss in the test data set for the LogitOLS method, which (along with all non-RF methods) gave essentially zero probability to its high true value. RF predicted a ~25% chance that this record would have a value as high as it did. [predicted CDFs for this record]

It's possible that some of the other methods could do better than random forests, given deep neural networks and even gradient boosting often improve with hyperparameter tuning. But random forests has an advantage over all these methods in that you only need to train one model, from which any quantile can be obtained. Keras, TensorFlow, gradient boosting, and quantile regression all require training new models for each quantile, meaning we'd have to choose some specific ones rather than predict the full CDF.
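
As a sketch of that point (on hypothetical data, not the PUF): collecting each tree's prediction per record and taking empirical percentiles yields any quantile from a single fitted forest:

```
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data standing in for the predictors and outcome.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 4))
y = X[:, 0] * 5 + rng.normal(size=1000)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# One prediction per tree per record; empirical percentiles approximate the CDF.
tree_preds = np.column_stack([t.predict(X[:10]) for t in rf.estimators_])
quantiles = np.percentile(tree_preds, [10, 30, 50, 70, 90], axis=1)
print(quantiles.shape)  # (5 quantiles, 10 records)
```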

Based on this large predictive improvement, I'd recommend moving forward with random forests for this task, and conducting a similar analysis for other taxdata/C-TAM imputations. Similar analyses of two other high-dimensional datasets show that random forests significantly outperforms linear models, so I would expect improvements in other taxdata applications.

martinholmer commented 5 years ago

@MaxGhenis, I have (at least) two questions about the results you present in this comment.

  1. Why does random forests work better than standard econometric methods (logit+OLS) when you do the comparison, but the result was the other way around when @Abraham-Leventhal did the comparison?

  2. What is the fitted random forests model? That is, what's its functional form and what are the fitted parameter values? Did I miss that information in your notebook?

@MattHJensen @andersonfrailey

MaxGhenis commented 5 years ago

Why does random forests work better than standard econometric methods (logit+OLS) when you do the comparison, but the result was the other way around when @Abraham-Leventhal did the comparison?

Avi found that (a) the logit slightly outperforms the random forests on classifying into negative/zero/positive, and (b) the logit+OLS slightly outperforms RF on MSE.

Neither of these capture the aim of the procedure, which is an appropriate CDF. Avi's finding that predicting a constant zero for all cases outperforms both RF and logit+OLS on MSE is good evidence of this; clearly this would not be a useful model.

Quantile loss captures the fit of CDFs; here's how it's calculated:

import numpy as np

def quantile_loss(q, y, fitted):
    # Pinball (quantile) loss: errors above the prediction are weighted by q,
    # errors below it by (1 - q).
    e = y - fitted
    return np.maximum(q * e, (q - 1) * e)

That is, when the quantile is low, predictions that fall below the actual value are penalized less than predictions that exceed it, and vice versa for high quantiles; at the median the two directions are penalized symmetrically:
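A made-up numeric example of that asymmetry, using the function above with a true value of 100, one low prediction (50), and one high prediction (150):

y = np.array([100.0, 100.0])
fitted = np.array([50.0, 150.0])

print(quantile_loss(0.1, y, fitted))  # [ 5. 45.] -- the 10th percentile tolerates predicting low
print(quantile_loss(0.9, y, fitted))  # [45.  5.] -- the 90th percentile tolerates predicting high
print(quantile_loss(0.5, y, fitted))  # [25. 25.] -- the median penalizes both sides equally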

While OLS produces parallel quantile lines, nonparametric methods allow for more flexible quantiles; the example charts here come from a single-feature model fit to different data.

In the taxdata e00900 case, OLS prediction intervals are too conservative to capture outliers, which are important for taxcalc analyses. This is evident in some of Avi's charts, in the sample case of RECID 255187, and in my "Quantile loss per quantile" chart, which shows that RF and logit+OLS are comparable at predicting the 10th percentile (where most records should have predictions of zero), but RF is far better at predicting the 90th percentile.

What is the fitted random forests model? That is, what's its functional form and what are the fitted parameter values? Did I miss that information in your notebook?

Random forests models are nonparametric ensembles of trees. Predicting from them on new data involves saving and loading model objects rather than hard-coding coefficients.

martinholmer commented 5 years ago

@MaxGhenis said in response to my question:

What is the fitted random forests model? That is, what's its functional form and what are the fitted parameter values? Did I miss that information in your notebook?

Random forests models are nonparametric ensembles of trees. Predicting from them on new data involves saving and loading model objects rather than hard-coding coefficients.

So, you seem to be saying the random forests models are what we might call "black boxes", is that right? Personally, I don't think using opaque models is consistent with the transparency we're striving for in an open-source project.

@MattHJensen @andersonfrailey

MaxGhenis commented 5 years ago

The model object is fully reproducible and can be stored on GitHub. Many machine learning projects are open-source.

martinholmer commented 5 years ago

@MaxGhenis said in issue #267:

The model object is fully reproducible and can be stored on GitHub. Many machine learning projects are open-source.

We are also able to store encrypted zip files on GitHub, which does not mean they are transparent. And being "fully reproducible" is not the same thing as being transparent, especially when the estimation (or "training") data are not publicly available. People looking at our project want to understand what we've done without having to fully reproduce what we've done.

@MattHJensen @andersonfrailey

MaxGhenis commented 5 years ago

There are tools to look inside models like random forests, such as feature importance, which quantifies how much each feature improves the model when included. I find these more helpful for understanding models than coefficients and standard errors from OLS, since you don't have to standardize them back to an interpretable scale, weigh p-values against effect sizes, or worry as much about collinearity (RF importance also accounts for nonlinear relationships). Here's what this looks like for this model:
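For anyone who wants to reproduce that kind of chart, here's a minimal sketch with scikit-learn's impurity-based importances (toy data and illustrative column names, not the model above):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['feat_a', 'feat_b', 'feat_c'])
y = 5 * X['feat_a'] + rng.normal(size=1000)  # toy target driven mostly by feat_a

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ measures how much each feature reduces impurity (MSE) across trees.
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)  # feat_a should dominate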

I'm surprised that opposition to nonparametric models is coming at this stage, given they've been part of the discussion for months. In https://github.com/open-source-economics/taxdata/issues/244#issuecomment-405619718, @andersonfrailey said "we're going for accurate data more than anything else," and this has guided work from Avi, myself, and others. Applying ML to prediction problems is now a routine part of economics curricula, including at the undergraduate level, and this is a typical prediction problem to apply it to. I don't think linearity justifies implementing a model with over double the loss.

If you disagree, would you consider running the random forests model and storing it for others to apply their own imputations?

MattHJensen commented 5 years ago

I don't like the idea of rejecting all nonparametric models on transparency grounds. I support open source and transparency because I think they will lead to better policy predictions and better policies.

If a modeling decision ever were to come down to transparency or predictive accuracy, I would choose predictive accuracy. That's why, when using Tax-Calculator, I am ok using the IRS public use file to begin with.

Is there disagreement over which model will lead to better predictive accuracy for Tax-Calculator users?

In addition to predictive accuracy, I'd suggest considering ease of maintenance.

martinholmer commented 5 years ago

@MattHJensen said, among other things, in taxdata issue #267:

Is there disagreement over which model will lead to better predictive accuracy for Tax-Calculator users? In addition to predictive accuracy, I'd suggest considering ease of maintenance.

Yes, I don't believe @MaxGhenis made correct or sensible comparisons of "predictive accuracy" in #267.

I agree that "ease of maintenance" should be considered along with "predictive accuracy".

rickecon commented 5 years ago

These are two different questions. The question of transparency is being confounded with the question of model choice. Let me propose that all the models we would use in this case are transparent, in that textbooks and peer-reviewed papers have been written about their derivation and use, and open-source software packages have been produced to implement them. The question of model selection depends on one's objective.

Nonparametric models are "black boxes" only in the sense that they are complex and often nonlinear. However, the algorithms used to estimate them, the software that implements the algorithms, and the theory behind the algorithms are perfectly transparent. Many textbooks treat the theory; scikit-learn is open source; and TensorFlow, which can be used alongside scikit-learn to train the models' tuning parameters and minimize test-set prediction error, is open source.

For imputation of the type described in this issue, predictive accuracy seems more important than interpreting how particular variables contributed to an imputed value. If predictive accuracy is most important, statistical learning models (which are almost always nonparametric) nearly always dominate. If it is important to infer the degree to which certain variables contributed to the imputed value, statistical learning models have developed some new capabilities in this area, but parametric models will often be better.

MattHJensen commented 5 years ago

I asked:

Is there disagreement over which model will lead to better predictive accuracy for Tax-Calculator users?

@martinholmer said:

Yes, I don't believe @MaxGhenis made correct or sensible comparisons of "predictive accuracy" in #267.

Will you have an opportunity to expand on this? A discussion of which imputation model will lead to greater predictive accuracy for Tax-Calculator users ought to be very interesting/engaging, and also potentially relevant to work @donboyd5 is doing to create a synthetic file.

MaxGhenis commented 5 years ago

In addition to predictive accuracy, I'd suggest considering ease of maintenance.

Here's a notebook that fits a random forests model to the CPS data, saves the model to disk, loads it, and makes stochastic imputations using the loaded file. This seems simpler to me than the current approach of doing extensive preprocessing, running multiple models, and predicting based on hard-coded coefficients.
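For concreteness, here's a minimal sketch of that save/load-and-impute workflow (assuming a fitted RandomForestRegressor rf and test features X_test as in the earlier sketch; the file name and the random-draw step are illustrative, not the notebook's exact code):

import numpy as np
from joblib import dump, load

dump(rf, 'rf_model.joblib')            # save the fitted model to disk
rf_loaded = load('rf_model.joblib')    # load it back later, e.g. in the imputation script

# Stochastic imputation: for each record, draw one tree's prediction at random,
# which is roughly like sampling a random quantile of that record's predicted CDF.
tree_preds = np.stack([t.predict(X_test) for t in rf_loaded.estimators_])
rng = np.random.default_rng(0)
pick = rng.integers(tree_preds.shape[0], size=tree_preds.shape[1])
imputed = tree_preds[pick, np.arange(tree_preds.shape[1])]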

I don't believe @MaxGhenis made correct or sensible comparisons of "predictive accuracy" in #267.

@martinholmer what would be more correct or sensible in your view? Neither trinomial log-loss nor MSE seems suitable given the use case.

I've drafted a blog post here explaining quantile loss in more depth. I believe this is the appropriate loss for this problem, and would welcome feedback from anyone here (it's open for comments).

martinholmer commented 5 years ago

@MattHJensen asked:

Is there disagreement over which model will lead to better predictive accuracy for Tax-Calculator users?

And @martinholmer answered:

Yes, I don't believe @MaxGhenis made correct or sensible comparisons of "predictive accuracy" in #267.

Then @MattHJensen asked:

Will you have an opportunity to expand on this? A discussion of which imputation model will lead to greater predictive accuracy for Tax-Calculator users ought to be very interesting/engaging, and also potentially relevant to work @donboyd5 is doing to create a synthetic file.

There has been an ongoing discussion of how best to assess the "predictive accuracy" of statistically imputed values since @Abraham-Leventhal started working in June on the project described in taxdata issue #267. That discussion has focused on several methods of assessing "predictive accuracy," but little if any justification has been given for the method currently in focus. My understanding of the statistics literature (which could be faulty) is that there is a "gold-standard" test of whether two empirical (cumulative) distribution functions are close to each other. That test is the Kolmogorov–Smirnov (KS) test, which dates back to the 1930s. The KS two-sample test is widely available in R and in Python (via the scipy.stats package).
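For reference, a minimal sketch of the two-sample KS test in Python (toy lognormal arrays standing in for the observed and imputed values):

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
observed = rng.lognormal(mean=8, sigma=2, size=5000)  # stand-in for actual values
imputed = rng.lognormal(mean=8, sigma=2, size=5000)   # stand-in for imputed values

# Null hypothesis: the two samples come from the same distribution.
statistic, pvalue = ks_2samp(observed, imputed)
print(statistic, pvalue)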

When I said I didn't think @MaxGhenis had made "sensible" comparisons, I simply meant that he seemed to be making up the comparison test instead of using the standard KS test. And when I said I didn't think @MaxGhenis had made "correct" comparisons, I simply meant that it seems to me (although I could be wrong) that he made a mistake in generating the imputed distribution when using the "trinomial-logit+regression" imputation method. My belief is that if he switches to using the KS test he will quickly discover his mistake.

MaxGhenis commented 5 years ago

The KS test only assesses the similarity of two CDFs, not the accuracy of individual predictions. It would be a bit like using a t-test to assess a model's accuracy at predicting means, rather than the standard MSE. The overall distribution might be fine, but individual predictions might be completely scrambled, and that would worsen tax analysis.

It's also not clear to me how the "stochastic" part of stochastic imputation fits in with the KS test, which typically assesses a distribution of point estimates. Could you expand on this with an example of the metric we would use to assess the predicted distribution of each observation?

I did not make up quantile regression and quantile loss, parts of which arguably predate OLS and which have 50,000 and 1,000 Google Scholar results, respectively. Applying them to nonparametric models like random forests and deep learning is also not new, as hundreds of Google Scholar results attest. I've been unable to find any resources online about combining trinomial logits and OLS for stochastic imputation not written by @martinholmer, but this is probably a terminology issue: stochastic imputation is really another term for picking a random quantile generated from some generalized quantile regression model.

Here's my LogitOLS code, which I adapted from @martinholmer and @Abraham-Leventhal's code. I've checked the results, which look correct.

MattHJensen commented 5 years ago

There is some disagreement over the best way to measure the accuracy of the imputation discussed in this issue. There is even some disagreement over what "accuracy" means in this context: do we care about the closeness of the two distributions, the closeness of individual predictions, or should we take a step back and pay attention to how the imputation affects the accuracy of Tax-Calculator's results?

I think it would be helpful for others to be able to weigh in on these issues (I have suggested that @donboyd5 could be helpful; @codykallen may be as well, as could others), but I'm hesitant to suggest that anyone wade through the issue's 50 comments.

Would someone be willing to start a Google doc or similar that summarizes the goals of the issue, the areas of agreement, and, in particular, the recent areas of disagreement?