Impute itemized deduction amounts to non-itemizers

PSLmodels / taxdata

The TaxData project prepares microdata for use with the Tax-Calculator microsimulation project.

http://pslmodels.github.io/taxdata/

Impute itemized deduction amounts to non-itemizers #32

Closed MattHJensen closed 6 years ago

MattHJensen commented 8 years ago

We should impute itemized deduction amounts to non-itemizers so that we can simulate reforms that increase the number of itemizers.

This issue was moved from https://github.com/open-source-economics/Tax-Calculator/issues/230.

@GoFroggyRun recently took on this project. @GoFroggyRun, could you please post an update on your work?

Feel free to link to or attach your and Chi Tran's presentations or any other information you think might be relevant.

GoFroggyRun commented 8 years ago

@MattHJensen:

I have just finished the write-up (pdf attached), and the imputed version of 09 puf is also available. I'm aware that we'll be switching to 10 puf soon, and imputing 10 puf should be very efficient if we are not revising/updating our current donor (14 CEX).

I'll streamline the code and post all of it on GitHub once done. Meanwhile, please let me know if you have any comments, concerns or remarks regarding the write-up.

imputation_fin.pdf

MattHJensen commented 8 years ago

@GoFroggyRun, I think it would be very helpful to add a case study of converting an itemized deduction into a credit. That would give users confidence in the technique. You could incorporate the scores with and without the info for standard deduction filers. There could be some good material for comparisons in this document: http://www.cbo.gov/sites/default/files/cbofiles/ftpdocs/121xx/doc12167/charitablecontributions.pdf

martinholmer commented 8 years ago

Sean Wang (@GoFroggyRun) and Chi Tran,

I've briefly read your 22-Aug-2016 paper entitled "On Cold-Deck Imputation with Data Quality Improvement Using Simulation Model". I have three types of comments: some stylistic suggestions regarding the paper, a substantive issue regarding what you did in the imputation work, and an issue about the lack of imputation results in the paper.

Stylistic Issues

Given the technical nature of imputation, the paper must use technical terminology. But generally, technical papers point to references that provide more detailed descriptions of terminology and methods. Your paper uses a lot of technical terms that are not referenced. This style makes the paper very hard reading for even people with microsimulation experience. Part of the problem may be related to the fact that the paper never cites references [3] and [4]. Do those books discuss hot-deck and cold-deck imputation, nearest-neighbor and scaled-mean estimation, etc.? If so, you should cite those books when you first introduce a technical concept. If not, then you need to add to the References and cite the new ones when appropriate. Also, there is no citation for "Lorez Kueng's methodology from the Cex-TAXSIM project" [page 6]. There is no citation for the OSPC Tax-Calculator. And there are other examples.

Substantive Issue

I'm concerned about what you did to construct a MARS variable for CEX-derived tax units, as described in the "Re-sampling Married CUs" section of the paper on pages 6-7. In particular, I don't understand this: "... Head of households (type 4) are assigned to reference people from CUs with more than one family member but where only one of which makes an income." But this includes many married couples (with or without children) where only one of the couple has earnings. If I understand correctly what you have done, this seems like a serious error in logic.

Imputation Results

Readers of your paper are going to be very disappointed when they get to the end because you provide no descriptive statistics about the statistical distribution of the imputed variables. You must add a section that describes the results of your imputation process: the distribution of the imputed values for each imputed variable and the distribution of the sum of the imputed variables, at a minimum.

@GoFroggyRun @MattHJensen @feenberg @Amy-Xu

GoFroggyRun commented 8 years ago

@martinholmer,

Thank you so much for your thoughtful and detailed comments on the write-up.

Regarding your concerns:

Stylistic Issues

But generally, technical papers point to references that provide more detailed descriptions of terminology and methods. Your paper uses a lot of technical terms that are not referenced.

I am aware of that, and will try to pin-point my citation to make it more reader-friendly.

Do those books discuss hot-deck and cold-deck imputation, nearest-neighbor and scaled-mean estimation, etc.?

Yes. They are, however, discussed somewhat discursively. I will try to find a good way to cite them.

there is no citation for "Lorez Kueng's methodology from the Cex-TAXSIM project" [page 6]. There is no citation for the OSPC Tax-Calculator. And there are other examples.

Right. Since they are web pages, I wasn't quite sure what the best way to cite them is.

Substantive Issue

My apologies that this part might look a bit confusing. There's one variable in the CEX dataset called MARITAL1, which is, to some extent, equivalent to our MARS variable in puf. Married units have a MARITAL1 value of 1, while other values indicate divorced, never married, etc. To be classified as "Head of Household" or "single", one needs to be, at least, not married. I will add more detail to that part to avoid any confusion or misunderstanding.
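The rule described here can be sketched roughly as follows. This is only an illustration, not the code used for the paper: the function name `assign_mars`, the arguments `family_size` and `num_earners`, and the MARS codes for single (1) and joint (2) are assumptions; only "MARITAL1 equals 1 for married units" and "head of household is type 4" come from this discussion.

```python
# Hypothetical sketch of the filing-status rule discussed above.
MARRIED = 1  # from the discussion: married CEX units have MARITAL1 == 1

# Assumed MARS codes: 1 single, 2 joint; 4 (head of household) is from the paper.
MARS_SINGLE, MARS_JOINT, MARS_HOH = 1, 2, 4


def assign_mars(marital1: int, family_size: int, num_earners: int) -> int:
    """Map a CEX reference person to a MARS value (illustrative rule only)."""
    if marital1 == MARRIED:
        return MARS_JOINT
    # Only unmarried reference people can be single or head of household.
    if family_size > 1 and num_earners == 1:
        return MARS_HOH
    return MARS_SINGLE


# A married couple with one earner is joint, not head of household.
print(assign_mars(marital1=1, family_size=3, num_earners=1))
# An unmarried reference person with dependents and one earner.
print(assign_mars(marital1=3, family_size=3, num_earners=1))
```

Checking marital status first is what keeps one-earner married couples out of the head-of-household group.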

Imputation Results

Sure. I'll include a case study, as suggested by @MattHJensen, as well as some statistical distributions of imputed variables in the write-up.

Thanks again for all your comments. And, as always, any comments, concerns or remarks would be more than appreciated.

martinholmer commented 8 years ago

Sean Wang (@GoFroggyRun) said:

My apologies that this part might look a bit confusing. There's one variable in the CEX dataset called MARITAL1, which is, to some extent, equivalent to our MARS variable in puf. Married units have a MARITAL1 value of 1, while other values indicate divorced, never married, etc. To be classified as "Head of Household" or "single", one needs to be, at least, not married. I will add more detail to that part to avoid any confusion or misunderstanding.

That sounds more reasonable, but is not the impression that the current draft gives.

GoFroggyRun commented 7 years ago

@MattHJensen @martinholmer:

I have finished a revision of the draft (please find attachment) that addresses concerns mentioned in previous discussions. Thanks for @martinholmer's careful review and thoughtful comments.

Before proceeding to any imputation-related reforms and comparisons (as suggested by @MattHJensen), I'd like to check in with you and see which reforms would be helpful, since there are a lot of reforms included in the report and some of them are not applicable in TC. Any ideas would be appreciated.

And feel free to let me know if there are any further comments, concerns or remarks regarding the revision.

imputation_041417.pdf

martinholmer commented 7 years ago

@GoFroggyRun, I've finally had a chance to read your revised description of imputing itemized expense amounts for non-itemizers. That description was attached to the conversation of taxdata issue #32 during April 2017.

This latest version is much improved, so thanks for all the extra work. I have a suggestion and several questions.

(1) Need a shorter and more descriptive paper title.

How about Imputing Deductible Expense Amounts for Non-Itemizers?

This describes what you are doing and is shorter (so that the page numbers in the LaTeX header don't get swallowed by the long title).

(2) Questions about Data Cleaning at bottom of page 3.

I don't understand why you have dropped these three groups from the CEX sample. (a) You are splitting families into filing units, so why can't you split CEX consumer units into families? (b) Why not treat "surviving spouse units" as single or head of household filing units depending on whether or not they have dependents? (c) Why are CEX units without earnings being dropped? Who is left in the CEX sample to use to impute to non-itemizing PUF retirees, who will most likely have zero earnings but positive social security and/or pension income? This seems like a big mistake, but maybe further explanation can change my mind about that.

(3) Questions about categorizing CEX data by earnings and PUF data by income.

At top of page 7, there is a confusing description where CEX units categorized by earnings group seem to be compared with PUF units categorized by income. I don't understand how that can be done in a sensible way. Perhaps more explanation will eliminate this issue.

(4) Beginning in Section 4 there is no description of what the six e variables mean.

Why not give the reader a break and explain in words what the deductible expense variables mean? Also, why is e18500 (real-estate taxes paid) not imputed? Seems like we still have a problem after all your imputation work because we still have a major deductible expense missing.

(5) Question about what the phrase "non-ordinal categorial variable" means.

I didn't see anyplace in the paper that describes what you mean by this term. Can you explain?

(6) Questions about the imputed distributions shown on page 13.

This is my biggest concern with what you have done (as far as I can tell from the paper). The six variables have imputed-value distributions on page 13 that are very different from what I would expect. For example, I would expect among non-itemizers that most would have zero non-cash charitable contributions and that a few would have positive non-cash charitable contributions. But the distribution for e20100 on page 13 shows most non-itemizers having a value of about $1400 and almost none having a value of zero. Why is that? Is my expectation about this variable's distribution in the CEX subsample of non-itemizers mistaken? Or, by taking the imputed value to be the average of the nearest 80 neighbors (if I'm understanding correctly what you're doing) are you distorting the CEX distribution of this variable? Put another way, I don't see how your imputation method handles correctly the mass point at zero for these six variables. Maybe more explanation would answer my question.

@MattHJensen @feenberg @Amy-Xu @andersonfrailey

feenberg commented 7 years ago

Where is the document discussed here?

dan feenberg

martinholmer commented 7 years ago

On Mon, May 1, 2017 at 2:44 PM, Daniel Feenberg notifications@github.com wrote:

Where is the document discussed here?

I've attached the pdf to this email.

martinholmer commented 7 years ago

@GoFroggyRun, Let me amplify the concerns I expressed in my question (6), which was posed in taxdata issue #32 on May 1, 2017.

I have no idea what the distribution of non-cash charitable contributions (e20100) is in the CEX sample of non-itemizers you constructed. But I can tabulate the e20100 distribution among 2013 itemizers in the puf.csv sample. The distribution I tabulate below shows that roughly half have a zero value for e20100, which is a distribution that is very different than the one you impute to non-itemizers.

Here is what I did:

$ tc puf.csv 2013 --sqldb
$ sqlite3 puf-13-#-#.db
SQLite version 3.13.0 2016-05-18 10:57:30
Enter ".help" for usage hints.
sqlite> select count(*),round(sum(s006)*1e-6,2) from dump;
219814|163.1
sqlite> select count(*),round(sum(s006)*1e-6,2) from dump where c21060>0;
88130|42.12
sqlite> select count(*),round(sum(s006)*1e-6,2) from dump where c21060>0 and e20100>0;
42019|21.09
sqlite> select count(*),round(sum(s006)*1e-6,2) from dump where c21060>0 and e20100>500;
28092|12.86
sqlite> select count(*),round(sum(s006)*1e-6,2) from dump where c21060>0 and e20100>1000;
13354|4.84
sqlite> .quit
$

So, only 21.09 million of the 42.12 million itemizers (about 50 percent) have a positive amount. And only 4.84 million (about 23 percent of the positives and 11 percent of all itemizers) have a 2013 e20100 value larger than $1,000.
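The shares quoted above can be re-derived from the weighted counts in the sqlite session. The snippet below only repeats that arithmetic; the variable names are made up for the illustration.

```python
# Weighted counts (millions of filing units) copied from the sqlite output above.
itemizers = 42.12   # c21060 > 0
positive = 21.09    # c21060 > 0 and e20100 > 0
over_1000 = 4.84    # c21060 > 0 and e20100 > 1000

share_positive = positive / itemizers
share_over_1000_of_positive = over_1000 / positive
share_over_1000_of_itemizers = over_1000 / itemizers

print(f"itemizers with e20100 > 0:          {share_positive:.1%}")
print(f"positives with e20100 > 1000:       {share_over_1000_of_positive:.1%}")
print(f"all itemizers with e20100 > 1000:   {share_over_1000_of_itemizers:.1%}")
```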

But your graph on page 13 shows the vast majority of non-itemizers have imputed values of e20100 around $1,300 and very few have zero.

Can the basic shape of the e20100 distribution among non-itemizers really be that different from the basic shape of the e20100 distribution among itemizers?
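To make the concern about the mass point at zero concrete, here is a toy sketch. It does not use CEX or PUF data and does not do any neighbor matching; it simply assumes a donor pool in which about half of the values are zero, and compares averaging 80 donors with drawing a single donor from the same 80 candidates. All names and numbers other than the 80 are invented for the illustration.

```python
import random
from statistics import mean

random.seed(0)

# Toy donor pool (not CEX data): about half of the donors report zero.
donors = [0.0 if random.random() < 0.5 else random.uniform(100, 3000)
          for _ in range(5000)]

K = 80  # number of neighbors averaged, as discussed in this thread


def impute_by_average(pool, k):
    """Average k donors; stands in for the mean of the k nearest neighbors."""
    return mean(random.sample(pool, k))


def impute_by_draw(pool, k):
    """Pick one donor at random from the same kind of k-candidate set."""
    return random.choice(random.sample(pool, k))


def zero_share(values):
    """Fraction of values that are exactly zero."""
    return sum(v == 0.0 for v in values) / len(values)


averaged = [impute_by_average(donors, K) for _ in range(2000)]
drawn = [impute_by_draw(donors, K) for _ in range(2000)]

print(f"zero share among donors:          {zero_share(donors):.2f}")
print(f"zero share after averaging 80:    {zero_share(averaged):.2f}")
print(f"zero share after drawing 1 of 80: {zero_share(drawn):.2f}")
```

In this toy setting the averaged values cluster near the pool mean and essentially never equal zero, while the single-donor draw keeps roughly the donor share of zeros. Whether the same thing happens with the actual matched CEX neighbors is exactly the open question.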

@MattHJensen @feenberg @Amy-Xu @andersonfrailey

GoFroggyRun commented 7 years ago

@martinholmer, thanks for your thoughtful comments and follow-up analysis. I'll first address some of your concerns.

(1) Need a shorter and more descriptive paper title. How about Imputing Deductible Expense Amounts for Non-Itemizers?

I don't have a strong preference regarding the title, so I don't have problems with this one.

(2) Questions about Data Cleaning at bottom of page 3. I don't understand why you have dropped these three groups from the CEX sample. (a) You are splitting families into filing units, so why can't you split CEX consumer units into families? (b) Why not treat "surviving spouse units" as single or head of household filing units depending on whether or not they have dependents? (c) Why are CEX units without earnings being dropped? Who is left in the CEX sample to use to impute to non-itemizing PUF retirees, who will most likely have zero earnings but positive social security and/or pension income. This seems like a big mistake, but maybe further explanation can change my mind about that.

For (a), maybe I am confused, but why would we be interested in families rather than filing units?

For (b), the number of observations considered "surviving spouse units" is rather insignificant compared to either the single group or the HH group. My judgment is thus that dealing with them is probably not worthwhile and would not matter. Moreover, including them in either group could potentially introduce distortion. The quality of the donor data is much more important than such an insignificant increase in sample size, so I'd rather not make that trade-off.

For (c), earnings are the factor we use to break down consumer units (CUs) in the CEX. When earnings are zero, there's no way to determine how to split CUs. Indeed, zero-earnings units can have positive social security and/or pension amounts, but introducing these factors would make things more complicated (I prefer generalized treatments over special treatments). More importantly, CUs with zero earnings are not interesting in themselves, in that their expenditures are mostly negligible. The effect, in terms of imputation, of including those CUs (supposing we had an ideal way to break them down) is more or less the same as having trivial records with zero or close-to-zero expenditures in the donor dataset (recall that we only use number of exemptions, earnings and marital status to measure similarity).

(3) Questions about categorizing CEX data by earnings and PUF data by income. At top of page 7, there is a confusing description where CEX units categorized by earnings group seem to be compared with PUF units categorized by income. I don't understand how that can be done in a sensible way. Perhaps more explanation will eliminate this issue.

The "income" I used in PUF is actually e00200, the wage variable. I will fix this confusion. Thanks for noticing this.

(4) Beginning in Section 4 there is no description of what the six e variables mean. Why not give the reader a break and explain in words what the deductible expense variable mean? Also, why is e18500 (real-estate taxes paid) not imputed? Seems like we still have a problem after all your imputation work because we still have a major deductible expense missing.

I'll add descriptions for those variables. Thanks for your suggestion. Regarding e18500, I'm aware of its importance. It is not included, however, because we couldn't find similar/compatible information in the CEX.

(5) Question about what the phrase "non-ordinal categorial variable" means. I didn't see anyplace in the paper that describes what you mean by this term. Can you explain?

It means that this categorical variable has no clear ordering. I probably shouldn't have included the term "non-ordinal" since categorical variable readily implies that ordering is not clear.
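One common way to use a category without ordering in a nearest-neighbor match is to count it as a plain 0/1 mismatch, so that the numeric code values never enter the distance. The sketch below is only an illustration of that idea under assumed names (`Unit`, `distance`, `earnings_scale`); it is not the distance used in the paper, which is not spelled out in this thread.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    exemptions: int
    earnings: float
    marital: int  # category code; its numeric value carries no order


def distance(a: Unit, b: Unit, earnings_scale: float = 10_000.0) -> float:
    """Illustrative dissimilarity: numeric gaps plus a 0/1 category mismatch."""
    d_exempt = abs(a.exemptions - b.exemptions)
    d_earn = abs(a.earnings - b.earnings) / earnings_scale
    # A mismatch costs the same no matter how far apart the codes are.
    d_marital = 0.0 if a.marital == b.marital else 1.0
    return d_exempt + d_earn + d_marital


base = Unit(exemptions=2, earnings=50_000.0, marital=1)
print(distance(base, Unit(2, 50_000.0, 3)))
print(distance(base, Unit(2, 50_000.0, 5)))
```

Both comparisons give the same distance, which is the point: code 5 is not "further" from code 1 than code 3 is.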

I'll have a separate comment to address rest of the concerns.

martinholmer commented 7 years ago

@GoFroggyRun said about what @martinholmer said:

(2) Questions about Data Cleaning at bottom of page 3. I don't understand why you have dropped these three groups from the CEX sample. (a) You are splitting families into filing units, so why can't you split CEX consumer units into families?

For (a), maybe I am confused, but why would we be interested in families rather than filing units?

I'm simply suggesting (because you have so few CEX observations) that you not discard multiple-family CEX units. Split those CEX units into families, and then use your procedures to split each of those families into tax filing units.

GoFroggyRun commented 7 years ago

@martinholmer said:

This is my biggest concern with what you have done (as far as I can tell from the paper). The six variables have imputed-value distributions on page 13 that are very different from what I would expect. For example, I would expect among non-itemizers that most would have zero non-cash charitable contributions and that a few would have positive non-cash charitable contributions. But the distribution for e20100 on page 13 shows most non-itemizers having a value of about $1400 and almost none having a value of zero. Why is that? Is my expectation about this variable's distribution in the CEX subsample of non-itemizers mistaken? Or, by taking the imputed value to be the average of the nearest 80 neighbors (if I'm understanding correctly what you're doing) are you distorting the CEX distribution of this variable? Put another way, I don't see how your imputation method handles correctly the mass point at zero for these six variables. Maybe more explanation would answer my question.

and followed up with:

So, only 21.09 million of the 42.12 million itemizers (about 50 percent) have a positive amount. And only 4.84 million (about 23 percent of the positives and 11 percent of all itemizers) have a 2013 e20100 value larger than $1,000. But your graph on page 13 shows the vast majority of non-itemizers have imputed values of e20100 around $1,300 and very few have zero. Can the basic shape of the e20100 distribution among non-itemizers really be that different from the basic shape of the e20100 distribution among itemizers?

First of all, the distribution I presented in the paper is not weighted. Each record in the distribution has a uniform weight. I probably should have specified that in the paper. Given this, each distribution I presented can actually be viewed as a re-scaled version of the corresponding CEX distribution. I wouldn't use the word "distort" since I'm simply averaging everything without extra treatment. I'm not sure how weights would affect the aggregate distributions and results, nor whether the weighted distribution would meet your expectation.
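As a toy illustration of why weighting can change how such a distribution looks, the snippet below compares the unweighted and weighted share of zero values. The records are invented, not PUF or CEX data, and the function names are hypothetical.

```python
# Toy records: (imputed value, sampling weight). Not PUF or CEX data.
records = [
    (0.0, 900.0),
    (0.0, 1100.0),
    (1400.0, 100.0),
    (1300.0, 50.0),
    (1500.0, 50.0),
]


def unweighted_zero_share(rows):
    """Share of records with a zero value, each record counted once."""
    return sum(1 for value, _ in rows if value == 0.0) / len(rows)


def weighted_zero_share(rows):
    """Share of the total weight carried by records with a zero value."""
    total = sum(weight for _, weight in rows)
    return sum(weight for value, weight in rows if value == 0.0) / total


print(f"unweighted zero share: {unweighted_zero_share(records):.2f}")
print(f"weighted zero share:   {weighted_zero_share(records):.2f}")
```

Here the zero-value records carry most of the weight, so the weighted share is much larger than the unweighted one. The direction and size of the effect in the real data depend on the actual weights.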

Last Friday, @feenberg also raised a concern regarding the distribution plots: he thinks these distributions are not smooth enough, in that they have multiple dips. In his opinion, taking the nearest 80 neighbors should already have alleviated this issue, compared to taking only 1 neighbor. But these distributions still aren't good enough. One possible solution is to incorporate previous-year CEX releases, but I am not sure how much time and effort that might take.

I'm still thinking about what the best strategy is to deal with concerns regarding these distributions. Any comments, concerns or remarks are most welcome.

cc @MattHJensen

feenberg commented 7 years ago

I do think mixing 80 values is the problem. I left Sean with some ideas last Friday for re-evaluating the optimal k and we should give him a chance to implement that.

dan

MattHJensen commented 7 years ago

@GoFroggyRun, in addition to posting the charts that we discussed today, could you post an overview of Dan's ideas relating to "mixing 80 values"?

GoFroggyRun commented 7 years ago

@MattHJensen:

Here are the two plots we discussed.

First, the weighted version of the imputed variables:

[Figure: weighted-density]

And the original CEX distribution with uniform weights:

[Figure: cex_distribution]

Last Friday, @feenberg suggested a way of evaluating the effect of "mixing 80 neighbors" by plotting the correlation (variance) plot (i.e. variance against number of neighbors). After giving the suggestion a second thought, I don't think it is sensible. Currently I am using mean squared error (MSE) to evaluate goodness of fit. The bias-variance tradeoff shows that the two components of MSE, namely bias and variance, will be monotonically decreasing and increasing respectively as the number of neighbors increases. Thus some appropriate choice of the number of neighbors would minimize the MSE. The idea behind such a curve is that we are picking a point where bias won't overwhelm variance and vice versa. @feenberg's idea is, in simple words, that we want a method that minimizes the correlation (variance). An immediate consequence of such a choice (in this case choosing one neighbor to impute) is that our result will be seriously biased.

Maybe I am confused about our objective, since I'm using an algorithm that gives "global" optimization.
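For readers following along, here is a minimal sketch of the mechanics of picking the number of neighbors by MSE. It uses leave-one-out error on invented one-dimensional data, which is an assumption made for the illustration; the thread does not say how the MSE was computed for the paper, and the names `knn_predict` and `loo_mse` are hypothetical.

```python
import random
from statistics import mean

random.seed(1)

# Toy donor data (not CEX): one predictor x and an outcome y = 2x + noise.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 2.0 * x + random.gauss(0, 2.0)) for x in xs]


def knn_predict(train, x0, k):
    """Mean outcome of the k training points closest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return mean(y for _, y in nearest)


def loo_mse(points, k):
    """Leave-one-out mean squared error for a given k."""
    errors = []
    for i, (x0, y0) in enumerate(points):
        train = points[:i] + points[i + 1:]
        errors.append((knn_predict(train, x0, k) - y0) ** 2)
    return mean(errors)


candidates = [1, 5, 20, 80, 199]
scores = {k: loo_mse(data, k) for k in candidates}
best_k = min(scores, key=scores.get)

for k in candidates:
    print(f"k={k:3d}  leave-one-out MSE={scores[k]:.2f}")
print("k with the lowest MSE:", best_k)
```

On this toy data both the smallest and the largest candidate do worse than an intermediate value. Note that a k chosen this way targets prediction error for individual records, which is a different objective from preserving the shape of the imputed distribution.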

cc @martinholmer

feenberg commented 7 years ago

On Wed, 17 May 2017, Sean.Wang wrote:

Maybe I am confused with our objective, since I'm using an algorithm that gives "global" optimization.

I think we want to minimize the error in the estimate of the correlation.

dan

martinholmer commented 6 years ago

Pull request #275 resolves taxdata issue #32.