Which variables should we seed on?

donboyd5 / synpuf

Synthetic PUF

MIT License

Which variables should we seed on? #37

Open MaxGhenis opened 5 years ago

MaxGhenis commented 5 years ago

We've seen that, in general, the more seeds in the synthesis production, the higher-fidelity the synthesis is, at the expense of privacy. More precisely, the relationship probably has to do with the unique identifiability of records when limited to the seeds.

For example, the only difference between the green and red bars here is that the green adds several more seeds:

Furthermore, even calculated seeds (which are dropped after the synthesis to be recalculated with Tax-Calculator) produce this relationship. The green bar above used calculated seeds.

Another data point supporting this is synthpop8, which used 9 calculated seeds ('E00100', 'E04600', 'P04470', 'E04800', 'E62100', 'E05800', 'E08800', 'E59560', 'E26190') that together uniquely identified over 80% of records. Each row in this synthesis exactly matched a training record, indicating we need to use far fewer seeds.

While we shouldn't use too many seeds, we may care especially about these calculated features, which could justify seeding on them rather than on some other raw feature. Whether this approach improves the validity of calculated features like AGI is an empirical question we haven't tested, but it seems like a reasonable hypothesis.

Selecting the seeds is therefore one of the most important decisions in the synthesis process. I'd suggest a few factors to consider in this decision:

1) Prioritizing categorical features. This limits the synthesis itself to continuous measures. So for example, we'd want to prioritize MARS.
2) Prioritizing logically "initial" features. For example, XTOT, nu18, MARS etc. are features of the household which logically precede income and deduction measures. This feeds into the question of visit sequence.
3) Prioritizing the most important features. This could be critical calculated features like AGI, or the most important features in determining those critical calculated features.

Regarding (3): I ran a random forests model to determine the importance of each "raw" feature in predicting the 9 calculated features in synthpop8. Here are the top 5, according to the average rank in predicting those 9:

1) E00200 (salaries and wages): most important for predicting E26190 (non-passive income) and E59560 (earned income for EIC).
2) E18400 (SALT): most important for E05800 (income tax before credit), E08800 (income tax after credits), and P04470 (total deductions).
3) S006 (weight): most important for E04800 (taxable income), E05800 (taxbc), and E08800 (taxac).
4) E02000 (Schedule E): most important for E26190 (non-passive income).
5) P23250 (long-term gains less losses): most important for E00100 (AGI), E04800 (taxable income), and E62100 (alternative minimum taxable income).
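
To make "average rank" concrete, here is a minimal sketch of the aggregation step only. The importance scores are invented placeholders (the actual random forest output isn't reproduced in this thread), and `average_rank` is a hypothetical helper name.

```python
# Hypothetical importance scores: target -> {raw feature: importance}.
# These numbers are made up for illustration; they are NOT results from the PUF.
importances = {
    "E00100": {"E00200": 0.30, "E18400": 0.10, "P23250": 0.45, "E02000": 0.15},
    "E26190": {"E00200": 0.40, "E18400": 0.05, "P23250": 0.20, "E02000": 0.35},
    "P04470": {"E00200": 0.25, "E18400": 0.50, "P23250": 0.15, "E02000": 0.10},
}


def average_rank(importances):
    """Rank features within each target (1 = most important), then average."""
    totals = {}
    for scores in importances.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, feature in enumerate(ordered, start=1):
            totals[feature] = totals.get(feature, 0) + rank
    n_targets = len(importances)
    return {f: total / n_targets for f, total in totals.items()}


avg = average_rank(importances)
top = sorted(avg, key=avg.get)  # lowest average rank first
print(top)
```

Averaging ranks rather than raw importances keeps one target with unusually large scores from dominating the ordering.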

Together these 5 features uniquely identify 61% of PUF records, so we'd probably still want a subset, especially if we add something like MARS and XTOT, but I suspect these will be valuable and avoid extra complexity of seeding on calculated features (also makes a simpler story to SOI that we're only using 65 features).

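import pandas as pd

FEATURES = ['E00200', 'E18400', 'S006', 'E02000', 'P23250']
# Share of PUF records whose combination of FEATURES values appears exactly once.
(~pd.read_csv('~/puf2011.csv', usecols=FEATURES).duplicated(keep=False)).mean()
# 0.6131326698821662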

feenberg commented 5 years ago

Oh, the seeds include the calculated variables - I wasn't aware of that. Still, if the synthesis process amounts to "find a record with the same values as the seeds, and call that the synthetic record" then it isn't synthesizing at all. Is there an explanation for why this is happening?

dan

feenberg commented 5 years ago

What does it mean that "5 features uniquely identify 61% of PUF records"? Does it mean "an exact match on 5 continuous variables" or something less?

MaxGhenis commented 5 years ago

if the synthesis process amounts to "find a record with the same values as the seeds, and call that the synthetic record" then it isn't synthesizing at all.

This isn't how the synthesis works in general, but it is how it works when there's no conditional variance of the synthesized features. If you have a tree-based model based on data where all records where x=2 and y=3 also have z=1, and you pass it data where x=2 and y=3, that tree-based model may assign 100% probability to the z=1 scenario. Depending on how strong this is, models that do more to fight overfitting like random forests could still assign that 100% probability. That seems to be what's happening here, and indicates we need to increase the conditional variance by reducing the conditions (seeds).
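
A toy sketch of the "no conditional variance" case, in plain Python rather than an actual CART or random forest fit. Grouping by exact (x, y) values stands in for fully grown tree leaves, which is a simplifying assumption; the data and the names `fit_leaves` and `synthesize_z` are hypothetical.

```python
import random
from collections import defaultdict

# Toy training data (x, y, z); made up for illustration.
train = [(2, 3, 1), (2, 3, 1), (2, 3, 1), (1, 3, 0), (1, 3, 4), (1, 3, 9)]


def fit_leaves(records):
    """Group observed z values by their (x, y) condition, like fully grown tree leaves."""
    leaves = defaultdict(list)
    for x, y, z in records:
        leaves[(x, y)].append(z)
    return leaves


def synthesize_z(leaves, x, y, rng):
    """Draw z from the empirical distribution of the matching leaf."""
    return rng.choice(leaves[(x, y)])


rng = random.Random(0)
leaves = fit_leaves(train)
# Every record with x=2, y=3 has z=1, so sampling can only ever return 1.
no_variance = {synthesize_z(leaves, 2, 3, rng) for _ in range(100)}
# Records with x=1, y=3 have several z values, so draws vary.
some_variance = {synthesize_z(leaves, 1, 3, rng) for _ in range(100)}
print(no_variance, some_variance)
```

With more seeds, more conditions look like the first case: each combination of seed values is matched by a single training record, so the "random" draw has only one candidate.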

What does it mean that "5 features uniquely identify 61% of PUF records"? Does it mean "an exact match on 5 continuous variables" or something less?

Right, restricting the PUF to ['E00200', 'E18400', 'S006', 'E02000', 'P23250'] produces a dataset where 61% of records are unique (this doesn't concern synthetic data).
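
For anyone without PUF access, the same measure can be sketched on toy data in plain Python. The rows below are invented, and `share_unique` is a hypothetical helper that mirrors what `(~df.duplicated(keep=False)).mean()` computes in pandas.

```python
from collections import Counter

# Toy stand-in for the PUF restricted to a few seed columns (values are made up).
rows = [
    (50000, 0, 100),
    (50000, 0, 100),   # duplicate of the row above
    (72000, 3000, 0),
    (0, 0, 0),
    (0, 0, 0),         # duplicate
    (15000, 0, 0),
]


def share_unique(rows):
    """Fraction of rows whose full combination of values occurs exactly once."""
    counts = Counter(rows)
    return sum(1 for r in rows if counts[r] == 1) / len(rows)


print(share_unique(rows))
```

Here two of the six rows are unique. A row counts as unique only if no other row matches it on every column, so both copies of a duplicated row are excluded.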

feenberg commented 5 years ago

On Wed, 6 Mar 2019, Max Ghenis wrote:

  if the synthesis process amounts to "find a record with the same values as
  the seeds, and call that the synthetic record" then it isn't synthesizing
  at all.
This isn't how the synthesis works in general, but it is how it works when there's no conditional variance of the synthesized features. If you have a tree-based model based on data where all records where x=2 and y=3 also have z=1, and you pass it data where x=2 and y=3, that tree-based model may assign 100% probability to the z=1 scenario.

If there are 9 continuous variables, is it surprising there is only one exact match? I thought the "match" was only to "above median" or "below median", which should make a match fairly unlikely.

There is also the oddity that the revenue scores are so poor if all the matches are exact. Is it only the weights that are off??

Dan
MaxGhenis commented 5 years ago

I'm not really surprised, but it depends on the variable; some are more fine-grained than others, and some correlate more with others.

There is also the oddity that the revenue scores are so poor if all the matches are exact. Is it only the weights that are off??

It's probably mostly the weights. Are you using @donboyd5's revised linear-programmed weights? It could also be that it's not the same records in the same representation; all synthetic records are exactly present in the true PUF, but I haven't checked if the reverse is true.

feenberg commented 5 years ago

Isn't there a way to loosen the restriction for a match from exact match to "in the same bin"? In the examples I recall reading about, the bins were above or below median.

Isn't there a way to specify a minimum number of leaves before additional subdivision takes place? I recall reading examples where a minimum of 5 leaves were required.

Am I mistaken in my belief that the synthesis process maintains covariances only to the extent that they are mediated by seed variables? For example, what about the synthesis process encourages property tax and mortgage interest to be correlated?

Is there somewhere I can read up on this?

Dan

donboyd5 commented 5 years ago

Thanks, Max, this is great, with a lot of great detective work. It gives us lots to talk about tomorrow.

I created a Google doc named _selected_MARS3group_puf_synthpop8matches in our Google drive synpuf folder that explains some of my reasons for what I say below, and I also sent a link to each of you. To be on the safe side I am not putting the doc link here but if you have access to the folder you can get it.

I have four main comments:

1) As noted in the doc, one of the reasons we get so many exact matches is because the puf has been so blurred/modified - for example, by rounding. That puf-creation blurring increases the risk that we will produce values that have been changed from true values.

2) Not all exact matches present what seems like meaningful disclosure risk. Many involve mostly zero variables, variables that do not include much information, and records that are common, representing many people. The Google doc mentioned above gets into this in some detail.

That doesn't mean we shouldn't be on the lookout for them, but it does mean we have to interpret them carefully and think carefully about what to address and how.

3) Where we do have exact-match disclosure risk, it does not necessarily mean we need to make changes that significantly degrade the quality of the file to avoid exact matches, such as eliminating powerful seeds. We have multiple options that go beyond changing seed variables, including:

adding small amounts of random noise to variables on the front end, before synthesizing, so that synthesized values are not exactly the same as puf values
using methods during synthesis that ensure we are less likely to put exact puf values on the synthesized files, including (a) larger buckets for CART methods, (b) possibly econometric methods (although I have my doubts about this), and (c) density approaches for choosing leaves on terminal nodes.
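
As a rough sketch of the first option, front-end noise could look like the following. The multiplicative form, the 1% scale, and the choice to leave zeros untouched are all assumptions for illustration; none of them is specified in this thread, and `add_noise` is a hypothetical helper.

```python
import random


def add_noise(values, rel_sd=0.01, rng=None):
    """Multiply each nonzero value by (1 + e), e ~ Normal(0, rel_sd); keep zeros at zero."""
    rng = rng or random.Random()
    return [v * (1 + rng.gauss(0, rel_sd)) if v != 0 else 0 for v in values]


rng = random.Random(42)
salt = [0, 0, 5000, 12000, 0, 800]  # made-up rounded amounts
noisy = add_noise(salt, rel_sd=0.01, rng=rng)
print(noisy)
```

Keeping zeros fixed preserves the zero/nonzero pattern of each record, which matters for tax calculations, but it also means all-zero records remain exact matches.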

Where we do have to reduce seeds, and we may need to, Max's detective work will prove really valuable.

4) Assuming we get rid of all important exact matches (and not all are important), that doesn't mean we don't have disclosure risk. Distance measures will remain important. If a file is close but not perfect, it may still have too much disclosure risk.

MaxGhenis commented 5 years ago

@donboyd5 your doc says:

The total number of puf records involved in exact matches (npufrecs=419) out of the 3,144 puf records with MARS=3, and the number of syn records involved in exact matches (nsynrecs=1,057) out of the 15,720 syn records in the group.

Could you share some records in synthpop8 that you found don't exactly match a training record? I just triple-checked that all records in synthpop8 exactly match training records on all features in this notebook (https://colab.research.google.com/drive/13qxcg_GEzUONqMyw_UaSMB2PN4k8kcDH). Note I'm dropping S006 because that'll be reconstructed and isn't relevant to privacy concerns.

We should decide whether we're treating PUF data as real data, as we've discussed in the past. We know that SOI blurs and rounds data, that lots of fields are zero, and that some records are duplicated when limiting to the 65 features we're synthesizing. But without the real data or details on exactly how they blur, how many real records each PUF record represents, etc., I think we need to just treat it as real data. That should mean avoiding synthesizing exact matches on records that appear only once in the PUF.
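
A minimal sketch of that rule, assuming records are compared as tuples of feature values with S006 already dropped. The toy data and the name `risky_matches` are hypothetical.

```python
from collections import Counter


def risky_matches(puf_rows, syn_rows):
    """Return synthetic rows that exactly match a PUF row appearing only once."""
    puf_counts = Counter(puf_rows)
    return [row for row in syn_rows if puf_counts.get(row, 0) == 1]


# Toy data: tuples of feature values, with S006 (weight) already dropped.
puf = [(1, 0, 0), (1, 0, 0), (2, 500, 0), (3, 0, 75)]
syn = [(1, 0, 0), (2, 500, 0), (2, 510, 0), (3, 0, 75), (3, 0, 75)]

flagged = risky_matches(puf, syn)
print(flagged)
```

The synthetic row (1, 0, 0) is not flagged because it matches a PUF record that appears twice, while both synthetic copies of (3, 0, 75) are flagged because their PUF counterpart is a singleton.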

@feenberg asked:

Isn't there a way to loosen the restriction for a match from exact match to "in the same bin"? In the examples I recall reading about, the bins were above or below median.

Right now we're seeing true exact matches, and we're also looking at distance measures. I think below/above median would be too blunt an instrument to evaluate privacy concerns.

Isn't there a way to specify a minimum number of leaves before additional subdivision takes place? I recall reading examples where a minimum of 5 leaves were required.

Yes I think synthpop CART does this, but I'm not sure this guarantees variance.

Am I mistaken in my belief that the synthesis process maintains covariances only to the extent that they are mediated by seed variables? For example, what about the synthesis process encourages property tax and mortgage interest to be correlated?

No, the synthesis maintains covariances by including them in each prediction model. Suppose we only seed on MARS, and then the first two non-seed synthesized features are property tax and mortgage interest. Property tax will essentially be synthesized as the distribution of property tax, conditional on each MARS value. Mortgage interest will then be synthesized as the distribution of mortgage interest conditional on each record's MARS value, and its conditional property tax. Each covariance is maintained this way: one of each pair of features is synthesized as the distribution conditioned (at least in part) on the other.
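
The sequential conditioning described above can be sketched on toy data. The nearest-donor step here is a stand-in for the CART or random forest model, chosen only so the example runs with the standard library; the data, `k=3`, and the function names are assumptions.

```python
import random

# Toy training records: (MARS, property_tax, mortgage_interest); values are made up.
train = [(m, p * 1000, p * 2000 + 100 * ((p * 7) % 5))
         for m in (1, 2) for p in range(1, 21)]


def synthesize(train, n, k=3, rng=None):
    """Sequentially synthesize n records, seeding on MARS only."""
    rng = rng or random.Random()
    out = []
    for _ in range(n):
        # Seed: copy MARS from a randomly chosen training record.
        mars = rng.choice(train)[0]
        same_mars = [r for r in train if r[0] == mars]
        # Step 1: property tax drawn from its distribution conditional on MARS.
        prop = rng.choice(same_mars)[1]
        # Step 2: mortgage interest drawn from the k donors with the same MARS
        # whose property tax is closest to the synthesized value.
        donors = sorted(same_mars, key=lambda r: abs(r[1] - prop))[:k]
        mort = rng.choice(donors)[2]
        out.append((mars, prop, mort))
    return out


def pearson(xs, ys):
    """Plain Pearson correlation, to avoid depending on any library."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


syn = synthesize(train, n=500, rng=random.Random(1))
corr = pearson([r[1] for r in syn], [r[2] for r in syn])
print(round(corr, 3))
```

Property tax is not a seed here, yet its correlation with mortgage interest carries over because mortgage interest is drawn conditional on the already synthesized property tax.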

donboyd5 commented 5 years ago

Let me dig it out and send email back.

These are a lot of good things to talk about tomorrow. Does 2pm ET work for the two of you?

Don

donboyd5 commented 5 years ago

I matched from puf to syn, rather than from syn to puf. Maybe that is the difference. But let me check.

That's worth discussion, too.

Don

MaxGhenis commented 5 years ago

Would 1PM or 1:30PM be OK?

I think syn->puf is more relevant to privacy concerns, since we want to avoid releasing synthetic records that look too much like real records. The reverse is useful for comprehensiveness--to the extent that real records add value, ensuring they're not totally ignored by the model will probably produce a better synthesis--but outside this particular scope IMO.

feenberg commented 5 years ago

On Thu, 7 Mar 2019, Max Ghenis wrote:

  No, the synthesis maintains covariances by including them in each prediction model. Suppose we only seed on MARS, and then the first two non-seed synthesized features are property tax and mortgage interest. Property tax will essentially be synthesized as the distribution of property tax, conditional on each MARS value.

I understand this. We sample from the property tax values divided into MARS categories.

  Mortgage interest will then be synthesized as the distribution of mortgage interest conditional on each record's MARS value, and its conditional property tax. Each covariance is maintained this way: one of each pair of features is synthesized as the distribution conditioned (at least in part) on the other.

How is this done? "Conditional on" can cover a multitude of possible procedures when the variables are continuous. I am thinking that bins for the cross of MARS and property tax ranges are created, and for each puf record a value of mortgage interest is selected from a record whose MARS and property tax fall into the same bin. But even if the bins start out with large numbers of possible values, don't the bins get very small as the number of variables synthesized increases? I am thinking that 2**65 is a very large number.

I guess I am still confused.

dan

donboyd5 commented 5 years ago

Either of those times is good for me.

Don

MaxGhenis commented 5 years ago

@feenberg We're basically using quantile regression, where the regression incorporates all the seeds and previously synthesized features. So we're predicting the 10th percentile, 20th, 30th, etc., and sampling a random quantile from there to capture the full conditional distribution.

In reality both CART and RF do this nonparametrically, so it's something in between the binning approach you describe and parametric regression models.
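
A minimal sketch of "sampling a random quantile" from the values in one leaf. Linear interpolation between observed values is an assumption made here for illustration; the thread doesn't say whether the actual models interpolate or return observed donor values. The leaf values and function names are hypothetical.

```python
import random


def empirical_quantile(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def sample_conditional(leaf_values, rng):
    """Draw one synthetic value by sampling a random quantile of the leaf."""
    return empirical_quantile(leaf_values, rng.random())


# Made-up values of one feature among training records that share a leaf.
leaf = [1200, 1500, 1500, 2100, 3000, 4800]
rng = random.Random(7)
draws = [sample_conditional(leaf, rng) for _ in range(5)]
print(draws)
```

If a leaf holds only one distinct value, every quantile equals that value, so the draw reproduces the training value no matter which quantile is sampled.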

donboyd5 commented 5 years ago

Hi Max,

I'm really sorry to say this, but somehow the Google Drive file ...synpuf\syntheses\synthpop8.csv is, despite its name, the puf, which you can see from the variable ftype and from the # of records (=# in puf, not 5x that #). I don't know how I did it, but apparently I did and I'm sorry. I know you invested a lot of work in it. As one possible small silver lining, maybe we all benefited nonetheless if it drove you to think about ways to select seed variables that are more methodical and smart than what I've been doing. But I never want to waste anyone's time, so I'm sorry about that.

Anyway, I went back to the file I've been using, which is ...synpuf\synthpop8_stack.csv, which has the synthetic file stacked with a conforming puf (i.e., without aggregate records, and with only the same variables). I have pulled 72 synthetic records that are in the synthetic part of that file but not in the puf part of the file and written them to ...synpuf\synthpop8_selected_nonmatches.csv. These all have MARS=3.

In addition to the synthesized variables it has the following variables of note:

ftype -- identifies the portion of synthpop8_stack.csv that this record comes from (puf or syn); it will be syn for every record because I found no such records in the puf part
rownum -- this is the number of the row in synthpop8_stack.csv in which you can find this record, in case you want to do that. IT IS NOT A COLUMN IN synthpop8_stack.csv -- I CREATED IT AFTER THE FACT -- BUT IT WILL MATCH THE SEQUENCE POSITION OF THE RECORD IN synthpop8_stack.csv.
n -- the number of records in synthpop8_stack.csv that are identical to this record, based on the variables to the right of wt (I did not include wt in the exact match); you will note that every record has either n=10 or n=62 -- there are two different sets of identical records
npuf -- the number of those records (the n records) that came from the puf part of the file; this will be 0 for all
nsyn -- the number of those records that came from the syn part of the file (this will be either 10 or 62 for all)
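
For reference, counts like n, npuf, and nsyn can be sketched from a stacked file as follows. The toy rows are invented, wt is assumed to be already excluded, and the 1-based `rownum` is an assumption about how the row positions were numbered.

```python
from collections import Counter

# Toy stacked file: (ftype, features...) with wt already excluded; values made up.
stack = [
    ("puf", 3, 0, 100),
    ("syn", 3, 0, 100),
    ("syn", 3, 0, 100),
    ("syn", 3, 250, 0),
    ("syn", 3, 250, 0),
    ("puf", 3, 900, 40),
]

# Count identical feature combinations overall and within each part of the file.
n = Counter(row[1:] for row in stack)
npuf = Counter(row[1:] for row in stack if row[0] == "puf")
nsyn = Counter(row[1:] for row in stack if row[0] == "syn")

# Synthetic records with no identical puf record (npuf == 0).
nonmatches = [
    (rownum, row) for rownum, row in enumerate(stack, start=1)
    if row[0] == "syn" and npuf[row[1:]] == 0
]
print(nonmatches)
```

Counting on the stacked file gives both match directions at once: npuf == 0 marks synthetic records absent from the puf, and nsyn == 0 marks puf records never reproduced in the synthesis.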

Again, I'm sorry about this. This highlights the importance of Dan's comment about having file structure and names all in one place. Maybe I can do that by saying more in the Google sheet, or else by writing a Google doc. Anyway, let's include this in our discussion tomorrow.

Don

donboyd5 commented 5 years ago

Dan do you have a preference for 1:00 pm or 1:30 pm (Eastern time) tomorrow (assuming you can make the call)?

donboyd5 commented 5 years ago

I should add that I checked for exact matches in both directions -- within MARS=3, all puf against all syn records and all syn records against all puf records. (This is easy for exact-match checks. Much more computing work for distance measures.)

feenberg commented 5 years ago

On Thu, 7 Mar 2019, Max Ghenis wrote:

@feenberg We're basically using quantile regression, where the regression incorporates all the seeds and previously synthesized features. So we're predicting the 10th percentile, 20th, 30th, etc., and sampling a random quantile from there to capture the full conditional distribution.

But unless the quantile regression imposes some structure on the shape of the distribution, you end up in the end with 10**65 bins, so most bins will have zero entries, but a few will have a single entry. I imagine the quantile regression does impose structure - linear or log-linear of some sort.

It seems to me that the mere fact that the synthesized records are no different from the training records is positive evidence that something is wrong with the methodology, and should not be ascribed to having too many seed variables. You describe the result as "sampling a random quantile". How can the random sample be the same as the training set unless chosen from a universe of one? Isn't that the problem, not the large number of seeds? But apparently increasing the number of seeds decreases the size of leaves from which to sample. Can that be right?

dan

In reality both CART and RF do this nonparametrically, so it's something in between the binning approach you describe and parametric regression models.

donboyd5 commented 5 years ago

Please see my prior note. It's my fault. The synthpop8.csv file Max was looking at was indeed the puf. The note above gives the proper file to use.

Don

feenberg commented 5 years ago

So only 72 synthetic records are not identical to a PUF record? This is a randomization process that is only epsilon away from a simple copy command. This can only happen if the "sampling from a conditional distribution" is sampling from a universe of one for each value in each output record.

Dan

feenberg commented 5 years ago

On Thu, 7 Mar 2019, Don Boyd wrote:

Dan do you have a preference for 1:00 pm or 1:30 pm tomorrow (assuming you can make the call)?

1:30 preferred.

dan

donboyd5 commented 5 years ago

No, it was just a selection of nonmatching records. See Google doc mentioned earlier.

donboyd5 commented 5 years ago

You should begin much earlier in the thread at https://github.com/donboyd5/synpuf/issues/37#issuecomment-470592922 (or earlier) - and also read the google doc mentioned there.

MaxGhenis commented 5 years ago

@donboyd5 no problem, thanks for checking. I re-ran the distance metrics on 1% of synthpop8_stack.csv and found that 25% of records exactly match a training record, or about 3x the share from your earlier synthpop. The median distance is about 0.08, also about a third of that from a previous model. I'll also email the group a case of a pretty complicated record that matches exactly: synthpop8_stack row 866210 matching training record 162458 (or subtract 1 if not zero-indexing). This still suggests to me we need to cut some seeds.
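
For reference, a minimal sketch of the kind of summary quoted above: the share of synthetic records at distance zero from some training record, and the median nearest-record distance. The exact metric is not spelled out in this thread, so plain Euclidean distance on already-standardized features is an assumption here, and the toy points are made up.

```python
import math
import statistics

def nearest_distances(syn_rows, train_rows):
    """Distance from each synthetic row to its closest training row (brute force)."""
    return [min(math.dist(s, t) for t in train_rows) for s in syn_rows]

# Hypothetical standardized features.
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
syn = [(0.0, 0.0), (1.0, 1.2), (2.1, 0.5), (1.0, 1.0)]

d = nearest_distances(syn, train)
exact_share = sum(x == 0 for x in d) / len(d)   # share of exact matches
median_distance = statistics.median(d)
```

The brute-force loop is quadratic in the number of records, which is one reason to run it on a sample rather than the full stacked file.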

@feenberg Linear quantile regression would impose a linear structure on the relationships, but by using RF/CART, we don't impose a structure, nor do we have to define a huge number of bins. These tree methods split on each node (feature) based on semi-random thresholds, and then either recursively improve these splits (CART) or build many trees (RF) to produce the predictions, which are then sampled to generate the conditional quantiles. Here's an explanation of RF: everything is the same in RF regression as it is in RF quantile regression, except for the final stage, where we use the distribution of predictions instead of the mean.
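
To make that final stage concrete, here is a toy sketch that follows the description above, not any particular library's implementation. The per-tree predictions are hypothetical numbers for a single record, and the function names are made up.

```python
import random
import statistics

def rf_mean_prediction(tree_predictions):
    """Ordinary RF regression: average the per-tree predictions."""
    return statistics.mean(tree_predictions)

def rf_sampled_prediction(tree_predictions, rng):
    """Quantile-style synthesis: draw a random quantile of the per-tree predictions."""
    ordered = sorted(tree_predictions)
    q = rng.random()                                  # random quantile in [0, 1)
    idx = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[idx]

# Hypothetical predictions from five trees for one record.
tree_preds = [18000.0, 21000.0, 21000.0, 25000.0, 40000.0]

rng = random.Random(0)
point = rf_mean_prediction(tree_preds)
draws = [rf_sampled_prediction(tree_preds, rng) for _ in range(1000)]
```

Every draw is one of the trees' own predictions. If all trees predicted the same value for a record, every draw would equal it, which is the "universe of one" situation raised earlier in the thread.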

donboyd5 commented 5 years ago

Re @MaxGhenis's earlier comment, I agree, we need to treat puf data as if they are true tax returns, and hold ourselves to that standard. Whether we think that is best or not doesn't matter as it is how SOI wants to view it, so we need to view it that way, also. While I have made some comments about certain kinds of exact matches that shouldn't be worrisome, I think we need to worry about them nonetheless and find best possible ways of eradicating any exact matches that involve non-zero-valued continuous variables (in addition to categorical variables) and perhaps even exact matches that include categoricals and only zero-valued continuous variables - these are good topics for discussion.

That said, in some senses it may be a harder test than comparison to true returns, and in others it might be easier. I think exact matches are likely to be less of a concern vs. true returns (as they will not have been blurred), but I am not sure whether distances will be a harder or easier test. I do believe that after we get fully comfortable with comparisons to puf, we should seek a way to get low-stakes comparisons done against true returns before we face a high-stakes do-or-die test (via SOI) by that approach.

donboyd5 commented 5 years ago

I kept promising to pull together some notes on distance measures. I have been failing, but I have made some progress. You can find what I've done here. I'll try to update it.