Making distance calculations manageable

donboyd5 commented 5 years ago

Issue:

We plan to calculate distances between records in the synthetic PUF and records in the actual PUF, to help evaluate disclosure risk.
With ~160k records in each file, if we computed every distance it would be 25.6 billion distances (160k^2).
That's a lot, even for modern equipment, so it makes sense to look for ways to reduce the problem size.
One way people do this is by breaking a file up into mutually exclusive blocks or segments, and only comparing records in one file with records in the same segment in the other file, as they do in the paper in #20.
That works well with obviously mutually exclusive groups. For example, we can choose to do this with marital status (MARS), only comparing marrieds to marrieds, and non-marrieds to non-marrieds. That breaks the file (taking liberty with rounding) into two segments of approximately 80k records each, and we now have to do only 12.8 billion comparisons (80k x 80k x 2).
The problem is, we really don't have a lot of variables that form obviously mutually exclusive groups, other than MARS. Obviously we don't need to compute the distance between a record with $10k AGI and a record with $5m AGI, so we could preclude that kind of calculation, but if we divide the file up into income strata with a break at $25k, would we want to preclude a comparison between a record in A, with AGI of $24,999 in the <=$25k segment, with a record in B, with AGI of $25,001, in the > $25k to <=$50k segment? No.

We need a better way to reduce the problem size.

donboyd5 commented 5 years ago

Here is an approach I have used with large matching problems in the past that has worked well:

Call the two files to be compared A and B.
Let's assume we can construct a variable on each file that will help us decide whether to compare a record in A with a record in B. For example, we know (I think) that we don't even care what the distance is between the $10k AGI record in A and the $5m AGI record in B from the example above. We know, prima facie, that they are not worth comparing. So income might be a good variable. For computational and programming convenience, we might just add major positive income items together (wages, dividends, interest, maybe a few others). If we were nitpicky, we might run the files through Tax-Calculator and get AGI but that seems like overkill to me.
We don't want to break the files into hard groups because of the example above - we ought to want to compute distances between a $24,999 income record in A and a $25,001 income record in B.
We sort the records in each file/marital-status group by income and assign a percentile rank to each record in A and B within its group. For example, the $24,999 A-record might have percentile 7 in its A-group, and the $25,001 B-record might have percentile 8 in its B-group.
We set a rule for determining what distances we will calculate. For example, we might compute distances for all records that are within +/- 5 percentile points of each other in their respective marital status block. A 7th percentile record in A might be compared with all B records that fall in the 2nd to 12th percentile in B. A 43rd percentile record in A might be compared with all B records that fall in the 38th to 48th percentile in B. (We might or might not adjust this rule for the extreme ends of the file.)
We use a fast indexing scheme. We loop through every record in A and index into B to define the records in B that we will compare against, and compute distances only versus those records.
Under this approach we will compare every record in each A marital block of 80k records with approximately 8k records in the corresponding marital block of B. The problem size is now 80k x 8k x 2 = 1.28 billion distance calculations (ignoring the cost of determining which calculations to do). That might be tractable.
If not, we might want to narrow the percentile range, or look for another blocking variable that might allow further breaking of the file into mutually exclusive groups. (We could consider XTOT, but I'm not sure we'd want to disallow comparisons between records where, for example, XTOT is 4 and XTOT is 5. However, some grouping might be possible.)

Anyway, this is one approach that should be practical. Variants are possible, including a variant that incorporates ideas from the predictive mean matching that OSPC does between the PUF and CPS.

MaxGhenis commented 5 years ago

This makes sense but probably precludes usage of scipy.cdist, which runs all pairwise comparisons for two tables. Since this is more optimized (written in C) how about adapting it to still use buckets, and include +/-1 bucket, like this (could be AGI or something simpler):

donboyd5 commented 5 years ago

I think that's a great alternative. If it proves too slow, maybe then we should reconsider the indexing.

On Fri, Dec 14, 2018 at 7:49 PM Max Ghenis notifications@github.com wrote:

This makes sense but probably precludes usage of scipy.cdist, which runs all pairwise comparisons for two tables. Since this is more optimized (written in C) how about adapting it to still use buckets, and include +/-1 bucket, like this (could be AGI or something simpler): [image: image] https://user-images.githubusercontent.com/6076111/50036753-41f24400-ffc0-11e8-835b-bb557054a2f0.png

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/donboyd5/synpuf/issues/22#issuecomment-447520736, or mute the thread https://github.com/notifications/unsubscribe-auth/AGPEmLDzRXNoDbptGVyOEMpnwf06UQH5ks5u5Ec3gaJpZM4ZRMv0 .

MaxGhenis commented 5 years ago

Could you share how you're running Tax-Calculator? Is it from the CLI?

@andersonfrailey what would be the easiest way to call Tax-Calculator on synpuf to (a) ensure validity and (b) calculate AGI, from Python?

donboyd5 commented 5 years ago

I build a Windows system command to call CLI from R, as follows. I presume something similar would be practical in Python.

tc.wincmd <- function(tc.fn, tc.dir, tc.cli, taxyear=2013){
  # Build a Windows system command that will call the Tax-Calculator CLI. See:
  #   https://pslmodels.github.io/Tax-Calculator/
  # CAUTION: must use full dir names, not relative to working directory
  # 2013 is the FIRST possible tax year that Tax-Calculator will do

  tc.infile.fullpath <- shQuote(paste0(paste0(tc.dir, tc.fn)))
  tc.outdir <- shQuote(str_sub(tc.dir, 1, -1)) # must remove trailing "/"

  cmd <- paste0(tc.cli, " ", tc.infile.fullpath, " ", taxyear, " ", "--dump --outdir ", tc.outdir)
  return(cmd)
}

# Here are examples of how the inputs to the function are defined on my machine:
tc.fn <- "tcbase.csv" # a file with variable names as used by Tax-Calculator

# private directory for Tax-Calculator record-level output that we don't want moved from this machine
tc.dir <- "D:/tcdir/"

tc.cli <- "C:/ProgramData/Anaconda3/Scripts/tc" # location of Tax-Calculator command-line interface

system(tc.wincmd(tc.fn, tc.dir, tc.cli))

andersonfrailey commented 5 years ago

@MaxGhenis asked:

what would be the easiest way to call Tax-Calculator on synpuf to (a) ensure validity and (b) calculate AGI, from Python?

I would say what @donboyd5 has done works. Alternatively we could create a short python script that does something like take the name of the file as an argument, run it through Tax-Calculator, and save a new file with AGI included. Would we also want to run the file through the reforms we've included in this repo? I'd be happy to work on this.

donboyd5 commented 5 years ago

Great item for our call today. I think it would be very valuable to stack one or more files together (e.g., puf and synpuf variants 1 and 2) and then run them through Tax-Calculator. It would be important to not simply duplicate what @feenberg is doing. A few thoughts about that:

of course it would be Tax-Calculator which is probably the most important end use so that alone is valuable
might make sense to do fewer reforms but go deeper - e.g., pick a few of the most important reforms in the @feenberg list, and examine how performance on the reforms varies across important subgroups such as AGI range, and marital status
and do some diagnostic detective work to figure out, where the files perform differently on different reforms, what makes them perform differently, with the goal of providing feedback to @MaxGhenis to see which variables, in which parts of the distribution or in which combinations with other variables, seem in need of improvement, and of course to determine extent to which differences may be related to weight differences as opposed to file differences
because of that last point, it might make sense to run some reforms on unweighted files so that weights are removed as an explanation for differences

donboyd5 commented 5 years ago

Simply FYI: @MaxGhenis I suspect your blocking approach with +/- 1 income groups will be sufficiently fast, but here are two possible additional ideas to consider if it turns out to be too computationally-intensive:

For rankings of Euclidean distances, you don't need to take the square root, which this argues is computationally expensive, although I would not have thought so
parDist could be called from Python. It says,

The parallelDist package provides a fast parallelized alternative to R’s native dist function to calculate distance matrices for continuous, binary, and multi-dimensional input matrices and offers a broad variety of predefined distance functions from the stats, proxy and dtw R packages, as well as support for user-defined distance functions written in C++.

MaxGhenis commented 5 years ago

Interesting, I'll try the sqeuclidean which this suggests could be faster.

Based on my Python kernel crashing, I suspect the bigger issue is the large matrices being stored in memory, rather than the computation time. More efficient from this perspective would then be finding the nearest record for each synthetic record, rather than aggregating the full matrix at the end. This will need to be vectorized since I'd expect loops to take forever, and parallelizing like parDist could help. This SO question could be relevant.

MaxGhenis commented 5 years ago

Also https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.cKDTree.html

donboyd5 / synpuf

Making distance calculations manageable #22