donboyd5 / synpuf

Synthetic PUF
MIT License

Making distance calculations manageable #22

Open donboyd5 opened 5 years ago

donboyd5 commented 5 years ago

Issue:

We need a better way to reduce the problem size.

donboyd5 commented 5 years ago

Here is an approach I have used with large matching problems in the past that has worked well:

Anyway, this is one approach that should be practical. Variants are possible, including a variant that incorporates ideas from the predictive mean matching that OSPC does between the PUF and CPS.

MaxGhenis commented 5 years ago

This makes sense but probably precludes using scipy.spatial.distance.cdist, which runs all pairwise comparisons between two tables. Since cdist is more optimized (written in C), how about adapting the approach to still use buckets, comparing each record against its own bucket and the +/-1 neighboring buckets? The bucketing variable could be AGI or something simpler. [image]
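A minimal sketch of the +/-1 bucket idea, to make the shape of the computation concrete. The function name, the argument names, and the integer bucket labels are all hypothetical; how the buckets are built (AGI quantiles or something simpler) is left open, as in the comment above.

```python
import numpy as np
from scipy.spatial.distance import cdist


def nearest_within_buckets(synth, real, synth_bucket, real_bucket):
    """For each synthetic row, find the nearest real row among the real rows
    whose bucket is within +/-1 of the synthetic row's bucket.

    synth, real: 2-D float arrays with the same columns.
    synth_bucket, real_bucket: 1-D integer bucket labels (hypothetical,
    e.g. AGI quantile numbers).
    Returns (index into real, distance); index is -1 where no candidate exists.
    """
    nearest_idx = np.full(len(synth), -1, dtype=int)
    nearest_dist = np.full(len(synth), np.inf)
    for b in np.unique(synth_bucket):
        s_rows = np.flatnonzero(synth_bucket == b)
        r_rows = np.flatnonzero(np.abs(real_bucket - b) <= 1)
        if r_rows.size == 0:
            continue
        # Only this block of the distance matrix is held in memory.
        d = cdist(synth[s_rows], real[r_rows])
        best = d.argmin(axis=1)
        nearest_idx[s_rows] = r_rows[best]
        nearest_dist[s_rows] = d[np.arange(len(s_rows)), best]
    return nearest_idx, nearest_dist
```

Each cdist call still runs in C, but only on one block at a time, so the largest matrix held in memory is one synthetic bucket against three real buckets.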

donboyd5 commented 5 years ago

I think that's a great alternative. If it proves too slow, maybe then we should reconsider the indexing.

MaxGhenis commented 5 years ago

Could you share how you're running Tax-Calculator? Is it from the CLI?

@andersonfrailey what would be the easiest way to call Tax-Calculator on synpuf to (a) ensure validity and (b) calculate AGI, from Python?

donboyd5 commented 5 years ago

I build a Windows system command to call the Tax-Calculator CLI from R, as follows. I presume something similar would be practical in Python.

library(stringr) # for str_sub()

tc.wincmd <- function(tc.fn, tc.dir, tc.cli, taxyear=2013){
  # Build a Windows system command that will call the Tax-Calculator CLI. See:
  #   https://pslmodels.github.io/Tax-Calculator/
  # CAUTION: must use full dir names, not relative to working directory
  # 2013 is the FIRST possible tax year that Tax-Calculator will do

  tc.infile.fullpath <- shQuote(paste0(tc.dir, tc.fn))
  tc.outdir <- shQuote(str_sub(tc.dir, 1, -2)) # must remove trailing "/"

  cmd <- paste0(tc.cli, " ", tc.infile.fullpath, " ", taxyear, " ", "--dump --outdir ", tc.outdir)
  return(cmd)
}

# Here are examples of how the inputs to the function are defined on my machine:
tc.fn <- "tcbase.csv" # a file with variable names as used by Tax-Calculator

# private directory for Tax-Calculator record-level output that we don't want moved from this machine
tc.dir <- "D:/tcdir/"

tc.cli <- "C:/ProgramData/Anaconda3/Scripts/tc" # location of Tax-Calculator command-line interface

system(tc.wincmd(tc.fn, tc.dir, tc.cli))

andersonfrailey commented 5 years ago

@MaxGhenis asked:

what would be the easiest way to call Tax-Calculator on synpuf to (a) ensure validity and (b) calculate AGI, from Python?

I would say what @donboyd5 has done works. Alternatively, we could create a short Python script that takes the name of the file as an argument, runs it through Tax-Calculator, and saves a new file with AGI included. Would we also want to run the file through the reforms we've included in this repo? I'd be happy to work on this.
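A rough sketch of what such a script could look like, assuming it simply wraps the same CLI call as the R function above (input file, tax year, --dump, --outdir). The function names are hypothetical, and it assumes the dump file written by the CLI contains the calculated AGI.

```python
import os
import subprocess


def build_tc_command(tc_cli, infile, outdir, taxyear=2013):
    """Return the argument list for the Tax-Calculator CLI call.

    Uses the same arguments as the R helper: input file, tax year,
    --dump, and --outdir. Paths are made absolute, since the R version
    warns against relative paths.
    """
    return [
        str(tc_cli),
        os.path.abspath(infile),
        str(taxyear),
        "--dump",
        "--outdir",
        os.path.abspath(outdir),
    ]


def run_tc(tc_cli, infile, outdir, taxyear=2013):
    """Run the CLI; raises CalledProcessError if it exits with an error."""
    cmd = build_tc_command(tc_cli, infile, outdir, taxyear)
    # Passing a list (not a single string) avoids shell quoting problems.
    return subprocess.run(cmd, check=True)
```

For example, run_tc("tc", "tcbase.csv", "tcout") would call the CLI on tcbase.csv for 2013, provided tc is on the PATH and the tcout directory exists. A nonzero exit code from the CLI would then serve as a basic validity check on the input file.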

donboyd5 commented 5 years ago

Great item for our call today. I think it would be very valuable to stack one or more files together (e.g., puf and synpuf variants 1 and 2) and then run them through Tax-Calculator. It would be important to not simply duplicate what @feenberg is doing. A few thoughts about that:

donboyd5 commented 5 years ago

Simply FYI: @MaxGhenis I suspect your blocking approach with +/- 1 income groups will be sufficiently fast, but here are two possible additional ideas to consider if it turns out to be too computationally intensive:

The parallelDist package provides a fast parallelized alternative to R’s native dist function to calculate distance matrices for continuous, binary, and multi-dimensional input matrices and offers a broad variety of predefined distance functions from the stats, proxy and dtw R packages, as well as support for user-defined distance functions written in C++.

MaxGhenis commented 5 years ago

Interesting, I'll try the sqeuclidean metric, which this suggests could be faster.

Based on my Python kernel crashing, I suspect the bigger issue is the memory needed to store the large matrices, rather than the computation time. From that perspective, it would be more efficient to find the nearest record for each synthetic record as we go, rather than aggregating the full matrix at the end. This will need to be vectorized, since I'd expect loops to take forever, and parallelizing as parDist does could help. This SO question could be relevant.
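One way to sketch that idea: process the synthetic records in chunks, keep only the nearest match from each chunk, and discard the chunk's distance block before moving on. The function name and the chunk_size default are hypothetical; the loop is over chunks, not individual records, so the work inside each chunk stays vectorized.

```python
import numpy as np
from scipy.spatial.distance import cdist


def nearest_in_chunks(synth, real, chunk_size=1000):
    """Nearest real row for each synthetic row, one chunk of synth at a time.

    Peak memory is about chunk_size * len(real) distances rather than
    len(synth) * len(real).
    """
    nearest_idx = np.empty(len(synth), dtype=int)
    nearest_dist = np.empty(len(synth))
    for start in range(0, len(synth), chunk_size):
        stop = min(start + chunk_size, len(synth))
        # Squared Euclidean distance has the same argmin as Euclidean,
        # so the square root is taken only for the selected minima.
        d2 = cdist(synth[start:stop], real, metric="sqeuclidean")
        best = d2.argmin(axis=1)
        nearest_idx[start:stop] = best
        nearest_dist[start:stop] = np.sqrt(d2[np.arange(stop - start), best])
    return nearest_idx, nearest_dist
```

This could be combined with the bucket approach by passing only the candidate real rows for each bucket, and the chunks are independent, so they could be farmed out to parallel workers.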

MaxGhenis commented 5 years ago

Also https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.cKDTree.html
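A minimal sketch of how cKDTree could be used here, assuming plain Euclidean distance on numeric columns that are already scaled. The function name is hypothetical. The tree is built once on the real records and then queried for every synthetic record, so the full pairwise matrix is never formed.

```python
import numpy as np
from scipy.spatial import cKDTree


def nearest_with_kdtree(synth, real):
    """Nearest real row (Euclidean) for each synthetic row via a k-d tree.

    Returns (index into real, distance), one entry per synthetic row.
    """
    tree = cKDTree(real)                 # build once on the real records
    dist, idx = tree.query(synth, k=1)   # nearest neighbor of each synth row
    return idx, dist
```

One caveat: k-d trees tend to lose their advantage over brute force as the number of columns grows, so whether this beats the blocked cdist approach on the full variable list would need to be measured.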