donboyd5 opened this issue 5 years ago
Here is an approach that has worked well for me on large matching problems in the past:
Anyway, this is one approach that should be practical. Variants are possible, including a variant that incorporates ideas from the predictive mean matching that OSPC does between the PUF and CPS.
This makes sense but probably precludes using scipy.spatial.distance.cdist, which runs all pairwise comparisons for two tables. Since cdist is more optimized (written in C), how about adapting the approach to still use buckets, including the +/-1 neighboring buckets, like this (the bucket variable could be AGI or something simpler):
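A minimal sketch of that bucketed-cdist idea, assuming pandas DataFrames syn and puf that each carry an integer bucket column (the column name and the match_within_buckets helper are hypothetical, used here only for illustration):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def match_within_buckets(syn, puf, match_vars, bucket_col="bucket"):
    """For each synthetic record, return the index of the nearest PUF record,
    restricting candidates to PUF records whose bucket is within +/-1 of the
    synthetic record's bucket."""
    matches = pd.Series(index=syn.index, dtype="float64")
    for b in sorted(syn[bucket_col].unique()):
        syn_blk = syn[syn[bucket_col] == b]
        puf_blk = puf[puf[bucket_col].between(b - 1, b + 1)]  # same bucket +/- 1
        if puf_blk.empty:
            continue
        # all pairwise distances, but only within this small block
        d = cdist(syn_blk[match_vars], puf_blk[match_vars])
        matches.loc[syn_blk.index] = puf_blk.index[d.argmin(axis=1)]
    return matches
```

Because each cdist call sees only a few buckets of donors, the distance matrices stay small even when the full tables are large.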
I think that's a great alternative. If it proves too slow, maybe then we should reconsider the indexing.
On Fri, Dec 14, 2018 at 7:49 PM Max Ghenis notifications@github.com wrote: [quoted comment above]
Could you share how you're running Tax-Calculator? Is it from the CLI?
@andersonfrailey what would be the easiest way to call Tax-Calculator on synpuf to (a) ensure validity and (b) calculate AGI, from Python?
I build a Windows system command to call the CLI from R, as follows. I presume something similar would be practical in Python.
tc.wincmd <- function(tc.fn, tc.dir, tc.cli, taxyear=2013){
  # Build a Windows system command that will call the Tax-Calculator CLI. See:
  # https://pslmodels.github.io/Tax-Calculator/
  # CAUTION: must use full directory paths, not paths relative to the working directory
  # 2013 is the FIRST tax year that Tax-Calculator will compute
  # requires library(stringr) for str_sub()
  tc.infile.fullpath <- shQuote(paste0(tc.dir, tc.fn))
  tc.outdir <- shQuote(str_sub(tc.dir, 1, -2)) # must remove trailing "/"
  cmd <- paste0(tc.cli, " ", tc.infile.fullpath, " ", taxyear, " ", "--dump --outdir ", tc.outdir)
  return(cmd)
}
# Here are examples of how the inputs to the function are defined on my machine:
tc.fn <- "tcbase.csv" # a file with variable names as used by Tax-Calculator
# private directory for Tax-Calculator record-level output that we don't want moved from this machine
tc.dir <- "D:/tcdir/"
tc.cli <- "C:/ProgramData/Anaconda3/Scripts/tc" # location of Tax-Calculator command-line interface
system(tc.wincmd(tc.fn, tc.dir, tc.cli))
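From Python, a similar command can be built and run with subprocess. This is a hypothetical sketch mirroring the R function above, not something tested against a live Tax-Calculator install:

```python
import subprocess

def tc_cmd(tc_fn, tc_dir, tc_cli, taxyear=2013):
    """Build the argument list for a Tax-Calculator CLI call.
    tc_dir must be a full path; 2013 is the earliest tax year tc supports."""
    infile = tc_dir + tc_fn
    outdir = tc_dir.rstrip("/")  # drop the trailing "/" for --outdir
    return [tc_cli, infile, str(taxyear), "--dump", "--outdir", outdir]

# Example using the paths from the R snippet above:
# subprocess.run(tc_cmd("tcbase.csv", "D:/tcdir/", "C:/ProgramData/Anaconda3/Scripts/tc"),
#                check=True)
```

Passing an argument list (rather than one shell string) avoids the quoting that shQuote handles on the R side.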
@MaxGhenis asked:
what would be the easiest way to call Tax-Calculator on synpuf to (a) ensure validity and (b) calculate AGI, from Python?
I would say what @donboyd5 has done works. Alternatively, we could create a short Python script that takes the name of the file as an argument, runs it through Tax-Calculator, and saves a new file with AGI included. Would we also want to run the file through the reforms we've included in this repo? I'd be happy to work on this.
Great item for our call today. I think it would be very valuable to stack one or more files together (e.g., puf and synpuf variants 1 and 2) and then run them through Tax-Calculator. It would be important to not simply duplicate what @feenberg is doing. A few thoughts about that:
Simply FYI: @MaxGhenis I suspect your blocking approach with +/-1 income groups will be sufficiently fast, but here are two additional ideas to consider if it turns out to be too computationally intensive:
The parallelDist package provides a fast parallelized alternative to R’s native dist function to calculate distance matrices for continuous, binary, and multi-dimensional input matrices and offers a broad variety of predefined distance functions from the stats, proxy and dtw R packages, as well as support for user-defined distance functions written in C++.
Interesting, I'll try sqeuclidean, which this suggests could be faster.
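One reason sqeuclidean is a safe swap: the square root is monotonic, so the nearest donor under squared Euclidean distance is the same as under Euclidean distance. A quick check on random stand-in data illustrates this:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
syn = rng.normal(size=(50, 3))    # stand-in synthetic records
puf = rng.normal(size=(200, 3))   # stand-in donor records

# sqeuclidean skips the square root; because sqrt is monotonic,
# each record's nearest donor is identical under either metric
near_euc = cdist(syn, puf, metric="euclidean").argmin(axis=1)
near_sq = cdist(syn, puf, metric="sqeuclidean").argmin(axis=1)
assert (near_euc == near_sq).all()
```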
Based on my Python kernel crashing, I suspect the bigger issue is the large matrices being stored in memory, rather than the computation time. More efficient from this perspective would then be finding the nearest record for each synthetic record, rather than aggregating the full matrix at the end. This will need to be vectorized since I'd expect loops to take forever, and parallelizing like parDist could help. This SO question could be relevant.
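One possible sketch of that idea: process the synthetic records in chunks and keep only each chunk's argmin, so the full distance matrix never exists in memory (the chunk size here is an assumption to tune against available RAM):

```python
import numpy as np

def nearest_indices(syn, puf, chunk_size=1000):
    """For each row of syn, return the index of the nearest row of puf
    (squared Euclidean), without ever holding the full
    len(syn) x len(puf) distance matrix in memory."""
    out = np.empty(len(syn), dtype=np.intp)
    for start in range(0, len(syn), chunk_size):
        blk = syn[start:start + chunk_size]
        # distances for this chunk only, shape (chunk, n_puf)
        d = ((blk[:, None, :] - puf[None, :, :]) ** 2).sum(axis=2)
        out[start:start + chunk_size] = d.argmin(axis=1)
    return out
```

Peak memory is then roughly chunk_size * len(puf) floats instead of len(syn) * len(puf), and each chunk could also be handed to a separate worker if parallelism helps.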
Issue:
We need a better way to reduce the problem size.