Open MaxGhenis opened 6 years ago
I prefer to keep it general - Synthetic Household File or Synthetic Household Policy-Analysis File. I am not sure they will always be tax units; they certainly won't always be (and aren't always now) tax-filing units. If we do state-level analysis we may be very concerned about sales taxes or benefit issues, which won't always be driven by income tax filing status.
That said, it is just a preference. I don't feel strongly.
I agree on generality, which led me to the term "microdata," but I also don't feel strongly.
This raises questions around how these other enhancements play with the synthetic PUF (ideally as well as the real one, if Approach A in #11 is adopted), and what the real-PUF version of SHF would be called.
One potential end state would be to create two libraries:
- `synpuf`, which has one key function, `synthesize()`, taking the raw PUF file and producing the synthetic one.
- `taxdata`, which has one key function, `enhance()`, taking either the raw or synthetic PUF and producing the file used for Tax-Calculator. The synthetic file could be called SHF, real-PUF one name TBD.

All is to say, not sure we should rush to change this right now.
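To make the two-library split concrete, here is a rough sketch of how the pieces could fit together. Only the names `synthesize()` and `enhance()` come from the proposal above; the signatures, the record layout, the column names (`wages`, `interest`), and the placeholder logic inside each function are all assumptions for illustration, not how either library works or would work.

```python
import random
from typing import Dict, List

# Hypothetical record layout: one dict per tax unit, column name -> value.
Record = Dict[str, float]


def synthesize(raw_puf: List[Record], seed: int = 0) -> List[Record]:
    """Stand-in for the synpuf step: raw PUF in, synthetic file out.

    Placeholder logic only: each column is resampled independently from
    the raw records. This is NOT a real synthesis method.
    """
    if not raw_puf:
        return []
    rng = random.Random(seed)
    columns = list(raw_puf[0].keys())
    return [
        {col: rng.choice(raw_puf)[col] for col in columns}
        for _ in raw_puf
    ]


def enhance(puf: List[Record]) -> List[Record]:
    """Stand-in for the taxdata step: accepts a raw OR synthetic PUF and
    returns the file used downstream.

    Placeholder logic only: adds an illustrative 'nonfiler' flag.
    """
    return [{**rec, "nonfiler": 0.0} for rec in puf]


# Hypothetical toy input with made-up column names and values.
raw = [
    {"wages": 50000.0, "interest": 120.0},
    {"wages": 82000.0, "interest": 0.0},
    {"wages": 31000.0, "interest": 45.0},
]

shf = enhance(synthesize(raw))   # synthetic path (the "SHF")
real_version = enhance(raw)      # real-PUF path, name TBD
```

The point of the sketch is only that `enhance()` does not need to know which kind of input it received, so the same enhancement code serves both the synthetic and the real-PUF file.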
@MattHJensen @andersonfrailey
The two-function approach @MaxGhenis describes seems very clean to me. Even if, in practice, there is human intervention each time we run either of the steps (examining results, etc.), it is a nice separation that keeps the projects distinct and working well with each other. Curious for others' thoughts on this and on issue #11.
For our email to SOI, we decided to change the name to Synthetic Household File (SHF?) to better communicate that this project extends beyond the PUF (incorporating nonfilers, imputing other features, potentially different record count, etc.). This may also preemptively avoid a naming conflict with the TPC project, which seems like it might be called "synthetic PUF."
Should we adopt this name generally?
Also, should we consider a term other than "household," since we're looking at tax units? For example, "Synthetic Microdata File"? Other ideas welcome.