randomness of taxdata - Githubissues

bodiyang commented 1 year ago

taxdata behaves nondeterministically when producing the puf.csv file. I haven't found a consistent pattern, but here are several cases.

This always happens: on each run the order of the columns is reshuffled, but the puf.csv contents are the same (the projection values are the same). In other words, taxdata randomly reshuffles the column order, e.g. one run produces e09900, g20500, ... MIDR, and another run produces MIDR, e09900, ....
The produced puf.csv files differ dramatically depending on whether I produce the file under the taxdata-dev conda env or not. For example, the weighted total iitax is $2205B when using the puf produced under taxdata-dev, but $7452B when using the puf produced outside this environment. The $2205B figure should be correct, since it basically matches CBO projections. I understand that I should produce the file under the taxdata-dev env. However, I don't see why the environment causes such a big difference: its role is just to declare which software packages are required, so why would it change the calculation?
This happens sometimes: taxdata produces different results on each run, and as a consequence the puf.csv files from different runs produce different projection values.
Jason and I are both trying to produce the puf file on our computers. Even though we are using the same raw 2011 puf and the same versions of taxcalc and taxdata, the produced puf.csv files look different, e.g. they show different weighted total iitax projection values. I feel this happens sometimes, but I'm not very sure. It would be a problem if this is real.

Any idea of why this is happening? @andersonfrailey @jdebacker @MattHJensen

jdebacker commented 1 year ago

I'm wondering whether the difference is with the taxdata package itself or something else (e.g., package versions for taxdata or taxcalc or their dependencies).

To get to the bottom of this, I think we need more systematic testing of this issue.

For example, I ran the make all command twice for taxdata (using the scripts in the master branch -- and in the taxdata-dev environment). The resulting puf.csv files show no difference:

In [1]: import pandas as pd

In [2]: puf1 = pd.read_csv("./data/puf.csv")

In [3]: puf2 = pd.read_csv("./data/puf_20230814_309pm.csv")

In [4]: diff = puf1 - puf2

In [5]: diff.max()
Out[5]:
DSI         0
EIC         0
FLPDYR      0
MARS        0
MIDR        0
           ..
p22250      0
p23250      0
pencon_p    0
pencon_s    0
s006        0
Length: 94, dtype: int64

In [6]: diff.max().max()
Out[6]: 0
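A slightly stricter version of this check can be sketched as follows. It matches columns by label (so a shuffled column order is ignored), verifies that both files have the same columns and row count, and takes absolute differences so that negative deviations are counted too. The helper name max_abs_diff and the tiny frames are made up for illustration; they stand in for two real puf.csv files.

```python
import pandas as pd


def max_abs_diff(left: pd.DataFrame, right: pd.DataFrame) -> float:
    """Return the largest absolute cell-wise difference between two frames.

    Columns are matched by label, so a different column order is ignored.
    Raises ValueError if the column sets or the row counts differ.
    """
    if set(left.columns) != set(right.columns):
        raise ValueError("column sets differ")
    if len(left) != len(right):
        raise ValueError("row counts differ")
    cols = sorted(left.columns)
    a = left[cols].reset_index(drop=True)
    b = right[cols].reset_index(drop=True)
    return float((a - b).abs().max().max())


# Hypothetical stand-ins for two runs of puf.csv (not real PUF data).
run_a = pd.DataFrame({"e09900": [1, 2], "g20500": [3, 4], "MIDR": [0, 1]})
run_b = run_a[["MIDR", "e09900", "g20500"]]  # same values, shuffled columns
run_c = run_b.assign(MIDR=[0, 2])            # one value is larger than in run_a

print(max_abs_diff(run_a, run_b))  # shuffled columns only
print(max_abs_diff(run_a, run_c))  # a real difference in MIDR
```

In the last comparison a plain (run_a - run_c).max().max() reports 0, because the only nonzero difference is negative; the absolute value is what exposes it.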

I also compared the puf_weights.csv.gz generated in two runs of taxdata:

In [17]: diff2 = puf_wgts - puf_wgts2

In [18]: diff2.max()
Out[18]:
WT2011    0
WT2012    0
WT2013    0
WT2014    0
WT2015    0
WT2016    0
WT2017    0
WT2018    0
WT2019    0
WT2020    0
WT2021    0
WT2022    0
WT2023    0
WT2024    0
WT2025    0
WT2026    0
WT2027    0
WT2028    0
WT2029    0
WT2030    0
WT2031    0
WT2032    0
WT2033    0
dtype: int64

Though I did note that these differ from the files checked into the master branches of the Tax-Calculator and TaxData repositories (which are the same as one another):

In [19]: diff_tc_td = puf_wgts_taxcalc - puf_wgts_taxdata

In [20]: diff_tc_td.max()
Out[20]:
WT2011    0
WT2012    0
WT2013    0
WT2014    0
WT2015    0
WT2016    0
WT2017    0
WT2018    0
WT2019    0
WT2020    0
WT2021    0
WT2022    0
WT2023    0
WT2024    0
WT2025    0
WT2026    0
WT2027    0
WT2028    0
WT2029    0
WT2030    0
WT2031    0
WT2032    0
WT2033    0
dtype: int64

andersonfrailey commented 1 year ago

I just created the PUF using the createpuf.py script a couple times and each time the column orders were different, but the values were the same. To find out where the columns are getting shuffled, we could create the PUF multiple times, printing out column order at various points in the creation process to see where they get shuffled.

What if we each did that, then compared 1) where the columns get shuffled to see if it's always happening in the same place 2) the final PUFs to see if the only difference is column order?
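A minimal sketch of that kind of tracing, assuming we can call a small helper at a few points in the creation process. The helper names (record_columns, first_reordered) and the checkpoint labels are hypothetical and are not part of taxdata.

```python
import pandas as pd

column_log = {}  # checkpoint name -> column order seen at that point


def record_columns(name: str, df: pd.DataFrame) -> pd.DataFrame:
    """Store the current column order under `name` and return df unchanged."""
    column_log[name] = list(df.columns)
    return df


def first_reordered(log_a: dict, log_b: dict):
    """Return the first checkpoint whose column order differs between two runs."""
    for name in log_a:
        if log_a[name] != log_b.get(name):
            return name
    return None


# Simulate the logs of two runs with made-up checkpoint names.
frame = pd.DataFrame({"e09900": [1], "g20500": [2], "MIDR": [0]})

record_columns("after_read", frame)
record_columns("after_prep", frame)
run1_log = dict(column_log)

column_log.clear()
record_columns("after_read", frame)
record_columns("after_prep", frame[["MIDR", "e09900", "g20500"]])
run2_log = dict(column_log)

print(first_reordered(run1_log, run2_log))
```

In practice each run would write its log to disk (for example as JSON) so that logs from separate runs, or from different machines, can be compared afterwards.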

> The produced puf.csv are different, dramatically different, depending on whether I produce the file under taxdata-dev conda env or not

I wonder if this is due to the dependency package versions? I know I haven't updated the packages in my taxdata-dev environment in a while, so they're not the latest versions. Maybe there are some changes in the dependency stack that we should worry about? I'm not sure how we test that besides creating a new PUF under all of the dependency version combinations, which sounds incredibly un-fun.
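Short of testing every combination, a cheaper first step might be to record the installed versions in each environment and diff them. This is only a sketch: the package list is an assumption about which dependencies matter, the helper names are made up, and the version strings in the demo are placeholders rather than real observations.

```python
from importlib import metadata


def snapshot_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_snapshots(env_a, env_b):
    """Return {package: (version_a, version_b)} for packages that differ."""
    names = sorted(set(env_a) | set(env_b))
    return {
        n: (env_a.get(n), env_b.get(n))
        for n in names
        if env_a.get(n) != env_b.get(n)
    }


# Run this inside each environment and save the result for comparison.
print(snapshot_versions(["numpy", "pandas", "taxcalc"]))

# Placeholder snapshots showing what a comparison would report.
inside_env = {"numpy": "1.0", "pandas": "1.0", "taxcalc": "1.0"}
outside_env = {"numpy": "1.0", "pandas": "2.0", "taxcalc": "1.0"}
print(diff_snapshots(inside_env, outside_env))
```

Any package reported by diff_snapshots is a candidate to pin or bisect first, which narrows the search considerably compared with rebuilding the PUF under every combination.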

jdebacker commented 1 year ago

@andersonfrailey Thanks for your testing report. I do think the different results were due to different packages being used.

Is the shuffling of columns important? They are labeled, so I believe the order doesn't matter.

andersonfrailey commented 1 year ago

I don't think column shuffling is important. But just so that everyone gets the exact same result when they create the PUF, we could add a line in taxdata/puf/finalprep.py that sorts the columns at the very end.
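Something along these lines is what I have in mind. It is a sketch of the idea only, not the actual finalprep.py code; sort_columns is a made-up helper name and the frames are toy data.

```python
import pandas as pd


def sort_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return df with its columns in sorted label order."""
    return df[sorted(df.columns)]


# Two toy frames with identical values but different column order.
run_a = pd.DataFrame({"e09900": [1, 2], "g20500": [3, 4], "MIDR": [0, 1]})
run_b = run_a[["MIDR", "e09900", "g20500"]]

csv_a = sort_columns(run_a).to_csv(index=False)
csv_b = sort_columns(run_b).to_csv(index=False)

print(csv_a == csv_b)
print(list(sort_columns(run_a).columns))
```

Python's default string sort puts uppercase names before lowercase ones, so MIDR lands ahead of e09900. Any fixed ordering would do, as long as every run applies the same one just before writing the file.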

bodiyang commented 1 year ago

Thanks @andersonfrailey @jdebacker

As a note, I previously tracked down which step in createpuf.py causes the column order to get reshuffled: it is at line 182, when calling finalprep.py.

finalprep.py reads cps-matched-puf.csv to produce puf.csv. The former's column order stays the same across runs, while the latter's order gets reshuffled.

andersonfrailey commented 1 year ago

Fixed with PR #436. Closing.

PSLmodels / taxdata

randomness of taxdata #433