Set `simulation.use_reported_state_income_tax = True` in `Reproduce in Python` for US nationwide reforms

MaxGhenis commented 4 months ago

Reproduce in Python does not currently reproduce the US microsimulation results in the web app, because we don't set the simulation.use_reported_state_income_tax parameter to True.

anth-volk commented 4 months ago

Will work on this, but need to chat with Nikhil, as when I try running reforms in Python notebooks, I receive other errors and am thus unable to proceed

MaxGhenis commented 4 months ago

Have you tried Colab? e.g. the notebook linked in https://github.com/PolicyEngine/policyengine-us/issues/3634

anth-volk commented 4 months ago

Unfortunately, yes. I ran a policy reform within the app (increased the rate on IRS tax bracket 3) and received an arcane syntax error. I then ran it with a different one, but I'm forgetting the outcome. Let me try again with the notebook you linked, perhaps I made some sort of mistake?

anth-volk commented 4 months ago

Even after integrating the use_reported_state_income_tax setting into the notebook, I receive an error indicating that:

TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

The full notebook is available here. Is there a bug in another portion of the Reproduce in Python code that I'm missing? Or am I just bad at setting up Colab notebooks?

MaxGhenis commented 4 months ago

I'd suggest testing with calc rather than calculate_dataframe, which is finicky in several regards. See #653

anth-volk commented 4 months ago

Could definitely do that; that said, what I have in the Colab notebook is what we output currently in the app. Should #653 be reopened to change that verbiage?

nikhilwoodruff commented 4 months ago

We should remove in_poverty from the snippet or remove the last line

anth-volk commented 4 months ago

Upon further testing - if we merely remove in_poverty and maintain the dataframes, everything works properly. However, with the simple calculate, I actually receive the following error:

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

[<ipython-input-6-4c6a3821c8c3>](https://localhost:8080/#) in <cell line: 26>()
     24 reformed = Microsimulation(reform=reform)
     25 HOUSEHOLD_VARIABLES = ["person_id", "household_id", "age", "household_net_income", "household_income_decile", "household_tax", "household_benefits"]
---> 26 baseline_person_df = baseline.calculate(HOUSEHOLD_VARIABLES, 2024)
     27 reformed_person_df = reformed.calculate(HOUSEHOLD_VARIABLES, 2024)
     28 difference_person_df = reformed_person_df - baseline_person_df

1 frames

[/usr/local/lib/python3.10/dist-packages/policyengine_core/simulations/simulation.py](https://localhost:8080/#) in calculate(self, variable_name, period, map_to, decode_enums)
    371         )
    372 
--> 373         np.random.seed(hash(variable_name + str(period)) % 1000000)
    374 
    375         try:

TypeError: can only concatenate list (not "str") to list

It seems to me that switching over to calculate would require a bit more reworking, so if you're looking for a quick turnaround that ensures this code works properly, I could merely remove in_poverty for all contexts and add the use_reported_state_income_tax for US-wide simulations, then we could keep #653 open and work on that separately.

MaxGhenis commented 4 months ago

calculate/calc only takes a single variable at a time

anth-volk commented 4 months ago

As I'm looking through this, does it make sense to replace calculate_dataframe with calculate, since we'd have to place the resulting series of series within some sort of container to allow for an extra dimension of data? I noticed that Nikhil also seems to have objected to using the calculate function in the original #653.

MaxGhenis commented 4 months ago

We encourage analysts to use calc. @nikhilwoodruff if you stand by that comment could you share your justification? Microdataframes have caused several bugs in usage and most of the analysis I've seen uses series.

anth-volk commented 4 months ago

The main reason I ask is only because it doesn't appear to be possible to use calc to calculate all of the output variables we currently include over the data series we include without separately running a calculation for each output variable. Would you like that written into the code?

Alternatively, I worked up a Colab notebook that removes all of the output variables except household net income with the CTC as an example policy - would that be the preferred direction to move in?

MaxGhenis commented 4 months ago

Let's prepopulate with the code from your notebook

anth-volk commented 4 months ago

Reopening due to #1274.

PolicyEngine / policyengine-app

Set `simulation.use_reported_state_income_tax = True` in `Reproduce in Python` for US nationwide reforms #1179