PSLmodels / Tax-Calculator

USA Federal Individual Income and Payroll Tax Microsimulation Model
https://taxcalc.pslmodels.org
Other
264 stars 158 forks source link

Questions about default_data() function in parameters.py #342

Closed martinholmer closed 9 years ago

martinholmer commented 9 years ago

There is no documentation on this function, so I'm confused about what it is supposed to return.

Read the following test (which is on the master branch) and look for my questions below.

def test_parameters_get_default_start_year():
    paramdata = taxcalc.parameters.default_data(start_year=2015, metadata=True)
    # 1D data, has 2015 values
    meta_II_em = paramdata['_II_em']
    assert meta_II_em['start_year'] == 2015
    assert meta_II_em['row_label'] == ["2015"]
    assert meta_II_em['value'] == [4000]

    # 2D data, has 2015 values
    meta_std_aged = paramdata['_STD_Aged']
    assert meta_std_aged['start_year'] == 2015
    assert meta_std_aged['row_label'] == ["2015"]
    assert meta_std_aged['value'] == [[1550, 1250, 1250, 1550, 1550, 1250]]

    # 1D data, doesn't have 2015 values, is CPI inflated
    meta_amt_thd_marrieds = paramdata['_AMT_thd_MarriedS']
    assert meta_amt_thd_marrieds['start_year'] == 2015
    assert meta_amt_thd_marrieds['row_label'] == ["2015"]
    # Take the 2014 rate and multiply by inflation for that year
    assert meta_amt_thd_marrieds['value'] == (
        [41050 * (1.0 + Parameters._Parameters__rates[2014])])

    # 1D data, doesn't have 2015 values, is not CPI inflated
    meta_kt_c_age = paramdata['_KT_c_Age']
    assert meta_kt_c_age['start_year'] == 2015
    assert meta_kt_c_age['row_label'] == ""
    assert meta_kt_c_age['value'] == [24]

    # 1D data, does have 2015 values, goes up to 2018
    meta_ctc_c = paramdata['_CTC_c']
    assert meta_ctc_c['start_year'] == 2015
    assert meta_ctc_c['row_label'] == ["2015", "2016", "2017", "2018"]
    assert meta_ctc_c['value'] == [1000, 1000, 1000, 500]

I have two sets of questions about the above test.

First, why does the _AMT_thd_MarriedS parameter return ["2015"] for its 'row_label' attribute when the _KT_c_Age parameter returns ""? Why the difference? Neither have a 2015 row label value in params.json, so why are they treated differently?

Second, why does the _CTC_c parameter have row labels and values beyond 2015 when almost all the other parameters stop at 2015? To what use are the 2016-2018 values of this parameter being put by scripts calling the default_data() function? In other words, does the webapp really use default_data() results for years beyond 2015?

talumbau commented 9 years ago

First, why does the _AMT_thd_MarriedS parameter return ["2015"] for its 'row_label' attribute when the _KT_c_Age parameter returns ""? Why the difference? Neither have a 2015 row label value in params.json, so why are they treated differently?

_KT_c_Age has the empty string for a row label, whereas _AMT_thd_MarriedS has a row label of ["2013", "2014"], so the convention for the function is currently 'if there is currently something in row_label, we assume those are year entries and they need to be incremented if we are given a new start_year. If row_label is the empty string, just give that back as the row_label for that parameter.'

Second, why does the _CTC_c parameter have row labels and values beyond 2015 when almost all the other parameters stop at 2015? To what use are the 2016-2018 values of this parameter being put by scripts calling the default_data() function? In other words, does the webapp really use default_data() results for years beyond 2015?

I don't know why _CTC_c has values beyond 2015, while the other parameters don't. I'm assuming it's because these values were readily available and so were just dumped in. Maybe @Amy-Xu can add detail. There was an issue filed some time ago (can't seem to find it right now) about row labels but I don't think we strictly enforce anything here, so there is room for a different implementation in terms of what to return for row_label values. In terms of the webapp, I looked quickly through the code. We use the value for the year specified in the start_year argument to populate the webapp for the "grayed out" initial values. Next year, we would increment that to 2016, but for this year, we use the 2015 value for each parameter.

talumbau commented 9 years ago

Also, it is bad that this function has no doc string. That's on me since I wrote it. One result of this issue should be the creation of a doc string for the function.

martinholmer commented 9 years ago

T.J. @talumbau said in part:

_KT_c_Age has the empty string for a row label, whereas _AMT_thd_MarriedS has a row label of ["2013", "2014"], so the convention for the function is currently 'if there is currently something in row_label, we assume those are year entries and they need to be incremented if we are given a new start_year. If row_label is the empty string, just give that back as the row_label for that parameter.

Well, I guess I could make the new default_data classmethod work that way, but the logic doesn't make any sense to me. What am I missing? Don't TaxBrain users want to know the year for which you are showing them the value of _KT_c_Age? We are showing them the value for 2015 (right?), so why not have 2015 as the row label? Is the empty row_label for _KT_c_Age in params.json simply because Amy @Amy-Xu has been working on so many other more important issues? Or, does it signify something significant?

martinholmer commented 9 years ago

T.J. @talumbau said in part:

I don't know why _CTC_c has values beyond 2015, while the other parameters don't. I'm assuming it's because these values were readily available and so were just dumped in. Maybe @Amy-Xu can add detail. There was an issue filed some time ago (can't seem to find it right now) about row labels but I don't think we strictly enforce anything here, so there is room for a different implementation in terms of what to return for row_label values. In terms of the webapp, I looked quickly through the code. We use the value for the year specified in the start_year argument to populate the webapp for the "grayed out" initial values. Next year, we would increment that to 2016, but for this year, we use the 2015 value for each parameter.

I know why the 2016-2018 values are in the params.json file and that is just fine. My questions were how the TaxBrain webapp uses the 'value' attribute returned by the current default_data() function. And, it seems you have answered my question. You are saying that the only parameter 'value' information used by TaxBrain is the 'value' for the year specified in the default_data(...) function call, which is currently 2015 but will increase to 2016 sometime in the future. Right?

talumbau commented 9 years ago

Well, I guess I could make the new default_data classmethod work that way, but the logic doesn't make any sense to me. What am I missing? Don't TaxBrain users want to know the year for which you are showing them the value of _KT_c_Age? We are showing them the value for 2015 (right?), so why not have 2015 as the row label? Is the empty row_label for _KT_c_Age in params.json simply because Amy @Amy-Xu has been working on so many other more important issues? Or, does it signify something significant?

The logic of which years go with which values is based on start_year for a parameter, which all parameters are required to have. Now that I look closely at it, it seems that row_label could really just be derived from looking at start_year. I don't know the story behind the creation of row_label. We definitely need col_label, but row_label seems like it could only provide information that would conflict with start_year. In other words, if you know start_year, and you have some sequence of either 1D or 2D parameters, you can get the proper row labels. Also, if row_label is specified, since you already have start_year, it is either redundant information or, if it is not redundant, it is conflicting. That's what it currently looks like to me. I think row_labels with the empty string are just the result of no one having the time to fill them in.

talumbau commented 9 years ago

I know why the 2016-2018 values are in the params.json file and that is just fine. My questions were how the TaxBrain webapp uses the 'value' attribute returned by the current default_data() function. And, it seems you have answered my question. You are saying that the only parameter 'value' information used by TaxBrain is the 'value' for the year specified in the default_data(...) function call, which is currently 2015 but will increase to 2016 sometime in the future. Right?

I don't think it's quite that simple. You can look at nearly all of the uses of default_data here:

https://github.com/OpenSourcePolicyCenter/webapp-public/blob/master/webapp/apps/taxbrain/helpers.py

The issue is that the user may specify a sequence of values for a parameter. Since they are forced to start at 2015, there are some parameters where their choice overwrites a default parameter value, but some where the params.json has a value that we need to use. The situation is complicated by the 2D parameters, since a user may specify a sequence of values for one part of a 2D parameter (so some value that only applies to singles, for example) but then only enter one value for the rest. So they are inputting a multi-year reform but only specifying the values for part of that parameter, and then either default values themselves or our hard-coded inflation rates kick in for the other parts of that parameter. So my comment above is not fully correct: we use default_data for filling in the grayed out boxes on TaxBrain, but we also use default_data as a starting point for the user specified reform, modifying as needed based on what they have entered.

martinholmer commented 9 years ago

T.J. @talumbau said:

The logic of which years go with which values is based on start_year for a parameter, which all parameters are required to have. Now that I look closely at it, it seems that row_label could really just be derived from looking at start_year. I don't know the story behind the creation of row_label. We definitely need col_label, but row_label seems like it could only provide information that would conflict with start_year. In other words, if you know start_year, and you have some sequence of either 1D or 2D parameters, you can get the proper row labels. Also, if row_label is specified, since you already have start_year, it is either redundant information or, if it is not redundant, it is conflicting. That's what it currently looks like to me. I think row_labels with the empty string are just the result of no one having the time to fill them in.

In the new default_data Parameters classmethod (as opposed to the legacy global default_data function), I'm doing this to set the 'row_label' attribute:

           [snip]
           value_year_str = '{}'.format(value_year)
           for name, data in params.items():
                data['start_year'] = value_year
                data['row_label'] = [value_year_str]
                [snip]
           [snip]

where value_year is the default_data argument that used to be called start_year (changed to eliminate possible confusion with Parameters.start_year, which means something totally different). So, it would seem that it is easy to eliminate in this way any inconsistencies between the TaxBrain START_YEAR and the 'row_label' returned by default_data.

In terms of the content of params.json, perhaps I should add a test to ensures all parameters have row_label lists. Does that make sense? Amy @Amy-Xu do you have any thoughts on this test.

talumbau commented 9 years ago

I think it would be good to get an answer to this question: for a given parameter, if I know the start_year for a parameter and its value, is there some situation in which I would be surprised by its row_label? Another way to ask the question: is there any circumstance under which a correct row_label is other than:

row_label = [str(start_year + i) for i in range(len(value))]
talumbau commented 9 years ago

Oh, the reason to ask the question is that, if the answer is 'no', then maybe we should remove row_label from params.json, since it would either be redundant or a source of potential error.

martinholmer commented 9 years ago

T.J. @talumbau said:

I think it would be good to get an answer to this question: for a given parameter, if I know the start_year for a parameter and its value, is there some situation in which I would be surprised by its row_label? Another way to ask the question: is there any circumstance under which a correct row_label is other than: row_label = [str(start_year + i) for i in range(len(value))]

I think that is a very good question, which I'll try to answer in the next few minutes.

What I've just done is see how many of the parameters in params.json do not have a row_labels attribute that is a list. The answer seems to be that 57 out of the total 84 do not have a row_labels list.

So, if I can confirm your hypothesis for the 27 parameters with a row_label list, I suppose we should drop the row_label attribute from params.json and generate a row_labels list in the code using your statement above.

martinholmer commented 9 years ago

Here is the answer to the question posed by T.J. @talumbau:

___________________________ test_params_json_content ___________________________

    def test_params_json_content():
        ppo = Parameters()
        params = getattr(ppo, '_vals')
        num_parameters = 0
        num_missing_row_labels = 0
        for name, data in params.items():
            num_parameters += 1
            start_year = data.get('start_year')
            assert isinstance(start_year, int)
            assert start_year == Parameters.JSON_START_YEAR
            row_label = data.get('row_label')
            if isinstance(row_label, list):
                value = data.get('value')
                expected = [str(start_year + i) for i in range(len(value))]
                assert row_label == expected
            else:
                num_missing_row_labels += 1
        if num_missing_row_labels != 0:
            sys.stdout.write('num_parameters={}\n'.format(num_parameters))
>           assert num_missing_row_labels == 0
E           assert 57 == 0

tests/test_parameters.py:64: AssertionError
----------------------------- Captured stdout call -----------------------------
num_parameters=84

So, for the 27 (out of 84) parameters that have a row_label list, that list is exactly as expected and as can be generated with the single statement suggested by T.J.: [str(start_year + i) for i in range(len(value))]

By the way, the above test is part of the parameters_api_redesign branch I am working on.

T.J. @talumbau and Amy @Amy-Xu, how do you want to proceed? It seems as if the webapp code should be changed to construct the row_label list first, and then the row_label attribute for each parameter in the params.json file could be deleted. Does this make sense?

Amy-Xu commented 9 years ago

I'll read this issue more thoroughly, but a few notes here:

  1. Regarding @martinholmer's question on why we have values for CTC_c beyond 2015: this parameter is for the celling of child tax credit. This limit will be decreased from 1000 to 500 in 2018, so we want to reflect this change in the parameter. For other parameter, we do not have any information on how they will look like beyond 2015, so no values/labels available.
  2. CTC_c and many other parameters in params.json are not included in TaxBrain right now. Eventually we are going to include all of them. I'm not quiet sure why we pick some of them but not others. @MattHJensen might have a better explanation for this.
MattHJensen commented 9 years ago

Some historical background:

The reason we originally included row_labels in params.json was to serve as meta data to make it easier for a human to read the list of values. The row_label attributes do create opportunity for confusion or inaccuracy, so perhaps it is better to remove them -- I do not have a strong opinion.

Amy said regarding which params we provide for TaxBrain users:

I'm not quiet sure why we pick some of them but not others. @MattHJensen might have a better explanation for this.

We just picked a subset that we thought were important to our users to start out with.

Amy-Xu commented 9 years ago

@talumbau There's no exception for any parameter regarding this: row_label = [str(start_year + i) for i in range(len(value))] If we want to fill out all the current empty row_labels, this formula will be the way to go. But this row_label attribute is not used in any calculation, and its existence here is merely for documentary purposes. I would propose to add documentation elsewhere to make the 'value' attribute clearer, rather than adding more stuff to row_labels.

martinholmer commented 9 years ago

Amy @Amy-Xu said:

If we want to fill out all the current empty row_labels, this formula will be the way to go. But this row_label attribute is not used in any calculation, and its existence here is merely for documentary purposes. I would propose to add documentation elsewhere to make the 'value' attribute clearer, rather than adding more stuff to row_labels.

OK, helping people have any easier time reading the params.json file is definitely an important goal. If the row_label list can help that, then let us keep that attribtute in the params.json file. We can enforce the consistency and completeness of row labels using a simple Parameters class test. Perhaps this is the better way to go: it gives consistency plus better params.json readabliltiy than simply dropping the attribute from the params.json file. How would you weigh the pros and cons of these two approaches, @MattHJensen , T.J. @talumbau , and @Amy-Xu ?

MattHJensen commented 9 years ago

@martinholmer, is this issue ready to be closed?

martinholmer commented 9 years ago

This discussion was useful in developing the redesign of the Parameters class API, which was implemented in pull request #343 that was committed to the master branch on September 8, 2015.