hdoupe commented 4 years ago

This PR re-writes the ParamTools query API so that it is more flexible and familiar to users in the pydata ecosystem:

from taxcalc import Policy # a Parameters-using class

pol = Policy()

pol.sel["EITC_c"]["year"] == 2020

# QueryResult([
#   {'value': 537.36, 'EIC': '0kids', 'year': 2020, '_auto': True}
#   {'value': 3581.71, 'EIC': '1kid', 'year': 2020, '_auto': True}
#   {'value': 5920.08, 'EIC': '2kids', 'year': 2020, '_auto': True}
#   {'value': 6660.6, 'EIC': '3+kids', 'year': 2020, '_auto': True}
# ])

It replaces the query backend added in PR #74 bringing with it 4 main advantages:

- Much simpler. This implementation is based directly on the ordered-list example in the Python documentation. By contrast, the search and update methods in the old tree.py module are complicated enough that, despite its detailed docstring, I still have trouble tracking down and fixing bugs in them.
- Much more flexible. Users can apply standard comparison and logical operators like &, <, etc. to build queries and chain query results together.
- Custom ordering functions. Users can define their own ordering functions if their values are not already orderable. For example, if you define a custom type that is a dictionary, then you will get this error if you try to sort a list of them:
```
sorted([
    {"a": 2, "b": 3},
    {"a": 1, "b": 2}
])
# TypeError: '<' not supported between instances of 'dict' and 'dict'
```
But if you supply a key to sort on, then Python can sort your list:
```
sorted([
    {"a": 2, "b": 3},
    {"a": 1, "b": 2}
], key=lambda x: (x["a"], x["b"]))
# [{'a': 1, 'b': 2}, {'a': 2, 'b': 3}]
```
The same idea is used in this PR.
- A familiar API. The API is inspired by the pandas .loc indexer. I considered copying .loc directly, but since the behavior is a little different (e.g. no slice or column selection behavior), I used sel as an abbreviation for the existing select_* based API. My intention is to use the same pattern without confusing users who may think that they are working with a dataframe.
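
To make the sorted-list idea concrete, here is a minimal sketch of how a key-sorted list can answer comparison queries with binary search. It is only an illustration under stated assumptions: `ToySortedList` and its `eq`, `lt`, and `gt` methods are hypothetical names for this example, not the classes added in this PR.

```python
import bisect


class ToySortedList:
    """Hypothetical toy: keeps values ordered by key(value) and answers
    comparison queries with binary search instead of a full scan."""

    def __init__(self, values, key):
        self.key = key
        self.values = sorted(values, key=key)
        # Parallel list of keys, so bisect only ever compares orderable keys.
        self.keys = [key(v) for v in self.values]

    def eq(self, value):
        k = self.key(value)
        lo = bisect.bisect_left(self.keys, k)
        hi = bisect.bisect_right(self.keys, k)
        return self.values[lo:hi]

    def lt(self, value):
        return self.values[: bisect.bisect_left(self.keys, self.key(value))]

    def gt(self, value):
        return self.values[bisect.bisect_right(self.keys, self.key(value)):]


# Dictionaries are not orderable on their own, but their keys are.
rules = ToySortedList(
    [
        {"life": 100, "method": "SL"},
        {"life": 5, "method": "DB 200%"},
        {"life": 39, "method": "SL"},
    ],
    key=lambda x: (x["life"], x["method"]),
)
target = {"life": 39, "method": "SL"}
matches = rules.eq(target)
shorter = rules.lt(target)
longer = rules.gt(target)
```

Keeping a parallel list of keys avoids relying on the `key` argument of the `bisect` functions, which is only available in newer Python versions.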

Here are 3 examples demonstrating the points above:

Use the sel attribute to query parameter values:

from taxcalc import Policy

pol = Policy()

pol.sel["STD"]["year"] == 2026

# QueryResult([
#   {'value': 7651.0, 'MARS': 'single', 'year': 2026}
#   {'value': 15303.0, 'MARS': 'mjoint', 'year': 2026}
#   {'value': 7651.0, 'MARS': 'mseparate', 'year': 2026}
#   {'value': 11266.0, 'MARS': 'headhh', 'year': 2026}
#   {'value': 15303.0, 'MARS': 'widow', 'year': 2026}
# ])

Chain together queries using Python logical operators and order of operations:

(
    (pol.sel["STD"]["_auto"] == True) & 
    ((pol.sel["STD"]["year"] >= 2025) | (pol.sel["STD"]["MARS"] == "single"))
)

# QueryResult([
#   {'value': 12392.76, 'MARS': 'single', 'year': 2020, '_auto': True}
#   {'value': 12662.92, 'MARS': 'single', 'year': 2021, '_auto': True}
#   {'value': 12950.37, 'MARS': 'single', 'year': 2022, '_auto': True}
#   {'value': 13249.52, 'MARS': 'single', 'year': 2023, '_auto': True}
#   {'value': 13543.66, 'MARS': 'single', 'year': 2024, '_auto': True}
#   {'value': 13836.2, 'MARS': 'single', 'year': 2025, '_auto': True}
#   {'value': 7805.55, 'MARS': 'single', 'year': 2027, '_auto': True}
#   {'value': 7961.66, 'MARS': 'single', 'year': 2028, '_auto': True}
#   {'value': 8119.3, 'MARS': 'single', 'year': 2029, '_auto': True}
#   {'value': 8280.87, 'MARS': 'single', 'year': 2030, '_auto': True}
#   {'value': 27672.42, 'MARS': 'mjoint', 'year': 2025, '_auto': True}
#   {'value': 15612.12, 'MARS': 'mjoint', 'year': 2027, '_auto': True}
#   {'value': 15924.36, 'MARS': 'mjoint', 'year': 2028, '_auto': True}
#   {'value': 16239.66, 'MARS': 'mjoint', 'year': 2029, '_auto': True}
#   {'value': 16562.83, 'MARS': 'mjoint', 'year': 2030, '_auto': True}
#   {'value': 13836.2, 'MARS': 'mseparate', 'year': 2025, '_auto': True}
#   {'value': 7805.55, 'MARS': 'mseparate', 'year': 2027, '_auto': True}
#   {'value': 7961.66, 'MARS': 'mseparate', 'year': 2028, '_auto': True}
#   {'value': 8119.3, 'MARS': 'mseparate', 'year': 2029, '_auto': True}
#   {'value': 8280.87, 'MARS': 'mseparate', 'year': 2030, '_auto': True}
#   {'value': 20811.01, 'MARS': 'headhh', 'year': 2025, '_auto': True}
#   {'value': 11493.57, 'MARS': 'headhh', 'year': 2027, '_auto': True}
#   {'value': 11723.44, 'MARS': 'headhh', 'year': 2028, '_auto': True}
#   {'value': 11955.56, 'MARS': 'headhh', 'year': 2029, '_auto': True}
#   {'value': 12193.48, 'MARS': 'headhh', 'year': 2030, '_auto': True}
#   {'value': 27672.42, 'MARS': 'widow', 'year': 2025, '_auto': True}
#   {'value': 15612.12, 'MARS': 'widow', 'year': 2027, '_auto': True}
#   {'value': 15924.36, 'MARS': 'widow', 'year': 2028, '_auto': True}
#   {'value': 16239.66, 'MARS': 'widow', 'year': 2029, '_auto': True}
#   {'value': 16562.83, 'MARS': 'widow', 'year': 2030, '_auto': True}
# ])

Define your own ordering function. CCC has a custom value type that is not orderable on its own, but with a custom ordering function it is:

# https://github.com/PSLmodels/Cost-of-Capital-Calculator/compare/master...hdoupe:pt-demo
class DepreciationRules(ma.Schema):
    # set some field validation ranges that can't be set in JSON
    life = ma.fields.Float(validate=ma.validate.Range(min=0, max=100))
    method = ma.fields.String(
        validate=ma.validate.OneOf(choices=[
            "SL", "Expensing", "DB 150%", "DB 200%", "Economic"])
    )

    def cmp_funcs(self):
        return {
            "key": lambda x: (x["life"], x["method"])
        }

from ccc.parameters import DepreciationParams

dp = DepreciationParams()

dp.sel["asset"]["value"] == {"life": 100, "method": "SL"}

# QueryResult([
#   {'BEA_code': 'LAND', 'ADS_life': 100.0, 'system': 'GDS', 'GDS_life': 100.0, 'value': {'method': 'SL', 'life': 100.0}, 'major_asset_group': 'Land', 'year': 2020, 'asset_name': 'Land', 'minor_asset_group': 'Land'}
#   {'BEA_code': 'INV', 'ADS_life': 100.0, 'system': 'GDS', 'GDS_life': 100.0, 'value': {'method': 'SL', 'life': 100.0}, 'major_asset_group': 'Inventories', 'year': 2020, 'asset_name': 'Inventories', 'minor_asset_group': 'Inventories'}
# ])
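
For readers wondering how comparisons such as `>=` and `==` can return objects that combine with `&` and `|`, the following is a rough sketch of the pattern using operator overloading. The `Slice` and `QueryResult` classes below are simplified stand-ins written for this example, and the rows are made-up toy data, so none of this should be read as the actual ParamTools code.

```python
class QueryResult:
    """Toy query result: a set of row positions into a shared list of rows."""

    def __init__(self, rows, index):
        self.rows = rows
        self.index = set(index)

    def __and__(self, other):
        return QueryResult(self.rows, self.index & other.index)

    def __or__(self, other):
        return QueryResult(self.rows, self.index | other.index)

    def as_list(self):
        return [self.rows[i] for i in sorted(self.index)]


class Slice:
    """Toy slice over one label; comparisons return a QueryResult."""

    def __init__(self, rows, label):
        self.rows = rows
        self.label = label

    def _where(self, predicate):
        hits = (i for i, row in enumerate(self.rows) if predicate(row[self.label]))
        return QueryResult(self.rows, hits)

    def __eq__(self, other):
        return self._where(lambda v: v == other)

    def __ge__(self, other):
        return self._where(lambda v: v >= other)


# Made-up rows, not real Tax-Calculator values.
rows = [
    {"MARS": "single", "year": 2024, "value": 1.0},
    {"MARS": "single", "year": 2025, "value": 2.0},
    {"MARS": "mjoint", "year": 2024, "value": 3.0},
    {"MARS": "mjoint", "year": 2025, "value": 4.0},
]
year = Slice(rows, "year")
mars = Slice(rows, "MARS")

either = (year >= 2025) | (mars == "single")
both = (year >= 2025) & (mars == "single")
```

The parentheses matter for the same reason they do in pandas: `&` and `|` bind more tightly than `==` and `>=` in Python.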

hdoupe commented 4 years ago

(This PR is not backwards-compatible right now, but it will be before it is merged.)

jdebacker commented 4 years ago

@hdoupe This looks really good- thanks for your work on ParamTools!

hdoupe commented 4 years ago

> @hdoupe This looks really good- thanks for your work on ParamTools!

Thanks @jdebacker!

hdoupe commented 4 years ago

With the latest commits:

- Internal uses of queries have been switched to the new API, including adjustments.
- select.py uses the new Query API but remains backwards compatible (except for niche uses of custom comparison or index-related comparison functions).
- tree.py is removed.
- Indexing is more flexible because the parameter values are now stored as a dictionary keyed by index instead of as a list.

Additions to the API:

Create Values from a list of values:

adjustment = {"myparam": [{"value": 1, "label": "someval"}]}
vals = params.sel[adjustment["myparam"]]  # returns a Values instance

Aggregation functions like union and intersection to combine many results:


params = WeatherParams()
queryresults = []
for label, value in {"temperature": "hot", "precipitation": "little", "wind": "variable"}.items():
    queryresults.append(params.sel["weather"][label] == value)

combined = intersection(queryresults)
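
As a rough illustration of what aggregation helpers like these do, the sketch below folds a sequence of results together with `&` or `|`. It assumes only that the results support those operators, and it uses plain Python sets as stand-ins for query results; the helpers in this PR may be implemented differently.

```python
from functools import reduce
import operator


def intersection(results):
    """Fold results together with &; works for anything defining __and__."""
    return reduce(operator.and_, results)


def union(results):
    """Fold results together with |; works for anything defining __or__."""
    return reduce(operator.or_, results)


# Sets of row positions stand in for query results here.
hot = {0, 1, 4}
little_rain = {1, 2, 4}
variable_wind = {1, 3, 4}

all_three = intersection([hot, little_rain, variable_wind])
any_of_them = union([hot, little_rain, variable_wind])
```

Because `reduce` accepts any iterable, a generator expression works as well as a list. Note that this sketch raises a `TypeError` when given an empty sequence.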

- A `QueryResult` is just a view on top of a `Values` object, but you can "persist" the subset of values returned in the query by converting the `QueryResult` into a `Values` object like `queryresult.as_values()`. This makes it possible to modify the underlying data:
```python
new_value = [
    {
        "temperature": "moderate",
        "precipitation": "heavy",
        "wind": "strong",
        "value": "hurricane",
    }
]

queryresults = params.sel["weather"]["precipitation"] == "heavy"

updated = queryresults.as_values().insert(new_value)
params.adjust({"weather": updated})
```

I also tested the API for both feel and correctness in Tax-Calculator. The tests pass locally in two configurations: as-is on the master branch, using the backwards-compatible select module, and with the new API here: https://github.com/hdoupe/Tax-Calculator/commit/4cae756c15b260c405ab21d90a345ba114e21710.

TODO:

- [x] Need to test the values module more thoroughly.
- [x] Think a little more about the API to see if there are ways to make usage more convenient and to make sure it feels intuitive.
  - Queries along multiple labels? Can these be done with one command, or do they need to be chained with intersection like this:
```
queryset = params.sel["some_param"]
queryset &= intersection(
    queryset.eq(strict=False, **{label: value})
    for label, value in other_labels.items()
)
```
  - Are the transitions from Values --> Slice --> QueryResult intuitive?
    - get Values: params.sel["my_param"]
    - get Slice: params.sel["my_param"]["some_label"]
    - get QueryResult: queryresult = params.sel["my_param"]["some_label"] > 1234
    - back to Values: new_values = queryresult.as_values()
- [x] Performance. So far, I've focused mostly on making sure the API is complete and works well, but I will need to improve performance before this is merged. Currently, this slows the Tax-Calculator tests down considerably (from ~40 seconds to ~200 seconds for one module). Fortunately, there's a lot that can be done with caching and smarter, more incremental updates to the underlying sorted-list data structure.

hdoupe commented 4 years ago

The latest commits improve performance so that it is now better than the current master branch. This is done by:

Removing an unnecessary copy.deepcopy which was responsible for almost 50% of the load time when creating Tax-Calculator's Policy object.
Updating SortedKeyList and Values to support inserting new values without having to re-build the underlying data structures.
Using a set operation to find value objects that are missing a label instead of looping over them and their labels.
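
The last point can be pictured with a small sketch. It assumes an index from each label name to the positions of the value objects that carry it, so that the ones missing a label fall out of a single set difference. The `label_index` structure and the `missing` helper are hypothetical names for this example, and the actual bookkeeping in ParamTools may look different.

```python
values = [
    {"year": 2020, "MARS": "single", "value": 1},
    {"year": 2021, "value": 2},
    {"MARS": "mjoint", "value": 3},
]

# Built once: label name -> positions of the value objects that carry it.
label_index = {}
for position, value_object in enumerate(values):
    for label in value_object:
        if label != "value":
            label_index.setdefault(label, set()).add(position)

all_positions = set(range(len(values)))


def missing(label):
    """Positions of value objects that do not define `label`."""
    return all_positions - label_index.get(label, set())


missing_mars = missing("MARS")
missing_year = missing("year")
```

Once the index exists, each lookup is one set difference rather than a nested loop over every value object and every label.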

hdoupe commented 4 years ago

Just dropped the WIP tag on PR #114. I'm planning to merge once I add documentation for the new query features.

hdoupe commented 4 years ago

Latest commits add:

Indexing for the new Values, Slice, and QueryResult objects:
Adds docs for the new features in this PR and accessing parameter values in general. I'm planning to work back through the docs after this PR to use this example (or similar) for the rest of the documentation.
Fixes some date type related bugs. Now you can use month for the step argument on range validators for Date:
```
import paramtools


class Params(paramtools.Parameters):
    defaults = {
        "schema": {
            "labels": {
                "date": {
                    "type": "date",
                    "validators": {
                        "range": {"min": "2020-01-01", "max": "2021-01-01", "step": {"months": 1}}
                    }
                }
            },
        },
        "a": {
            "title": "A",
            "type": "int",
            "value": [{"date": "2020-01-01", "value": 2}, {"date": "2020-10-01", "value": 8},]
        },
        "b": {
            "title": "B",
            "type": "float",
            "value": [{"date": "2020-01-01", "value": 10.5}]
        }
    }


params = Params(label_to_extend="date")
params.sel["a"]

# Values([
#   {'date': datetime.date(2020, 1, 1), 'value': 2},
#   {'date': datetime.date(2020, 2, 1), 'value': 2, '_auto': True},
#   {'date': datetime.date(2020, 3, 1), 'value': 2, '_auto': True},
#   {'date': datetime.date(2020, 4, 1), 'value': 2, '_auto': True},
#   {'date': datetime.date(2020, 5, 1), 'value': 2, '_auto': True},
#   {'date': datetime.date(2020, 6, 1), 'value': 2, '_auto': True},
#   {'date': datetime.date(2020, 7, 1), 'value': 2, '_auto': True},
#   {'date': datetime.date(2020, 8, 1), 'value': 2, '_auto': True},
#   {'date': datetime.date(2020, 9, 1), 'value': 2, '_auto': True},
#   {'date': datetime.date(2020, 10, 1), 'value': 8},
#   {'date': datetime.date(2020, 11, 1), 'value': 8, '_auto': True},
#   {'date': datetime.date(2020, 12, 1), 'value': 8, '_auto': True},
#   {'date': datetime.date(2021, 1, 1), 'value': 8, '_auto': True},
# ])
```
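
The output above can be understood as a forward fill over a monthly grid: every month between the range's min and max gets the most recent explicitly set value, and filled-in entries are flagged with `_auto`. The sketch below reproduces that behavior with the standard library only. `month_range` and `extend` are hypothetical helpers written for this illustration, they assume the day of the month exists in every month (as the 1st does), and they are not how ParamTools implements extension.

```python
import datetime


def month_range(start, stop, months=1):
    """Yield dates from start to stop (inclusive), stepping by whole months."""
    current = start
    while current <= stop:
        yield current
        # Count months since year 0 so December rolls over into January.
        total = current.year * 12 + (current.month - 1) + months
        current = current.replace(year=total // 12, month=total % 12 + 1)


def extend(values, start, stop):
    """Forward-fill explicitly set values along the monthly grid."""
    known = {v["date"]: v["value"] for v in values}
    extended, last = [], None
    for date in month_range(start, stop):
        if date in known:
            last = known[date]
            extended.append({"date": date, "value": last})
        elif last is not None:
            # Filled-in entries are marked the same way as in the output above.
            extended.append({"date": date, "value": last, "_auto": True})
    return extended


a = extend(
    [
        {"date": datetime.date(2020, 1, 1), "value": 2},
        {"date": datetime.date(2020, 10, 1), "value": 8},
    ],
    start=datetime.date(2020, 1, 1),
    stop=datetime.date(2021, 1, 1),
)
```

With these inputs, `a` has 13 entries, one per month from January 2020 through January 2021, matching the shape of the `Values` output shown above.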

hdoupe commented 4 years ago

Latest commits:

- Update SortedKeyList to use sortedcontainers. Now the low-level bisect_left and bisect_right usage is handled by sortedcontainers. This helps ParamTools focus on the higher-level query APIs and gives it a fast engine for queries.
- Fixes some bugs that were found by testing this version of ParamTools against Tax-Cruncher, Tax-Brain, and Cost-of-Capital-Calculator:
  - Custom (or nested) fields are dumped to a JSON string if the cmp_funcs method is not defined.
  - Deprecation warning added for the exact_match keyword.
  - Throws SortedKeyListException if unable to create the sortedcontainers.SortedKeyList object.
- Minor performance improvements through smarter caching with sel.
- Readability improvements in the sort_values method.
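
One way to picture the JSON fallback for custom or nested fields is sketched below. This is a guess at the general idea, not the code in this PR: `sort_key` and its `cmp_key` argument are hypothetical names, and the only assumption is that a value without a custom ordering function is serialized to a string so that it can still be sorted and compared.

```python
import json


def sort_key(value, cmp_key=None):
    """Return something orderable for `value`."""
    if cmp_key is not None:
        # A user-supplied ordering function takes precedence.
        return cmp_key(value)
    if isinstance(value, (dict, list)):
        # sort_keys makes the string independent of dict insertion order.
        return json.dumps(value, sort_keys=True)
    return value


rules = [{"method": "SL", "life": 100.0}, {"life": 5.0, "method": "DB 200%"}]

by_json = sorted(rules, key=sort_key)
by_life = sorted(
    rules,
    key=lambda v: sort_key(v, cmp_key=lambda x: (x["life"], x["method"])),
)
```

The two orderings differ: the JSON strings compare character by character, so a life of 100.0 sorts before 5.0, while the custom key compares the numbers themselves. That difference is a reason to define a custom ordering function whenever the order is meaningful, not just equality.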

PSLmodels / ParamTools

Pandas-like query api #114
