PSLmodels / ParamTools

Library for parameter processing and validation with a focus on computational modeling projects
https://paramtools.dev
MIT License
19 stars 14 forks source link

Pandas-like query api #114

Closed hdoupe closed 4 years ago

hdoupe commented 4 years ago

This PR re-writes the ParamTools query API so that it is more flexible and familiar to users in the pydata ecosystem:

from taxcalc import Policy # a Parameters-using class

pol = Policy()

pol.sel["EITC_c"]["year"] == 2020

# QueryResult([
#   {'value': 537.36, 'EIC': '0kids', 'year': 2020, '_auto': True}
#   {'value': 3581.71, 'EIC': '1kid', 'year': 2020, '_auto': True}
#   {'value': 5920.08, 'EIC': '2kids', 'year': 2020, '_auto': True}
#   {'value': 6660.6, 'EIC': '3+kids', 'year': 2020, '_auto': True}
# ])

It replaces the query backend added in PR #74 bringing with it 4 main advantages:

  1. Much simpler. This implementation is based directly off of the ordered-list example in the Python Documentation. Despite the detailed docstring in the tree.py module, I still have trouble tracking down and fixing bugs in the complicated search and update methods.

  2. Much more flexible. Users can apply standard comparison and logical operators like &, <, etc. to query and chain together query results.

  3. Custom ordering functions. Users can define their own ordering functions if their values are not already orderable. For example, if you define a custom type that is a dictionary, then you will get this error if you try to sort a list of them:

    sorted([
        {"a": 2, "b": 3},
        {"a": 1, "b": 2}
    ])
    # TypeError: '<' not supported between instances of 'dict' and 'dict'

    But if you supply a key to sort on then Python can sort your list:

    sorted([
        {"a": 2, "b": 3},
        {"a": 1, "b": 2}
    ], key=lambda x: (x["a"], x["b"]))
    # [{'a': 1, 'b': 2}, {'a': 2, 'b': 3}]

    The same idea is used in this PR.

  4. A familiar API. The API is inspired by the Pandas .loc function. I considered directly copying the loc function, but since the behavior is a little different (e.g. no slice or column selection behavior), I used sel as an abbreviation for the existing select_* based API. My intention is to use the same pattern without confusing users who may think that they are working with a dataframe.

Here are 3 examples demonstrating the points above:

  1. Use the sel attribute to query parameter values:
from taxcalc import Policy

pol = Policy()

pol.sel["STD"]["year"] == 2026

# QueryResult([
#   {'value': 7651.0, 'MARS': 'single', 'year': 2026}
#   {'value': 15303.0, 'MARS': 'mjoint', 'year': 2026}
#   {'value': 7651.0, 'MARS': 'mseparate', 'year': 2026}
#   {'value': 11266.0, 'MARS': 'headhh', 'year': 2026}
#   {'value': 15303.0, 'MARS': 'widow', 'year': 2026}
# ])
  1. Chain together queries using Python logical operators and order of operations:
(
    (pol.sel["STD"]["_auto"] == True) & 
    ((pol.sel["STD"]["year"] >= 2025) | (pol.sel["STD"]["MARS"] == "single"))
)

# QueryResult([
#   {'value': 12392.76, 'MARS': 'single', 'year': 2020, '_auto': True}
#   {'value': 12662.92, 'MARS': 'single', 'year': 2021, '_auto': True}
#   {'value': 12950.37, 'MARS': 'single', 'year': 2022, '_auto': True}
#   {'value': 13249.52, 'MARS': 'single', 'year': 2023, '_auto': True}
#   {'value': 13543.66, 'MARS': 'single', 'year': 2024, '_auto': True}
#   {'value': 13836.2, 'MARS': 'single', 'year': 2025, '_auto': True}
#   {'value': 7805.55, 'MARS': 'single', 'year': 2027, '_auto': True}
#   {'value': 7961.66, 'MARS': 'single', 'year': 2028, '_auto': True}
#   {'value': 8119.3, 'MARS': 'single', 'year': 2029, '_auto': True}
#   {'value': 8280.87, 'MARS': 'single', 'year': 2030, '_auto': True}
#   {'value': 27672.42, 'MARS': 'mjoint', 'year': 2025, '_auto': True}
#   {'value': 15612.12, 'MARS': 'mjoint', 'year': 2027, '_auto': True}
#   {'value': 15924.36, 'MARS': 'mjoint', 'year': 2028, '_auto': True}
#   {'value': 16239.66, 'MARS': 'mjoint', 'year': 2029, '_auto': True}
#   {'value': 16562.83, 'MARS': 'mjoint', 'year': 2030, '_auto': True}
#   {'value': 13836.2, 'MARS': 'mseparate', 'year': 2025, '_auto': True}
#   {'value': 7805.55, 'MARS': 'mseparate', 'year': 2027, '_auto': True}
#   {'value': 7961.66, 'MARS': 'mseparate', 'year': 2028, '_auto': True}
#   {'value': 8119.3, 'MARS': 'mseparate', 'year': 2029, '_auto': True}
#   {'value': 8280.87, 'MARS': 'mseparate', 'year': 2030, '_auto': True}
#   {'value': 20811.01, 'MARS': 'headhh', 'year': 2025, '_auto': True}
#   {'value': 11493.57, 'MARS': 'headhh', 'year': 2027, '_auto': True}
#   {'value': 11723.44, 'MARS': 'headhh', 'year': 2028, '_auto': True}
#   {'value': 11955.56, 'MARS': 'headhh', 'year': 2029, '_auto': True}
#   {'value': 12193.48, 'MARS': 'headhh', 'year': 2030, '_auto': True}
#   {'value': 27672.42, 'MARS': 'widow', 'year': 2025, '_auto': True}
#   {'value': 15612.12, 'MARS': 'widow', 'year': 2027, '_auto': True}
#   {'value': 15924.36, 'MARS': 'widow', 'year': 2028, '_auto': True}
#   {'value': 16239.66, 'MARS': 'widow', 'year': 2029, '_auto': True}
#   {'value': 16562.83, 'MARS': 'widow', 'year': 2030, '_auto': True}
# ])
  1. Define your own ordering function. CCC has a custom value type that is not orderable on its own, but with a custom ordering function it is:
# https://github.com/PSLmodels/Cost-of-Capital-Calculator/compare/master...hdoupe:pt-demo
class DepreciationRules(ma.Schema):
    # set some field validation ranges that can't set in JSON
    life = ma.fields.Float(validate=ma.validate.Range(min=0, max=100))
    method = ma.fields.String(
        validate=ma.validate.OneOf(choices=[
            "SL", "Expensing", "DB 150%", "DB 200%", "Economic"])
    )

    def cmp_funcs(self):
        return {
            "key": lambda x: (x["life"], x["method"])
        }
from ccc.parameters import DepreciationParams

dp = DepreciationParams()

dp.sel["asset"]["value"] == {"life": 100, "method": "SL"}

# QueryResult([
#   {'BEA_code': 'LAND', 'ADS_life': 100.0, 'system': 'GDS', 'GDS_life': 100.0, 'value': {'method': 'SL', 'life': 100.0}, 'major_asset_group': 'Land', 'year': 2020, 'asset_name': 'Land', 'minor_asset_group': 'Land'}
#   {'BEA_code': 'INV', 'ADS_life': 100.0, 'system': 'GDS', 'GDS_life': 100.0, 'value': {'method': 'SL', 'life': 100.0}, 'major_asset_group': 'Inventories', 'year': 2020, 'asset_name': 'Inventories', 'minor_asset_group': 'Inventories'}
# ])
hdoupe commented 4 years ago

(This PR is not backwards-compatible right now, but it will be before it is merged.)

jdebacker commented 4 years ago

@hdoupe This looks really good- thanks for your work on ParamTools!

hdoupe commented 4 years ago

@hdoupe This looks really good- thanks for your work on ParamTools!

Thanks @jdebacker!

hdoupe commented 4 years ago

With the latest commits:

Also, I tested the API both for feel and for correctness in Tax-Calculator where tests passed locally both as-is using the backwards compatible select module on the master branch and with the new api here: https://github.com/hdoupe/Tax-Calculator/commit/4cae756c15b260c405ab21d90a345ba114e21710.

TODO:

hdoupe commented 4 years ago

The latest commits improve performance to be better than current master branch. This is done by:

hdoupe commented 4 years ago

Just dropped the WIP tag on PR #114. I'm planning to merge once I add documentation for the new query features.

hdoupe commented 4 years ago

Latest commits add:

class Params(paramtools.Parameters): defaults = { "schema": { "labels": { "date": { "type": "date", "validators": { "range": {"min": "2020-01-01", "max": "2021-01-01", "step": {"months": 1}} } } }, }, "a": { "title": "A", "type": "int", "value": [{"date": "2020-01-01", "value": 2}, {"date": "2020-10-01", "value": 8},] }, "b": { "title": "B", "type": "float", "value": [{"date": "2020-01-01", "value": 10.5}] } } params = Params(label_to_extend="date") params.sel["a"]

Values([

{'date': datetime.date(2020, 1, 1), 'value': 2},

{'date': datetime.date(2020, 2, 1), 'value': 2, '_auto': True},

{'date': datetime.date(2020, 3, 1), 'value': 2, '_auto': True},

{'date': datetime.date(2020, 4, 1), 'value': 2, '_auto': True},

{'date': datetime.date(2020, 5, 1), 'value': 2, '_auto': True},

{'date': datetime.date(2020, 6, 1), 'value': 2, '_auto': True},

{'date': datetime.date(2020, 7, 1), 'value': 2, '_auto': True},

{'date': datetime.date(2020, 8, 1), 'value': 2, '_auto': True},

{'date': datetime.date(2020, 9, 1), 'value': 2, '_auto': True},

{'date': datetime.date(2020, 10, 1), 'value': 8},

{'date': datetime.date(2020, 11, 1), 'value': 8, '_auto': True},

{'date': datetime.date(2020, 12, 1), 'value': 8, '_auto': True},

{'date': datetime.date(2021, 1, 1), 'value': 8, '_auto': True},

])

hdoupe commented 4 years ago

Latest commits: