catalyst-cooperative / pudl-catalog

An Intake catalog for distributing open energy system data liberated by Catalyst Cooperative.
https://catalyst.coop/pudl/
MIT License

Constrain allowable years and states for filtering #9

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

The EPA CEMS dataset is composed of ~1300 row groups, each containing a unique combination of year and state, which allows efficient pushdown filtering by time and location. Only a certain range of years (1995-2020) and set of state abbreviations (continental US plus DC) are valid for filtering. It would be nice if we could at least suggest, and preferably require, that users only filter with valid values. Then a request for something outside the allowable values would raise an error immediately, rather than making the user wait a long time for a query that won't return anything useful.
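To make the desired behavior concrete, here is a minimal sketch of the kind of up-front check described above, written as plain Python and independent of Intake. The names `VALID_YEARS`, `VALID_STATES`, and `check_filter_values` are hypothetical, and the state list assumes "continental US" means the 48 contiguous states (26 years × 49 states = 1274 combinations, roughly in line with the ~1300 row groups).

```python
# Assumed valid values, taken from the issue description.
VALID_YEARS = frozenset(range(1995, 2021))  # 1995-2020 inclusive

# Assumption: "continental US" is read as the 48 contiguous states, plus DC.
VALID_STATES = frozenset(
    "AL AR AZ CA CO CT DC DE FL GA IA ID IL IN KS KY LA MA MD ME MI MN MO MS "
    "MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY"
    .split()
)


def check_filter_values(years, states):
    """Raise ValueError if any requested year or state is not allowed."""
    bad_years = sorted(set(years) - VALID_YEARS)
    bad_states = sorted(set(states) - VALID_STATES)
    problems = []
    if bad_years:
        problems.append(f"invalid years: {bad_years}")
    if bad_states:
        problems.append(f"invalid states: {bad_states}")
    if problems:
        # Fail fast, before any slow query is attempted.
        raise ValueError("; ".join(problems))
```

Calling `check_filter_values([2019, 2020], ["CO", "TX"])` would pass silently, while `check_filter_values([1990], ["CO"])` would raise a `ValueError` before any data is read.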

Is this easy to set up with the intake catalog? Can we designate an allowable set of values for years and states to be used as filters? How are user parameters meant to be used? I've seen that you can enumerate allowable values there, but they seem only to be for use in Jinja templating of the filenames, and not for things like the filters.

zaneselvans commented 2 years ago

The parameters don't appear to be usable this way: they seem able to select only a single file path at a time. If we pass the DNF filters through to Dask/Pandas, we won't be able to constrain the allowable values. See this comment and this example.
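For reference, DNF filters of the kind mentioned above are just nested Python lists: an outer list of OR-ed groups, each an inner list of AND-ed `(column, operator, value)` tuples. The sketch below builds one group per year/state pair. The helper name `build_dnf_filters` is hypothetical, and the column names `year` and `state` are assumptions about the Parquet schema.

```python
from itertools import product


def build_dnf_filters(years, states, year_col="year", state_col="state"):
    """Build DNF filters selecting every (year, state) combination.

    The result is a list of AND-groups that are OR-ed together.
    Column names are assumptions, not confirmed against the dataset.
    """
    return [
        [(year_col, "=", year), (state_col, "=", state)]
        for year, state in product(years, states)
    ]


filters = build_dnf_filters([2019, 2020], ["CO", "TX"])
```

A list like this would then be handed to the reader, for example through the `filters` argument of `pandas.read_parquet` with the pyarrow engine. Nothing in that path checks the values, which is why an out-of-range year or state simply yields an empty result instead of an error.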

zaneselvans commented 2 years ago

Closing this as it doesn't seem to be workable.