Data Dictionary of inputs and outputs

nmerket commented 1 year ago

One of the pain points we experience when importing ResStock runs into SightGlass (resstock.nrel.gov) is that the outputs, and the format of the outputs, from ResStock change frequently. This breaks our data processing and requires many hours of manual updating every time we bring in new data.

It would be really helpful to have a data dictionary of the outputs ResStock produces, meaning every column name (including the input and output columns) in the results.csv and timeseries parquet files. It should also include flags for which columns are end uses to include in the sum versus aggregates (net or totals), the units, and other miscellaneous outputs like load, emissions, etc. To keep it in sync, it should be verified against the CI runs of ResStock, so that any discrepancy gives you a big ❌ on your checks.

@rajeee @afontani @trynthink @ekpresent @joseph-robertson

joseph-robertson commented 1 year ago

Note that the current CI framework automatically populates tables of output column names based on results.csv, and uses them to build docs: https://github.com/NREL/resstock/blob/develop/test/util.py#L44-L68. In this way, column names are populated dynamically based on CI runs. Perhaps this is somewhat related to your issue here...?

nmerket commented 1 year ago

I see that in the docs now. I'd like to see this go further, into a machine-readable format (CSV, JSON, or similar) with some additional metadata that I could reference in my downstream processing code.

Here's my wishlist:

- Whether a column is energy, power, temperature, emissions, money, or something else
- The units of that column
- For energy columns, whether the column is an end use that will sum to the total, the total itself, or a net value
- A way to tie columns and end uses between timeseries and annual outputs. Ideally they'd just have the same name. For the most part they do, but there are some exceptions and some format differences.
- A notes field for when a column name is ambiguous or needs some context.

The metadata would require someone to maintain it rather than be automatically generated.

You could argue that a lot of that information is encoded in, or can be inferred from, the column names themselves. The problem I keep running into is that the naming has been, and continues to be, somewhat fluid. New outputs show up all the time and naming conventions change. That's what breaks things downstream for me. If I had a file like the one described above, I could use it to figure out what to pay attention to for SightGlass and what can be ignored for that use case, without the extensive special-case handling code that we have to employ now.
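
As a rough illustration of how a file like that could be consumed downstream, here is a minimal Python sketch. The CSV layout and its column names (`output_name`, `output_type`, `units`, `sums_to`, `notes`) are hypothetical and only mirror the wishlist above; no such file exists in ResStock as of this thread. The example rows are made up for illustration.

```python
import csv
import io

# Hypothetical data dictionary contents; in practice this would be a file
# maintained alongside ResStock. Column names and rows are assumptions.
DICTIONARY_CSV = """output_name,output_type,units,sums_to,notes
report_simulation_output.end_use_electricity_clothes_dryer_m_btu,energy,mbtu,total,
report_simulation_output.energy_use_total_m_btu,energy,mbtu,,Site total (illustrative name)
report_simulation_output.emissions_co2e_total_lb,emissions,lb,,Illustrative name
"""


def load_dictionary(text):
    """Return a mapping of output name -> metadata row."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["output_name"]: row for row in reader}


def end_use_columns(dictionary):
    """Names of energy columns flagged as end uses that sum to the total."""
    return [
        name
        for name, meta in dictionary.items()
        if meta["output_type"] == "energy" and meta["sums_to"] == "total"
    ]


dictionary = load_dictionary(DICTIONARY_CSV)
end_uses = end_use_columns(dictionary)
```

With something like this, downstream code would select columns by their flags instead of pattern-matching on names, so a renamed or newly added column only needs a dictionary update.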

afontani commented 1 year ago

@nmerket : Thanks for outlining this. This sounds like a standalone table (maybe in the resources/ folder) that gets joined to the outputs from CI to check that every output is there. It might be hard to capture every possible column if the end use is very rare. We currently simulate 350 models on CI (250 from project national and 100 from project testing), so if a column doesn't show up in those 350 models, the output might not get maintained very well.

One problem I see with the "way to tie columns and end uses between annual and timeseries" is that the units and names change between the timeseries and annual outputs. This could get cleaned up at some point, but for now we may just want to itemize what we have.

Is it worth doing for the characteristics too?

I could see a handful of tests here on the file: 1) During the join, are the columns in the results.csv also in the data dictionary? 2) Do the columns that are marked "sum to total" and "sum to net" actually sum to the "total" or "net"? 3) Are there missing entries in the data dictionary (e.g., make sure every output has a unit)?

I am probably missing some tests.

Maybe a table like this?

| output name | annual output | timeseries output | description | output type | output units | sums to total site energy | sums to net site energy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| report_simulation_output.end_use_electricity_clothes_dryer_m_btu | TRUE | FALSE | Electricity consumption of clothes dryers | energy | mbtu | TRUE | TRUE |
| report_simulation_output.end_use_natural_gas_heating_kbtu | FALSE | TRUE | Heating natural gas consumption | energy | kbtu | TRUE | TRUE |
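
The three checks listed above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the dictionary layout, the `sums_to_total` key, the total column name, and the sample row are all hypothetical, and the real test would read results.csv from a CI run.

```python
# Hypothetical dictionary rows, keyed by output name (assumed layout).
DATA_DICTIONARY = {
    "end_use_electricity_clothes_dryer_m_btu": {"units": "mbtu", "sums_to_total": True},
    "end_use_natural_gas_heating_m_btu": {"units": "mbtu", "sums_to_total": True},
    "energy_use_total_m_btu": {"units": "mbtu", "sums_to_total": False},
}
TOTAL_COLUMN = "energy_use_total_m_btu"  # illustrative name


def undocumented_columns(result_columns, dictionary):
    """Check 1: columns in results.csv that the dictionary does not list."""
    return sorted(set(result_columns) - set(dictionary))


def sums_to_total(row, dictionary, total_column, tol=1e-6):
    """Check 2: flagged end uses add up to the reported total for one row."""
    flagged = [name for name, meta in dictionary.items() if meta["sums_to_total"]]
    return abs(sum(row[name] for name in flagged) - row[total_column]) <= tol


def entries_missing_units(dictionary):
    """Check 3: dictionary entries with no unit recorded."""
    return sorted(name for name, meta in dictionary.items() if not meta.get("units"))


# One made-up results.csv row, including a column the dictionary lacks.
row = {
    "end_use_electricity_clothes_dryer_m_btu": 2.5,
    "end_use_natural_gas_heating_m_btu": 40.0,
    "energy_use_total_m_btu": 42.5,
    "new_mystery_output": 1.0,
}
```

A CI test could then fail whenever `undocumented_columns` or `entries_missing_units` returns a non-empty list, or `sums_to_total` is false for any simulated building.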

(@joseph-robertson edit: Changed "500 models on CI" to 350.)

afontani commented 1 year ago

From @joseph-robertson : Separate dictionaries for annual and timeseries? Map to annual from timeseries, map to timeseries from annual.

Test would be that the columns sum to the same value.
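
A minimal sketch of that test, assuming a hand-maintained mapping from each timeseries column to its annual counterpart. The mapping structure, the annual column name, and the sample values are hypothetical; the conversion factor assumes kBtu in the timeseries and MBtu (million Btu) in the annual output, with 1 MBtu = 1000 kBtu.

```python
import math

# Hypothetical mapping: timeseries column -> (annual column, unit factor).
# The factor converts the summed timeseries value into the annual units.
TIMESERIES_TO_ANNUAL = {
    "end_use_natural_gas_heating_kbtu": ("end_use_natural_gas_heating_m_btu", 1 / 1000),
}


def mismatched_columns(timeseries, annual, mapping, rel_tol=1e-6):
    """Return timeseries columns whose converted sum disagrees with the annual value."""
    mismatches = []
    for ts_name, (annual_name, factor) in mapping.items():
        ts_total = sum(timeseries[ts_name]) * factor
        if not math.isclose(ts_total, annual[annual_name], rel_tol=rel_tol):
            mismatches.append(ts_name)
    return mismatches


# Made-up values for a single building.
timeseries = {"end_use_natural_gas_heating_kbtu": [1000.0, 2500.0, 500.0]}
annual = {"end_use_natural_gas_heating_m_btu": 4.0}
mismatches = mismatched_columns(timeseries, annual, TIMESERIES_TO_ANNUAL)
```

The reverse map (annual to timeseries) could be derived by inverting the same table, which would keep the two dictionaries from drifting apart.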

From @shorowit : HPXML could have a data dictionary. That would be a starting point for ResStock.

joseph-robertson commented 1 year ago

Inputs

[ ] Updates to resstock-estimation's source_report.csv
  - Add columns:
    - Description
    - List of Options
    - Level
  - Move into project_national/resources/source_report.csv
[ ] Create new csv input data dictionary, based on source_report.csv with:
  - Only columns of interest (i.e., those that will appear in RTD)
  - Parameters underscore_cased like you'd find in the results.csv
[ ] Update tests to check CI results against the new input data dictionary
  - Missing or new inputs? Test failure. Need to update input data dictionary
  - Inputs match the data dictionary? Test passes. No updates needed.
[ ] Render input data dictionary in RTD
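
The input check described above might look something like this in Python. It is a sketch only: `underscore_case` is a simplified stand-in for however ResStock actually derives results.csv column names, any measure prefix on those columns is ignored, and the parameter names are just examples.

```python
import re


def underscore_case(name):
    """Simplified conversion: lowercase, with runs of other characters as '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


def compare_inputs(ci_inputs, dictionary_inputs):
    """Return (new, missing) input names relative to the data dictionary."""
    ci, documented = set(ci_inputs), set(dictionary_inputs)
    return sorted(ci - documented), sorted(documented - ci)


# Example parameter names as they might appear in the dictionary source.
dictionary_inputs = [underscore_case(n) for n in ["Vintage", "Geometry Building Type RECS"]]
# Example input columns as they might appear in CI results.
ci_inputs = ["vintage", "geometry_building_type_recs", "heating_fuel"]

new, missing = compare_inputs(ci_inputs, dictionary_inputs)
```

The test would fail if either `new` or `missing` is non-empty, which signals that the input data dictionary needs updating.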

Outputs

[ ] (Manually) create new csv output data dictionary (organized by measure) of all ResStock outputs
  - Add columns of interest, e.g.:
    - Annual Name
    - Annual Units (even though name implies them)
    - Timeseries Name
    - Timeseries Units
    - Sums To
    - Notes
[ ] Update tests to check CI results against the new output data dictionary
  - New nonzero outputs? Test failure. Need to update output data dictionary
  - Outputs match the data dictionary? Test passes. No updates needed.
[ ] Render output data dictionaries in RTD
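
For the "new nonzero outputs" check, a rough Python sketch follows. It assumes rows read from results.csv as strings (for example via `csv.DictReader`) and that every output column being checked is numeric; the column names and values are made up.

```python
def new_nonzero_outputs(results_rows, dictionary_names):
    """Columns absent from the dictionary that are nonzero in at least one row."""
    documented = set(dictionary_names)
    flagged = set()
    for row in results_rows:
        for name, value in row.items():
            if name in documented or value in (None, ""):
                continue
            # Assumes numeric outputs; a non-numeric value would raise here.
            if float(value) != 0.0:
                flagged.add(name)
    return sorted(flagged)


# Made-up rows and dictionary names for illustration.
dictionary_names = ["end_use_electricity_clothes_dryer_m_btu"]
results_rows = [
    {"end_use_electricity_clothes_dryer_m_btu": "2.5", "rare_end_use_m_btu": "0.0", "new_output_m_btu": ""},
    {"end_use_electricity_clothes_dryer_m_btu": "1.0", "rare_end_use_m_btu": "0.0", "new_output_m_btu": "3.2"},
]
unexpected = new_nonzero_outputs(results_rows, dictionary_names)
```

Ignoring all-zero columns means a rare end use that never appears in the CI sample would not trigger a failure until it actually produces a value.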

NREL / resstock

Data Dictionary of inputs and outputs #1043
