NREL / resstock

Highly granular modeling of residential building stocks at national, regional, and local scales using OpenStudio/EnergyPlus.
https://resstock.nrel.gov

Data Dictionary of inputs and outputs #1043

Closed nmerket closed 1 year ago

nmerket commented 1 year ago

One of the pain points we experience when importing ResStock runs into SightGlass (resstock.nrel.gov) is that the outputs, and the format of the outputs, from ResStock frequently change. This breaks our data processing and requires many hours of manual updating every time we bring new data in.

It would be really helpful to have a data dictionary of the outputs ResStock produces, meaning every column name (including the input and output columns) in the results.csv and timeseries parquet files. It should also include flags for which columns are end uses to include in the sum versus aggregates (net or totals), units, and other outputs such as load, emissions, etc. To keep this in sync, it should be verified against the CI runs of ResStock, and if there is a discrepancy you get a big ❌ on your checks.

@rajeee @afontani @trynthink @ekpresent @joseph-robertson

joseph-robertson commented 1 year ago

Note that the current CI framework automatically populates tables of output column names based on results.csv, and uses them to build docs: https://github.com/NREL/resstock/blob/develop/test/util.py#L44-L68. In this way, column names are populated dynamically based on CI runs. Perhaps this is somewhat related to your issue here...?

nmerket commented 1 year ago

I see that in the docs now. I'd like to see this go further, into a machine-readable format (csv or json or something) with some additional metadata that I could reference in my downstream processing code.

Here's my wishlist:

The metadata would require someone to maintain it rather than be automatically generated.

You could argue that a lot of that information is encoded in, or can be inferred from, the column names themselves. The problem I keep running into is that the naming has been, and continues to be, somewhat fluid. New outputs show up all the time and naming conventions change. That's what breaks things downstream for me. If I had a file like the one described above, I could use it to figure out what to pay attention to for SightGlass and what can be ignored for that use case, without the extensive special-case catching code that we have to employ now.
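A minimal sketch of how a file like that might be consumed downstream. The CSV layout (`output_name`, `units`, `sums_to_total`) and every column name other than the clothes dryer one are hypothetical, chosen only for illustration; nothing here is an existing ResStock file or API.

```python
import csv
import io

# Hypothetical data dictionary contents. The schema and the "example_*"
# names are illustrative assumptions, not real ResStock outputs.
DICTIONARY_CSV = """output_name,units,sums_to_total
report_simulation_output.end_use_electricity_clothes_dryer_m_btu,mbtu,TRUE
report_simulation_output.example_total_m_btu,mbtu,FALSE
"""


def load_dictionary(text):
    """Return {output_name: row} for every entry in the dictionary CSV."""
    return {row["output_name"]: row for row in csv.DictReader(io.StringIO(text))}


def split_columns(result_columns, dictionary):
    """Split result columns into end uses, other known outputs, and unknowns."""
    end_uses, other_known, unknown = [], [], []
    for name in result_columns:
        entry = dictionary.get(name)
        if entry is None:
            unknown.append(name)  # not documented: flag instead of guessing
        elif entry["sums_to_total"] == "TRUE":
            end_uses.append(name)
        else:
            other_known.append(name)
    return end_uses, other_known, unknown


dictionary = load_dictionary(DICTIONARY_CSV)
columns = [
    "report_simulation_output.end_use_electricity_clothes_dryer_m_btu",
    "report_simulation_output.example_total_m_btu",
    "report_simulation_output.example_new_output",
]
end_uses, other_known, unknown = split_columns(columns, dictionary)
print("end uses:", end_uses)
print("other known:", other_known)
print("unknown:", unknown)
```

With a lookup like this, a newly added output lands in `unknown` and can be ignored or reported, rather than tripping special-case code keyed on naming patterns.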

afontani commented 1 year ago

@nmerket : Thanks for outlining this. This sounds like a standalone table (maybe in the resources/ folder) that gets joined to the outputs from CI to check that every output is there. It might be hard to get every possible column if the end use is very rare. We simulate 350 models on CI currently (250 from project national and 100 from project testing), so if a column doesn't show up in those 350 models, the output might not get maintained very well.

One problem I see in the "way to tie columns and enduses between annual and timeseries" is that the units and names change between the timeseries and annual outputs. This could get cleaned up at some point, but for now we may just want to itemize what we have.

Is it worth doing for the characteristics too?

I could see a handful of tests here on the file: 1) During the join, are all columns in the results.csv also in the data dictionary? 2) Do the columns that are marked "sum to total" and "sum to net" actually sum to the "total" or "net"? 3) Are there missing entries in the data dictionary (make sure every output has a unit)?

I am probably missing some tests.

Maybe a table like this?

| output name | annual output | timeseries output | description | output type | output units | sums to total site energy | sums to net site energy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| report_simulation_output.end_use_electricity_clothes_dryer_m_btu | TRUE | FALSE | Electricity consumption of clothes dryers | energy | mbtu | TRUE | TRUE |
| report_simulation_output.end_use_natural_gas_heating_kbtu | FALSE | TRUE | Heating natural gas consumption | energy | kbtu | TRUE | TRUE |
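A rough sketch of the three checks listed above, run against rows shaped like this table. Only the clothes dryer column name comes from the example; the other column names, the dictionary keys, and all numeric values are made up for illustration.

```python
import math

PREFIX = "report_simulation_output."

# Hypothetical dictionary rows following the table layout above. The last
# entry deliberately has no units so that check 3 has something to find.
DICTIONARY = [
    {"output_name": PREFIX + "end_use_electricity_clothes_dryer_m_btu",
     "units": "mbtu", "sums_to_total": True},
    {"output_name": PREFIX + "example_heating_m_btu",
     "units": "mbtu", "sums_to_total": True},
    {"output_name": PREFIX + "example_total_m_btu",
     "units": "", "sums_to_total": False},
]


def undocumented_columns(result_columns, dictionary):
    """Check 1: result columns that have no entry in the data dictionary."""
    known = {entry["output_name"] for entry in dictionary}
    return sorted(set(result_columns) - known)


def entries_missing_units(dictionary):
    """Check 3: dictionary entries whose units field is empty."""
    return [e["output_name"] for e in dictionary if not e["units"].strip()]


def sums_match(row, dictionary, total_column, rel_tol=1e-6):
    """Check 2: flagged end uses add up to the reported total for one row."""
    flagged = [e["output_name"] for e in dictionary if e["sums_to_total"]]
    summed = sum(row[name] for name in flagged if name in row)
    return math.isclose(summed, row[total_column], rel_tol=rel_tol)


# One made-up results.csv row, already parsed to floats.
row = {
    PREFIX + "end_use_electricity_clothes_dryer_m_btu": 2.5,
    PREFIX + "example_heating_m_btu": 40.0,
    PREFIX + "example_total_m_btu": 42.5,
    PREFIX + "example_new_output": 1.0,
}

print(undocumented_columns(row.keys(), DICTIONARY))
print(entries_missing_units(DICTIONARY))
print(sums_match(row, DICTIONARY, PREFIX + "example_total_m_btu"))
```

In CI, a non-empty result from the first or third function, or a `False` from the second, would be the condition that fails the check.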

(@joseph-robertson edit: Changed "500 models on CI" to 350.)

afontani commented 1 year ago

From @joseph-robertson : Separate dictionaries for annual and timeseries? Map to annual from timeseries, map to timeseries from annual.

The test would be that the mapped columns sum to the same value.
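A sketch of what that mapping and test could look like, assuming the timeseries column is reported in kBtu and its annual counterpart in million Btu (so 1 MBtu = 1000 kBtu). The annual column name is assumed by analogy with the example names in this thread, and the values are invented.

```python
import math

# Hypothetical mapping from a timeseries column to its annual counterpart.
# The annual name is an assumption, not confirmed against ResStock output.
TIMESERIES_TO_ANNUAL = {
    "report_simulation_output.end_use_natural_gas_heating_kbtu":
        "report_simulation_output.end_use_natural_gas_heating_m_btu",
}
# The reverse direction falls out of the same table.
ANNUAL_TO_TIMESERIES = {v: k for k, v in TIMESERIES_TO_ANNUAL.items()}

KBTU_PER_MBTU = 1000.0  # assumes "m_btu" means million Btu


def mismatched_columns(timeseries, annual, rel_tol=1e-6):
    """Return timeseries columns whose summed values disagree with annual."""
    mismatches = []
    for ts_name, annual_name in TIMESERIES_TO_ANNUAL.items():
        ts_total_mbtu = sum(timeseries[ts_name]) / KBTU_PER_MBTU
        if not math.isclose(ts_total_mbtu, annual[annual_name], rel_tol=rel_tol):
            mismatches.append(ts_name)
    return mismatches


# Made-up values: three timesteps in kBtu, one annual value in MBtu.
timeseries = {
    "report_simulation_output.end_use_natural_gas_heating_kbtu":
        [1500.0, 2500.0, 1000.0],
}
annual = {"report_simulation_output.end_use_natural_gas_heating_m_btu": 5.0}

print(mismatched_columns(timeseries, annual))
```

Keeping a single mapping and deriving the reverse from it avoids maintaining two dictionaries that can drift apart, though separate files per output type would work just as well.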

From @shorowit : HPXML could have a data dictionary. That would be a starting point for ResStock.

joseph-robertson commented 1 year ago

Inputs

Outputs