NREL / buildstockbatch

Other
22 stars 14 forks source link

Soften buildstock.csv validation for ComStock compatibility #384

Closed asparke2 closed 1 year ago

asparke2 commented 1 year ago

Soften buildstock.csv validation to throw warnings instead of errors for ComStock compatibility. The buildstock.csv file in ComStock contains some columns that are intermediate steps in the sampling process which are intentionally not used for simulation. We keep these intermediate sampling steps because they help explain the sampling process, and we join some of these into the results.csv data because they are useful to users of the dataset.

shorowit commented 1 year ago

I'm not exactly aware of the situation here and so may be completely misinterpreting what's going on here... But note that ResStock also has columns in the buildstock.csv that are not used for simulation and are only used for sampling. These are often called meta-parameters and they are still added to the options_lookup.tsv to prevent validation errors. (For example, HVAC Cooling Type doesn't provide direct arguments for an OpenStudio measure, rather it's only used as a dependency for downstream parameters in sampling.) Is it problematic for ComStock to likewise just add these meta-parameters to the options_lookup.tsv? Simply bypassing the validation can obviously mean that legitimate errors are missed.

asparke2 commented 1 year ago

@rajeee I think this looks good with your proposed change. For ComStock, we'll figure out how to add checks later

github-actions[bot] commented 1 year ago

File Coverage
All files 85% :white_check_mark:
base.py 89% :white_check_mark:
eagle.py 79% :white_check_mark:
exc.py 57% :white_check_mark:
local.py 50% :white_check_mark:
postprocessing.py 84% :white_check_mark:
utils.py 96% :white_check_mark:
sampler/base.py 79% :white_check_mark:
sampler/downselect.py 33% :white_check_mark:
sampler/precomputed.py 93% :white_check_mark:
sampler/residential_quota.py 61% :white_check_mark:
test/test_docker.py 33% :white_check_mark:
test/test_validation.py 97% :white_check_mark:
workflow_generator/base.py 90% :white_check_mark:
workflow_generator/commercial.py 53% :white_check_mark:
workflow_generator/residential_hpxml.py 84% :white_check_mark:

Minimum allowed coverage is 33%

Generated by :monkey: cobertura-action against 1bcecac3c02ccb19c8871f100259534aa44816b4

nmerket commented 1 year ago

@asparke2 This looks fine to me. I would've preferred for that validation to be inside the sampler class for better organization, but I see why it wasn't. My only comment echoes @shorowit's above. Any thoughts about that?

asparke2 commented 1 year ago

@nmerket @shorowit I took at look at implementing your suggestion of adding no-op options_lookup.tsv rows for the unused buildstock.csv columns. This would require adding ~70,000 additional rows to the options_lookup.tsv, ~60k of which are just the unique set of census tract IDs. I think this would make that file even more difficult to edit. So my preference at this point is to just skip the validation for ComStock.

shorowit commented 1 year ago

Would it make more sense to leave the validation on but issue a warning for each column that is not found? It may be 70k rows, but probably not that many columns. ResStock similarly has thousands of no-op rows in their options_lookup.tsv, so they might also be interested in removing them.

I think it would be a one-line change from: https://github.com/NREL/buildstockbatch/blob/42c2cb18b7edc8929d26f794480f7d0b831b7423/buildstockbatch/base.py#L341-L343 to:

            if column not in param_option_dict:
                logger.warning(f'Column {column} in buildstock_csv is not available in options_lookup.tsv')
                continue
rajeee commented 1 year ago

Would it make more sense to leave the validation on but issue a warning for each column that is not found? It may be 70k rows, but probably not that many columns. ResStock similarly has thousands of no-op rows in their options_lookup.tsv, so they might also be interested in removing them.

I think it would be a one-line change from:

https://github.com/NREL/buildstockbatch/blob/42c2cb18b7edc8929d26f794480f7d0b831b7423/buildstockbatch/base.py#L341-L343

to:

            if column not in param_option_dict:
                logger.warning(f'Column {column} in buildstock_csv is not available in options_lookup.tsv')
                continue

This would result in large number of false warnings which would drown other legitimate warnings (It's false warning if we expect to see them in correct options_lookup.tsv). It will also desensitize people from warnings. I am not a fan of this approach.

Can we have a middle ground? Change the validation so that, if a TSV is missing in options_lookup.tsv we ignore it and give it a pass. But if it is present, all of the options in the TSV should be present in the options_lookup.tsv?

asparke2 commented 1 year ago

@rajeee I think your suggestion makes sense: if a column is found, definitely useful to check that all of the column values are found in the options_lookup. For ComStock, this would at least catch the worst errors

shorowit commented 1 year ago

@rajeee Are you saying you want this?

            if column not in param_option_dict:
                continue

That would prevent false warnings, but also legitimate issues, no?

Maybe what's needed is a way for ResStock and ComStock to define which columns are allowed/expected to be missing. Could be hardcoded into BSB or defined in the yaml. If a column isn't found, and it's not on the list of approved columns that can be missing, then an error is still thrown.

rajeee commented 1 year ago

@rajeee Are you saying you want this?

            if column not in param_option_dict:
                continue

That would prevent false warnings, but also legitimate issues, no?

Maybe what's needed is a way for ResStock and ComStock to define which columns are allowed/expected to be missing. Could be hardcoded into BSB or defined in the yaml. If a column isn't found, and it's not on the list of approved columns that can be missing, then an error is still thrown.

Yes. That's what I was suggesting. True, it will miss some legitimate issues. It's a compromise.

We won't be losing a lot though. The TSVs we won't be validating are the ones that don't have any measure impact. They are basically just pass-through into the output. So, maybe it's not too bad to let the TSV themselves the be single source of truth on what those outputs are going to be. We might have to adjust the ResStock integrity check to allow for this.

asparke2 commented 1 year ago

@shorowit there are 57 intentionally unused/no-op columns in the ComStock buildstock.csv that show up as warnings (see below). Most of these are intermediate steps in the sampling that are useful. If we made the change you suggested, then I can reduce the number of warnings dramatically by adding most of the column/value pairs to the options_lookup.tsv and would only need to add ~800 rows to do so. The only remaining warnings would be for the census tract, county, PUMA, and weather file name columns.

buildstock.csv column Number of unique values in column
nhgis_tract_gisjoin 57,721
nhgis_county_gisjoin 3,049
resstock_county_id 3,023
nhgis_puma_gisjoin 2,351
resstock_puma_id 2,337
weather_file_2018 1,033
weather_file_TMY3 880
year_built 219
reeds_balancing_area 134
year_bin_of_last_hvac_replacement 83
state_id 51
state_abbreviation 51
state_name 51
year_bin_of_last_walls_replacement 42
year_bin_of_original_building_construction 42
year_bin_of_last_windows_replacement 42
year_bin_of_last_roof_replacement 42
year_bin_of_last_service_water_heating_replacement 36
year_bin_of_last_exterior_lighting_replacement 36
year_bin_of_last_interior_equipment_replacement 34
year_bin_of_last_interior_lighting_replacement 26
american_housing_survey_region 25
climate_zone_ashrae_2004 17
energy_code_in_force_during_last_walls_replacement 15
energy_code_in_force_during_last_windows_replacement 15
energy_code_followed_during_last_windows_replacement 15
energy_code_in_force_during_last_hvac_replacement 15
energy_code_in_force_during_original_building_construction 15
energy_code_in_force_during_last_roof_replacement 15
energy_code_in_force_during_last_exterior_lighting_replacement 14
plugload_sch_base_peak_ratio_type 13
energy_code_in_force_during_last_interior_equipment_replacement 13
energy_code_in_force_during_last_interior_lighting_replacement 13
energy_code_in_force_during_last_service_water_heating_replacement 13
ltg_sch_base_peak_ratio_type 13
building_shape 12
resstock_custom_region 11
census_division_name_recs 11
census_division_name 9
region 9
climate_zone_building_america 8
iso_region 8
ownership_status 5
purchase_input_responsibility 4
census_region_name 4
party_responsible_for_operation 4
energy_code_followed_during_last_interior_lighting_replacement 2
building_size_lighting_tech 2
energy_code_compliance_roof 1
energy_code_compliance_interior_equipment 1
energy_code_compliance_during_original_building_construction 1
energy_code_compliance_walls 1
energy_code_compliance_interior_lighting 1
energy_code_compliance_exterior_lighting 1
energy_code_compliance_service_water_heating 1
energy_code_compliance_windows 1
energy_code_compliance_hvac 1
Grand Total 71,592
shorowit commented 1 year ago

True, it will miss some legitimate issues. It's a compromise.

We won't be losing a lot though.

I guess I don't agree. This came up enough times that we added these error checks. It can easily come up when dealing with different branches/forks of ResStock with different sets of TSVs, and then throw in developers having to resolve complicated merge conflicts at times. These error checks catch problems up front and prevent obscure downstream error messages that only get seen once the simulations start.

rajeee commented 1 year ago

I guess I don't agree. This came up enough times that we added these error checks. It can easily come up when dealing with different branches/forks of ResStock with different sets of TSVs, and then throw in developers having to resolve complicated merge conflicts at times. These error checks catch problems up front and prevent obscure downstream error messages that only get seen once the simulations start.

Can we get these downstream errors even for TSVs that don't pass any arguments to any measure? Those would be the only TSVs we won't be validating with the proposed approach. Regardless, I do agree that there is value in catching these errors up front (maybe someone modified one of these no-op TSV but introduced some typo), but am curious to the extent of impact not doing so has.

shorowit commented 1 year ago

Can we get these downstream errors even for TSVs that don't pass any arguments to any measure? Those would be the only TSVs we won't be validating with the proposed approach.

We could get errors if a TSV should pass arguments to a measure but doesn't. But if its arguments aren't used, then I'd say you are correct that we can't get downstream errors for those TSVs. Sorry, I didn't realize your proposed approach would only validate TSVs with arguments; I guess I don't fully understand the proposal.

rajeee commented 1 year ago

We could get errors if a TSV should pass arguments to a measure but doesn't. But if its arguments aren't used, then I'd say you are correct that we can't get downstream errors for those TSVs. Sorry, I didn't realize your proposed approach would only validate TSVs with arguments; I guess I don't fully understand the proposal.

Looks like I spoke wrong on that one.

Let's break it down. image

Currently, ResStock considers F4 to be incorrect, but ComStock considers it to be correct.

There are two proposals here. Proposal 1 (from me above): Change ResStock's interpretation of F4 to correct from incorrect. This won't cause downstream errors and we get to have much smaller options_lookup.tsv. This would also allow comstock to have some validation. I proposed doing that by simply ignoring all TSVs that are not present in options_lookup.tsv. This will result in incorrect TSVs in cell D3 and D4 to be still validated. But you are rightly noting that it will miss the incorrect TSV in F3 in the table above. I assumed that TSVs that need pass arguments would only be incorrect by not having all the rows (D3) or passing wrong arguments (E4) and didn't think about not being present at all (F3). I believe this is less likely mode of error, but still possible error nonetheless. So, it's a valid shortcoming of this approach.

Proposal 2: (from Andrew originally) Keep status quo and skip validation for comstock. ComStock loses out in chance to catch incorrect TSVs early on, but ResStock gets to keep its convention of F4=incorrect and catch almost all cases of incorrect TSVs.

Note that we currently do not have a way to catch E3, E4 in buildstockbatch. Though it is options_lookup error not a characteristics TSV error.

So, if the shortcomings of proposal 1 is not acceptable for ResStock, maybe we could apply proposal 1 only to ComStock. That way, the validator for ResStock catches D3, D4, F3 and F4 and for ComStock it will catch D3 and D4 and miss out on F3 (F4 is correct for ComStock.)

shorowit commented 1 year ago

Proposal 3: Define which columns are allowed/expected to be missing for ResStock and ComStock. Could be ~hardcoded into BSB or~ defined in the yaml or some other text file. If a column isn't found, and it's not on the list of approved columns that can be missing, then an error is thrown.

asparke2 commented 1 year ago

Please do not hard-code columns into buildstockbatch. It is difficult for ComStock to keep up with buildstockbatch changes as-is, the more we can avoid having new BSB versions due to added/removed/changed buildstock.csv or options_lookup.tsv changes.

rajeee commented 1 year ago

Proposal 3: Define which columns are allowed/expected to be missing for ResStock and ComStock. Could be ~hardcoded into BSB or~ defined in the yaml or some other text file. If a column isn't found, and it's not on the list of approved columns that can be missing, then an error is thrown.

Building on this proposal, instead of maintaining two different files (option_lookup.tsv and the proposed new text file) to keep track of required/not-required TSVs, can we introduce a wildcard "*" in options_lookup.tsv to refer to any option. So, option_lookup.tsv would look like:

required_char1, option_1, <measure_arguments>
required_char1, option_2, <measure_arguments>
...
not_required_char_1,*, <no measure arguments>
not_required_char_2,*, <no measure arguments>

This will only increase the number of rows in option_lookup.tsv slightly for ComStock (only the TSV name for the currently missing TSVs needs to be added instead of one row for each of its options).

Maybe this is even more complicating things, but I kinda prefer this over maintaining one extra text file. Thoughts?

asparke2 commented 1 year ago

@rajeee I'm happy with either your proposal of the no-op columns being defined with an * or listing out the no-op options for the columns. The main thing for ComStock is not throwing blocking errors when a column is found in the buildstock.csv but not in the options_lookup.tsv

shorowit commented 1 year ago

I like the idea of using wildcards, @rajeee! Seems like a nice compromise.

nmerket commented 1 year ago

Wildcards sound like a great solution. I also agree that we should avoid hardcoding things into buildstockbatch.

rajeee commented 1 year ago

Glad we reached an agreement with the wildcard approach!

@rajeee I'm happy with either your proposal of the no-op columns being defined with an * or listing out the no-op options for the columns. The main thing for ComStock is not throwing blocking errors when a column is found in the buildstock.csv but not in the options_lookup.tsv

You are planning to add the no-op TSV into options_lookup with * for ComStock after we implement this approach, right? If so, we can still throw blocking error when a column is found in buildstock.csv but not in options_lookup.tsv. Or, are you still considering to leave them out?

asparke2 commented 1 year ago

Glad we reached an agreement with the wildcard approach!

@rajeee I'm happy with either your proposal of the no-op columns being defined with an * or listing out the no-op options for the columns. The main thing for ComStock is not throwing blocking errors when a column is found in the buildstock.csv but not in the options_lookup.tsv

You are planning to add the no-op TSV into options_lookup with * for ComStock after we implement this approach, right? If so, we can still throw blocking error when a column is found in buildstock.csv but not in options_lookup.tsv. Or, are you still considering to leave them out?

Yeah, if the * becomes a valid option then I will list all the no-op columns in the buildstock.csv

rajeee commented 1 year ago

@asparke2 I added the wildcard feature. Please check.

nmerket commented 1 year ago

It looks like we're already checking and validating whether the sampler exists here.

https://github.com/NREL/buildstockbatch/blob/baf2eb6c9cee8a75939a89d9af703c7e41d6c295/buildstockbatch/base.py#L309-L315

I removed the redundant enumerations in the yaml schema.

@asparke2 I think this one looks good. I can merge if it comes back happy from CI. Do you want to do any additional testing of ComStock with it before I merge?

asparke2 commented 1 year ago

@nmerket @rajeee ComStock uses the string "NA" in the buildstock.csv and options_lookup.tsv. I believe that we never noticed this before because the measures that use "NA" interpret as a no-op, but now that all options are being validated, I think that changing read_csv to use: na_values=list(default_na_values - {"None", "NA"}) is the solution.

I don't know if ResStock uses "NA" anywhere.

asparke2 commented 1 year ago

@nmerket @rajeee the simulations are all failing because there are no valid options for the intentionally-unused parameters in the buildstock.csv file.

Intentionally-unused column in buildstock.csv: american_housing_survey_region

Error in out.osw file:

"step_errors" : 
[
   "Could not find parameter 'american_housing_survey_region' and option 'CBSA Riverside-San Bernardino-Ontario, CA' in /lib/resources/options_lookup.tsv.",

I think we will need to change this ComStock code and this ResStock code to know about the wildcard * and ignore that parameter for this code to work correctly.

asparke2 commented 1 year ago

This also makes me wonder if we should change * to something more descriptive like parameter_not_used_in_model_generation so that people editing the options_lookup.tsv are clearer about what this does and also the code in ResStock and ComStock that's parsing this are more obvious.