asparke2 closed this 1 year ago
I'm not exactly aware of the situation here, so I may be completely misinterpreting what's going on... But note that ResStock also has columns in the buildstock.csv that are not used for simulation and are only used for sampling. These are often called meta-parameters, and they are still added to the options_lookup.tsv to prevent validation errors. (For example, HVAC Cooling Type doesn't provide direct arguments for an OpenStudio measure; rather, it's only used as a dependency for downstream parameters in sampling.) Is it problematic for ComStock to likewise just add these meta-parameters to the options_lookup.tsv? Simply bypassing the validation can obviously mean that legitimate errors are missed.
@rajeee I think this looks good with your proposed change. For ComStock, we'll figure out how to add checks later
File | Coverage | |
---|---|---|
All files | 85% | :white_check_mark: |
base.py | 89% | :white_check_mark: |
eagle.py | 79% | :white_check_mark: |
exc.py | 57% | :white_check_mark: |
local.py | 50% | :white_check_mark: |
postprocessing.py | 84% | :white_check_mark: |
utils.py | 96% | :white_check_mark: |
sampler/base.py | 79% | :white_check_mark: |
sampler/downselect.py | 33% | :white_check_mark: |
sampler/precomputed.py | 93% | :white_check_mark: |
sampler/residential_quota.py | 61% | :white_check_mark: |
test/test_docker.py | 33% | :white_check_mark: |
test/test_validation.py | 97% | :white_check_mark: |
workflow_generator/base.py | 90% | :white_check_mark: |
workflow_generator/commercial.py | 53% | :white_check_mark: |
workflow_generator/residential_hpxml.py | 84% | :white_check_mark: |
Minimum allowed coverage is 33%
Generated by :monkey: cobertura-action against 1bcecac3c02ccb19c8871f100259534aa44816b4
@asparke2 This looks fine to me. I would've preferred for that validation to be inside the sampler class for better organization, but I see why it wasn't. My only comment echoes @shorowit's above. Any thoughts about that?
@nmerket @shorowit I took a look at implementing your suggestion of adding no-op options_lookup.tsv rows for the unused buildstock.csv columns. This would require adding ~70,000 additional rows to the options_lookup.tsv, ~60k of which are just the unique set of census tract IDs. I think this would make that file even more difficult to edit. So my preference at this point is to just skip the validation for ComStock.
Would it make more sense to leave the validation on but issue a warning for each column that is not found? It may be 70k rows, but probably not that many columns. ResStock similarly has thousands of no-op rows in their options_lookup.tsv, so they might also be interested in removing them.
I think it would be a one-line change from: https://github.com/NREL/buildstockbatch/blob/42c2cb18b7edc8929d26f794480f7d0b831b7423/buildstockbatch/base.py#L341-L343 to:
```python
if column not in param_option_dict:
    logger.warning(f'Column {column} in buildstock_csv is not available in options_lookup.tsv')
    continue
```
This would result in a large number of false warnings, which would drown out other legitimate warnings. (It's a false warning if we'd expect to see it even with a correct options_lookup.tsv.) It would also desensitize people to warnings. I am not a fan of this approach.
Can we have a middle ground? Change the validation so that, if a TSV is missing from options_lookup.tsv, we ignore it and give it a pass; but if it is present, all of the options in the TSV must be present in options_lookup.tsv.
@rajeee I think your suggestion makes sense: if a column is found, it's definitely useful to check that all of the column values are found in the options_lookup. For ComStock, this would at least catch the worst errors.
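A minimal sketch of that middle-ground check (the function name and data shapes here are illustrative, not the actual buildstockbatch API; `param_option_dict` is assumed to map each parameter to its set of valid options):

```python
def validate_present_columns(buildstock_columns, param_option_dict):
    """Middle-ground check: columns absent from options_lookup.tsv get a
    pass, but every option of a column that IS present must also appear
    in options_lookup.tsv.

    buildstock_columns: {column_name: iterable of values from buildstock.csv}
    param_option_dict:  {parameter_name: set of valid options}
    """
    errors = []
    for column, values in buildstock_columns.items():
        if column not in param_option_dict:
            continue  # meta-parameter / no-op column: skip validation
        for option in sorted(set(values) - param_option_dict[column]):
            errors.append(
                f"Option '{option}' in column '{column}' of buildstock_csv "
                f"is not available in options_lookup.tsv"
            )
    return errors
```

With this shape, a typo inside a tracked column is still caught, while a column like `nhgis_tract_gisjoin` that never appears in options_lookup.tsv is silently ignored.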
@rajeee Are you saying you want this?
```python
if column not in param_option_dict:
    continue
```
That would prevent false warnings, but it would also hide legitimate issues, no?
Maybe what's needed is a way for ResStock and ComStock to define which columns are allowed/expected to be missing. Could be hardcoded into BSB or defined in the yaml. If a column isn't found, and it's not on the list of approved columns that can be missing, then an error is still thrown.
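A rough sketch of this allow-list idea; the function, the example set, and any yaml key that would feed it are all hypothetical, not existing buildstockbatch features:

```python
# Hypothetical allow-list; in practice this could be read from the
# project yaml rather than hardcoded.
ALLOWED_MISSING_EXAMPLE = {"nhgis_tract_gisjoin", "weather_file_2018"}

def check_column_presence(buildstock_columns, param_option_dict, allowed_missing):
    """Error on columns missing from options_lookup.tsv, unless they are
    explicitly approved to be missing."""
    errors = []
    for column in buildstock_columns:
        if column in param_option_dict or column in allowed_missing:
            continue  # known parameter, or approved-to-be-missing column
        errors.append(
            f"Column {column} in buildstock_csv is not available "
            f"in options_lookup.tsv"
        )
    return errors
```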
Yes. That's what I was suggesting. True, it will miss some legitimate issues. It's a compromise.
We won't be losing a lot though. The TSVs we won't be validating are the ones that don't have any measure impact; they are basically just pass-through into the output. So maybe it's not too bad to let the TSVs themselves be the single source of truth on what those outputs are going to be. We might have to adjust the ResStock integrity check to allow for this.
@shorowit there are 57 intentionally unused/no-op columns in the ComStock buildstock.csv that show up as warnings (see below). Most of these are useful intermediate steps in the sampling. If we made the change you suggested, I could reduce the number of warnings dramatically by adding most of the column/value pairs to the options_lookup.tsv, and would only need to add ~800 rows to do so. The only remaining warnings would be for the census tract, county, PUMA, and weather file name columns.
buildstock.csv column | Number of unique values in column |
---|---|
nhgis_tract_gisjoin | 57,721 |
nhgis_county_gisjoin | 3,049 |
resstock_county_id | 3,023 |
nhgis_puma_gisjoin | 2,351 |
resstock_puma_id | 2,337 |
weather_file_2018 | 1,033 |
weather_file_TMY3 | 880 |
year_built | 219 |
reeds_balancing_area | 134 |
year_bin_of_last_hvac_replacement | 83 |
state_id | 51 |
state_abbreviation | 51 |
state_name | 51 |
year_bin_of_last_walls_replacement | 42 |
year_bin_of_original_building_construction | 42 |
year_bin_of_last_windows_replacement | 42 |
year_bin_of_last_roof_replacement | 42 |
year_bin_of_last_service_water_heating_replacement | 36 |
year_bin_of_last_exterior_lighting_replacement | 36 |
year_bin_of_last_interior_equipment_replacement | 34 |
year_bin_of_last_interior_lighting_replacement | 26 |
american_housing_survey_region | 25 |
climate_zone_ashrae_2004 | 17 |
energy_code_in_force_during_last_walls_replacement | 15 |
energy_code_in_force_during_last_windows_replacement | 15 |
energy_code_followed_during_last_windows_replacement | 15 |
energy_code_in_force_during_last_hvac_replacement | 15 |
energy_code_in_force_during_original_building_construction | 15 |
energy_code_in_force_during_last_roof_replacement | 15 |
energy_code_in_force_during_last_exterior_lighting_replacement | 14 |
plugload_sch_base_peak_ratio_type | 13 |
energy_code_in_force_during_last_interior_equipment_replacement | 13 |
energy_code_in_force_during_last_interior_lighting_replacement | 13 |
energy_code_in_force_during_last_service_water_heating_replacement | 13 |
ltg_sch_base_peak_ratio_type | 13 |
building_shape | 12 |
resstock_custom_region | 11 |
census_division_name_recs | 11 |
census_division_name | 9 |
region | 9 |
climate_zone_building_america | 8 |
iso_region | 8 |
ownership_status | 5 |
purchase_input_responsibility | 4 |
census_region_name | 4 |
party_responsible_for_operation | 4 |
energy_code_followed_during_last_interior_lighting_replacement | 2 |
building_size_lighting_tech | 2 |
energy_code_compliance_roof | 1 |
energy_code_compliance_interior_equipment | 1 |
energy_code_compliance_during_original_building_construction | 1 |
energy_code_compliance_walls | 1 |
energy_code_compliance_interior_lighting | 1 |
energy_code_compliance_exterior_lighting | 1 |
energy_code_compliance_service_water_heating | 1 |
energy_code_compliance_windows | 1 |
energy_code_compliance_hvac | 1 |
Grand Total | 71,592 |
> True, it will miss some legitimate issues. It's a compromise.
> We won't be losing a lot though.
I guess I don't agree. This came up enough times that we added these error checks. It can easily come up when dealing with different branches/forks of ResStock with different sets of TSVs, and then throw in developers having to resolve complicated merge conflicts at times. These error checks catch problems up front and prevent obscure downstream error messages that only get seen once the simulations start.
Can we get these downstream errors even for TSVs that don't pass any arguments to any measure? Those would be the only TSVs we won't be validating with the proposed approach. Regardless, I do agree that there is value in catching these errors up front (maybe someone modified one of these no-op TSVs but introduced a typo), but I'm curious about the extent of the impact of not doing so.
We could get errors if a TSV should pass arguments to a measure but doesn't. But if its arguments aren't used, then I'd say you are correct that we can't get downstream errors for those TSVs. Sorry, I didn't realize your proposed approach would only validate TSVs with arguments; I guess I don't fully understand the proposal.
Looks like I misspoke on that one.
Let's break it down.
Currently, ResStock considers F4 to be incorrect, but ComStock considers it to be correct.
There are two proposals here. Proposal 1 (from me above): change ResStock's interpretation of F4 from incorrect to correct. This won't cause downstream errors, and we get to have a much smaller options_lookup.tsv. This would also allow ComStock to have some validation. I proposed doing that by simply ignoring all TSVs that are not present in options_lookup.tsv. This means the incorrect TSVs in cells D3 and D4 would still be validated. But you are rightly noting that it will miss the incorrect TSV in F3 in the table above. I assumed that TSVs that need to pass arguments would only be incorrect by not having all the rows (D3) or passing wrong arguments (E4), and didn't think about one not being present at all (F3). I believe this is a less likely mode of error, but still a possible one nonetheless. So, it's a valid shortcoming of this approach.
Proposal 2 (from Andrew originally): keep the status quo and skip validation for ComStock. ComStock loses out on the chance to catch incorrect TSVs early on, but ResStock gets to keep its convention of F4 = incorrect and catch almost all cases of incorrect TSVs.
Note that we currently do not have a way to catch E3 or E4 in buildstockbatch, though those are options_lookup errors, not characteristics TSV errors.
So, if the shortcomings of proposal 1 are not acceptable for ResStock, maybe we could apply proposal 1 only to ComStock. That way, the validator for ResStock catches D3, D4, F3, and F4, while for ComStock it will catch D3 and D4 and miss out on F3 (F4 is correct for ComStock).
Proposal 3: Define which columns are allowed/expected to be missing for ResStock and ComStock. Could be ~~hardcoded into BSB or~~ defined in the yaml or some other text file. If a column isn't found, and it's not on the list of approved columns that can be missing, then an error is thrown.
Please do not hard-code columns into buildstockbatch. It is difficult for ComStock to keep up with buildstockbatch changes as-is; the more we can avoid requiring new BSB versions due to added/removed/changed buildstock.csv or options_lookup.tsv contents, the better.
> Proposal 3: Define which columns are allowed/expected to be missing for ResStock and ComStock. Could be ~~hardcoded into BSB or~~ defined in the yaml or some other text file. If a column isn't found, and it's not on the list of approved columns that can be missing, then an error is thrown.
Building on this proposal, instead of maintaining two different files (options_lookup.tsv and the proposed new text file) to keep track of required/not-required TSVs, can we introduce a wildcard "*" in options_lookup.tsv to refer to any option? So, options_lookup.tsv would look like:
```
required_char1, option_1, <measure_arguments>
required_char1, option_2, <measure_arguments>
...
not_required_char_1, *, <no measure arguments>
not_required_char_2, *, <no measure arguments>
```
This would only increase the number of rows in options_lookup.tsv slightly for ComStock (only the TSV name for each currently missing TSV needs to be added, instead of one row for each of its options).
Maybe this is complicating things even more, but I kind of prefer this over maintaining one extra text file. Thoughts?
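To illustrate, here is a small sketch of how a wildcard-aware lookup could be parsed and queried; this is a simplified stand-in, not the actual buildstockbatch implementation, and the sample rows are hypothetical:

```python
import csv
import io

# Hypothetical options_lookup.tsv excerpt; '*' marks a no-op parameter
# whose options are never passed to a measure.
SAMPLE_LOOKUP = (
    "required_char1\toption_1\tMeasureA arg1=x\n"
    "required_char1\toption_2\tMeasureA arg1=y\n"
    "not_required_char_1\t*\t\n"
)

def load_options_lookup(fileobj):
    """Parameter name -> set of allowed options ('*' means any)."""
    lookup = {}
    for row in csv.reader(fileobj, delimiter="\t"):
        if not row:
            continue
        lookup.setdefault(row[0], set()).add(row[1])
    return lookup

def option_allowed(lookup, param, option):
    """True if the parameter exists and the option is listed or wildcarded."""
    options = lookup.get(param)
    return options is not None and ("*" in options or option in options)
```

Because the wildcard parameters still appear in the file, an unknown column in buildstock.csv can remain a blocking error.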
@rajeee I'm happy with either your proposal of the no-op columns being defined with an * or listing out the no-op options for the columns. The main thing for ComStock is not throwing blocking errors when a column is found in the buildstock.csv but not in the options_lookup.tsv
I like the idea of using wildcards, @rajeee! Seems like a nice compromise.
Wildcards sound like a great solution. I also agree that we should avoid hardcoding things into buildstockbatch.
Glad we reached an agreement with the wildcard approach!
> @rajeee I'm happy with either your proposal of the no-op columns being defined with an * or listing out the no-op options for the columns. The main thing for ComStock is not throwing blocking errors when a column is found in the buildstock.csv but not in the options_lookup.tsv
You are planning to add the no-op TSVs into options_lookup.tsv with * for ComStock after we implement this approach, right? If so, we can still throw a blocking error when a column is found in buildstock.csv but not in options_lookup.tsv. Or are you still considering leaving them out?
Yeah, if the * becomes a valid option then I will list all the no-op columns in the buildstock.csv
@asparke2 I added the wildcard feature. Please check.
It looks like we're already checking and validating whether the sampler exists here.
I removed the redundant enumerations in the yaml schema.
@asparke2 I think this one looks good. I can merge if it comes back happy from CI. Do you want to do any additional testing of ComStock with it before I merge?
@nmerket @rajeee ComStock uses the string "NA" in the buildstock.csv and options_lookup.tsv. I believe we never noticed this before because the measures that use "NA" interpret it as a no-op, but now that all options are being validated, I think the solution is changing read_csv to use `na_values=list(default_na_values - {"None", "NA"})`.
I don't know if ResStock uses "NA" anywhere.
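For illustration, a self-contained sketch of that read_csv change using pandas. The `default_na_values` set below is hand-written to approximate pandas' defaults (the real default set lives in pandas internals), so treat it as an assumption:

```python
import io

import pandas as pd

# Hand-written stand-in for pandas' default NA tokens, so the sketch
# doesn't depend on pandas internals.
default_na_values = {"", "#N/A", "N/A", "NA", "NULL", "NaN", "None", "nan", "null"}

csv_text = "HVAC Cooling Type\nNA\nNaN\n"
df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,  # start from an empty NA set...
    na_values=list(default_na_values - {"None", "NA"}),  # ...then add back all but these
)
# "NA" survives as a literal string, while "NaN" is still parsed as missing.
```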
@nmerket @rajeee the simulations are all failing because there are no valid options for the intentionally-unused parameters in the buildstock.csv file.
Intentionally-unused column in buildstock.csv: american_housing_survey_region
Error in the out.osw file:
```
"step_errors" :
[
  "Could not find parameter 'american_housing_survey_region' and option 'CBSA Riverside-San Bernardino-Ontario, CA' in /lib/resources/options_lookup.tsv.",
```
I think we will need to change this ComStock code and this ResStock code to know about the wildcard `*` and ignore that parameter for this code to work correctly.
This also makes me wonder if we should change `*` to something more descriptive like `parameter_not_used_in_model_generation`, so that people editing the options_lookup.tsv are clearer about what it does, and so the code in ResStock and ComStock that parses it is more obvious.
Soften buildstock.csv validation to throw warnings instead of errors for ComStock compatibility. The buildstock.csv file in ComStock contains some columns that are intermediate steps in the sampling process which are intentionally not used for simulation. We keep these intermediate sampling steps because they help explain the sampling process, and we join some of these into the results.csv data because they are useful to users of the dataset.