DOE Demand Flexibility Data Intake

dongqi-wu commented 2 years ago

1. DOE Flexibility Data

Source

Name: Demand Response Across the Continental US for 2006
Author: Seungwook Ma, Office of Energy Efficiency and Renewable Energy, DOE
Description: This project estimates hourly demand response availability across the continental U.S. for the year 2006. The resulting data set is disaggregated by balancing authority area, end use, and grid application. End uses include 14 categories across residential, commercial, industrial and municipal sectors. Grid applications include the 5 bulk power system services of regulation reserve, flexibility (or ramping) reserve, contingency reserve, energy, and capacity. Based on the physical requirements of the various bulk power system services and the estimated end use electric load shapes, potential availability of demand response is calculated and provided as a series of csv files.
Source: https://data.openei.org/submissions/180
Exact source location: https://data.openei.org/files/180/2006weatherentireusdrfilters.tar.zip
Terms (if specified): https://creativecommons.org/licenses/by/4.0/

Destination

Modifications to source file(s): new data processing files in PR #282
Location: PreREISE/gather/flexibilitydata/doe/cleaned.csv

General Purpose

The data is used to calculate a demand flexibility number for buses in the synthetic grid model by matching their geographical location to their corresponding LSE specified in the flexibility data.

2. USA Census Bureau Population Data

Source

Name: County Population Totals: 2010-2019
Author: U.S. Census Bureau
Description: This page features all the files containing Vintage 2019 county population totals and components of change.
Source: https://www.census.gov/data/datasets/time-series/demo/popest/2010s-counties-total.html#par_textimage_70769902
Exact source location: https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/totals/co-est2020-alldata.csv
Terms (if specified): https://www.census.gov/data/developers/about/terms-of-service.html

Destination

Modifications to source file(s): new data processing files in PR #282
Location: PreREISE/gather/flexibilitydata/raw/county_population.csv

General Purpose

The data is used to determine the weight share of ZIP codes and FIPS regions based on their population.
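
A minimal sketch of one way to compute a population-based weight share, assuming a table with one row per county. The column names `fips`, `state`, and `population`, the grouping by state, and the numbers are all hypothetical and are not taken from the Census file or from #282.

```python
import pandas as pd


def population_weights(df: pd.DataFrame, group_col: str, pop_col: str) -> pd.Series:
    """Return each row's population as a fraction of its group's total."""
    # transform("sum") broadcasts each group's total back onto its rows.
    group_totals = df.groupby(group_col)[pop_col].transform("sum")
    return df[pop_col] / group_totals


# Toy numbers for illustration only (assumption, not real Census data).
counties = pd.DataFrame(
    {
        "fips": ["01001", "01003", "02013"],
        "state": ["AL", "AL", "AK"],
        "population": [300, 100, 50],
    }
)
counties["weight"] = population_weights(counties, "state", "population")
```

The same helper could be applied to ZIP codes grouped by county, if a per-ZIP population estimate is available.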

3. FIPS and ZIP Conversion Data

Source

Name: HUD USPS ZIP CODE CROSSWALK FILES
Author: Ron Wilson and Alexander Din
Description: One of the many challenges that social science researchers and practitioners face is the difficulty of relating United States Postal Service (USPS) ZIP codes to Census Bureau geographies. There are valuable data available only at the ZIP code level that, when combined with demographic data tabulated at various Census geography levels, could open up new avenues of exploration.
Source: https://www.huduser.gov/portal/datasets/usps_crosswalk.html
Exact source location: https://www.huduser.gov/portal/datasets/usps/COUNTY_ZIP_122021.xlsx
Terms (if specified): Not specified

Destination

Modifications to source file(s): new data processing files in PR #282
Location: PreREISE/gather/flexibilitydata/raw/county_to_zip.xlsx

General Purpose

The data is used to compute a bidirectional mapping between ZIP codes and FIPS regions, which is stored as a cache file.
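
A rough sketch of such a bidirectional mapping, assuming the crosswalk has been loaded into a table with one row per (county, ZIP) pair. The lowercase column names `county` and `zip`, the helper names, and the use of `pickle` for the cache are assumptions for illustration; the actual code in #282 may differ.

```python
import pickle
from collections import defaultdict

import pandas as pd


def build_zip_fips_maps(crosswalk: pd.DataFrame):
    """Build ZIP -> FIPS and FIPS -> ZIP lookups from a crosswalk table."""
    zip_to_fips = defaultdict(list)
    fips_to_zip = defaultdict(list)
    # A ZIP code can span several counties and vice versa, so both
    # directions map to lists.
    for fips, zip_code in zip(crosswalk["county"], crosswalk["zip"]):
        zip_to_fips[zip_code].append(fips)
        fips_to_zip[fips].append(zip_code)
    return dict(zip_to_fips), dict(fips_to_zip)


def save_cache(obj, path):
    """Write a lookup to disk so it does not need to be rebuilt each run."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


# Toy rows for illustration only (assumption, not real crosswalk data).
crosswalk = pd.DataFrame(
    {"county": ["01001", "01001", "01003"], "zip": ["36003", "36006", "36006"]}
)
zip_to_fips, fips_to_zip = build_zip_fips_maps(crosswalk)
```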

4. LSE Service Region Data

Source

Name: U.S. Electric Utility Companies and Rates: Look-up by Zipcode (2019)
Author: Jay Huggins, NREL
Description: This dataset, compiled by NREL using data from ABB, the Velocity Suite and the U.S. Energy Information Administration dataset 861, provides average residential, commercial and industrial electricity rates with likely zip codes for both investor owned utilities (IOU) and non-investor owned utilities. Note: the files include average rates for each utility (not average rates per zip code), but not the detailed rate structure data found in the OpenEI U.S. Utility Rate Database.
Source: https://catalog.data.gov/dataset/u-s-electric-utility-companies-and-rates-look-up-by-zipcode-2019
Exact source location: https://data.openei.org/files/4042/non_iou_zipcodes_2019.csv; https://data.openei.org/files/4042/iou_zipcodes_2019.csv
Terms (if specified): http://opendefinition.org/licenses/cc-by/

Destination

Modifications to source file(s): new data processing files in PR #282
Location: PreREISE/gather/flexibilitydata/raw/iou_zipcodes_2019.csv; PreREISE/gather/flexibilitydata/raw/non_iou_zipcodes_2019.csv

General Purpose

The data is used to look up the service region (list of ZIP codes) for each LSE in the USA.
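
As an illustration of this lookup, the sketch below combines the IOU and non-IOU tables and groups ZIP codes by utility. It assumes each table has an integer `eiaid` column and a `zip` column; these column names, the helper name, and the sample IDs are hypothetical and should be checked against the actual files.

```python
import pandas as pd


def service_regions(*tables: pd.DataFrame) -> dict:
    """Map each utility ID to the sorted list of ZIP codes it serves."""
    combined = pd.concat(tables, ignore_index=True)
    grouped = combined.groupby("eiaid")["zip"]
    # A utility may appear in both files, so duplicates are dropped.
    return {int(eiaid): sorted(set(zips)) for eiaid, zips in grouped}


# Toy rows for illustration only (assumption, not real utility data).
iou = pd.DataFrame({"eiaid": [195, 195], "zip": ["35218", "35219"]})
non_iou = pd.DataFrame({"eiaid": [195, 4958], "zip": ["35218", "36003"]})
regions = service_regions(iou, non_iou)
```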

5. County and State FIPS codes

Source

Name: County and State FIPS codes
Author: Kieran Healy, Duke University
Description: Three CSV files with some basic FIPS identifying information for US States (state_fips_master.csv), Counties (county_fips_master.csv), and both together (state_and_county_fips_master.csv). I got sick of constantly having to write code to match on one or other of these identifiers in order to merge data files (e.g. for maps). So this can serve as a basis for harmonizing files that use one, some, or some variant of these identifiers. For example, sometimes leading zeros are omitted in the FIPS, sometimes not; sometimes the FIPS is coded in data as one number, sometimes as a character-vector of digits, sometimes as two separate state and county numbers, and so on. The Census also has its own supra-state units (regions and divisions). These files make it easier to merge and match to data indexed in one or other of these ways.
Source: https://github.com/kjhealy/fips-codes/
Exact source location: https://github.com/kjhealy/fips-codes/raw/master/county_fips_master.csv
Terms (if specified): N/A

Destination

Modifications to source file(s): new data processing files in PR #282
Location: PreREISE/gather/flexibilitydata/raw/county_fips_master.csv

General Purpose

The data is used to look up the FIPS number of each county by name.
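
A small sketch of a name-based FIPS lookup. Because county names repeat across states, it keys on the state and county together, and it zero-pads the code to five digits since leading zeros are sometimes dropped. The column names `fips`, `county_name`, and `state_abbr` are assumed here, and the rows are illustrative only.

```python
import pandas as pd


def build_fips_lookup(counties: pd.DataFrame) -> dict:
    """Map (state abbreviation, lowercase county name) to a 5-digit FIPS string."""
    lookup = {}
    for row in counties.itertuples(index=False):
        key = (row.state_abbr.upper(), row.county_name.lower())
        # Restore any leading zero lost when the code was parsed as a number.
        lookup[key] = str(row.fips).zfill(5)
    return lookup


# Illustrative rows only; column names are assumptions.
counties = pd.DataFrame(
    {
        "fips": [1001, 48001],
        "county_name": ["Autauga County", "Anderson County"],
        "state_abbr": ["AL", "TX"],
    }
)
fips_by_name = build_fips_lookup(counties)
```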

danielolsen commented 2 years ago

Location: PreREISE/gather/flexibilitydata/doe

Since there are many files that we're adding in #282, and the openei.org site doesn't list exactly which files are present within each of the zipped folders at https://data.openei.org/submissions/180, I think it would be useful to elaborate on which zipped folder(s) we're using and which file(s) within the folder(s) we're adding to the doe folder.

Are there other files within #282 that come from sources other than that openei.org submission?

dongqi-wu commented 2 years ago

I have updated the Issue to reflect the required changes.

danielolsen commented 2 years ago

Looking at the HUD USPS ZIP Code Crosswalk Files, I see a few differences between the file available from huduser.gov and the file that's in #282: a different number of data rows, a different order of the data, and a different sheet name that seems to imply an older version of the data (COUNTY_ZIP_092021 in #282 vs. SQLT0005 in the file from huduser.gov, which seems to be more recent based on the filename). If this is just a data versioning issue, we should update the version of the file that we're using to the latest.

Regarding "Modifications to source file(s)": we're not modifying this at all, right? In #282 I only see us reading in this data, not making any changes.

For single-sheet data tables like this, CSV files are generally more convenient to work with.

dongqi-wu commented 2 years ago

1. It seems the data gets updated quarterly. Ideally, the cache files and the EIAID to bus mapping should also be updated quarterly using the newest data.
2. Right. The raw data is only used to produce the cache .pkl files.
3. Do you think it is better to convert the xlsx files to CSV first? The read_excel() function in pandas works much like read_csv(), so it did not make much difference.

danielolsen commented 2 years ago

1. We don't necessarily need to use the latest version at all times, since I don't expect the underlying data to change much, but we should at least be consistent between the files that are in #282 and the places that we say we got the files from.
2. If we aren't modifying this data, then we can just say "Modifications to source file(s): None" (for this one and the rest as well).
3. CSVs are preferable. Pandas has similar functions for loading both, but for other uses .xlsx files are less convenient: e.g. examining the difference between two versions in Git, viewing the files in the browser on GitHub, etc.

dongqi-wu commented 2 years ago

I see. So I should 1) edit the USPS data link to point to exactly the same file, and 2) change the modifications to "None" for everything except the xlsx file, which should be converted to CSV first?

danielolsen commented 2 years ago

Since we know about the newer file now, let's use that one in #282 (as a CSV). Then we can update the modifications to say 'converted from xlsx to CSV' or something like that.
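
For reference, a conversion along these lines could look roughly like the sketch below. The function names are hypothetical, and reading every column as text is an assumption made here so that ZIP and FIPS codes are not reinterpreted as numbers; `pd.read_excel` also needs an Excel engine such as openpyxl to be installed.

```python
import io

import pandas as pd


def convert_xlsx_to_csv(xlsx_path, csv_path, sheet_name=0):
    """Convert one worksheet of an .xlsx file to CSV."""
    # dtype=str asks pandas to keep values as text rather than
    # inferring numeric types for code-like columns.
    table = pd.read_excel(xlsx_path, sheet_name=sheet_name, dtype=str)
    table.to_csv(csv_path, index=False)


def read_crosswalk_csv(source):
    """Read the converted CSV, again keeping every column as text."""
    return pd.read_csv(source, dtype=str)


# Toy CSV content for illustration only (assumption, not the HUD file).
sample = io.StringIO("county,zip\n01001,36003\n")
crosswalk = read_crosswalk_csv(sample)
```

Reading with `dtype=str` on both sides keeps a code such as `01001` from being loaded as the integer 1001 when the CSV is read back.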

merrielle commented 1 year ago

closed with #282

Breakthrough-Energy / PreREISE