Breakthrough-Energy / PreREISE

Generate input data for scenario framework
https://breakthrough-energy.github.io/docs/
MIT License

DOE Demand Flexibility Data Intake #290

Closed dongqi-wu closed 1 year ago

dongqi-wu commented 2 years ago

1. DOE Flexibility Data

Source

Destination

General Purpose

The data is used to calculate a demand flexibility number for buses in the synthetic grid model by matching their geographical location to their corresponding LSE specified in the flexibility data.

2. USA Census Bureau Population Data

Source

Destination

General Purpose

The data is used to determine the weight share of each ZIP code and FIPS region based on its population.
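A minimal sketch of what a population-based weight share could look like. The function name `population_shares`, the input shapes, and the sample numbers are hypothetical and not taken from #282; for simplicity it assumes each ZIP code belongs to a single FIPS region, which is not true in general.

```python
from collections import defaultdict


def population_shares(zip_population, zip_to_fips):
    """Return {fips: {zip: share}}, where the shares in each FIPS region sum to 1.

    zip_population: {zip_code: population}  (hypothetical input shape)
    zip_to_fips:    {zip_code: fips_code}   (assumes one FIPS region per ZIP)
    """
    # Total population of each FIPS region, summed over its ZIP codes.
    totals = defaultdict(float)
    for zip_code, fips in zip_to_fips.items():
        totals[fips] += zip_population.get(zip_code, 0)

    shares = defaultdict(dict)
    for zip_code, fips in zip_to_fips.items():
        total = totals[fips]
        # Skip regions with no recorded population to avoid dividing by zero.
        if total > 0:
            shares[fips][zip_code] = zip_population.get(zip_code, 0) / total
    return dict(shares)


# Illustrative, made-up codes and populations.
example = population_shares(
    {"00001": 300, "00002": 100, "00003": 50},
    {"00001": "99001", "00002": "99001", "00003": "99002"},
)
```

ZIP and FIPS codes are kept as strings here so that leading zeros are not lost.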

3. FIPS and ZIP Conversion Data

Source

General Purpose

The data is used to compute a bi-directional mapping between ZIP and FIPS regions, which is stored as a cache file.
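As a rough illustration, a bi-directional mapping of this kind could be built from (ZIP, FIPS) pairs and cached with `pickle`. The helper names `build_zip_fips_maps`, `save_cache`, and `load_cache` are hypothetical, and the actual cache layout in #282 may differ.

```python
import pickle
from collections import defaultdict


def build_zip_fips_maps(pairs):
    """Build both directions of the mapping from an iterable of (zip, fips) pairs."""
    zip_to_fips = defaultdict(set)
    fips_to_zip = defaultdict(set)
    for zip_code, fips in pairs:
        # A ZIP code can span several counties, and a county holds many ZIP codes.
        zip_to_fips[zip_code].add(fips)
        fips_to_zip[fips].add(zip_code)
    # Sorted lists keep the cached output deterministic between runs.
    return (
        {z: sorted(f) for z, f in zip_to_fips.items()},
        {f: sorted(z) for f, z in fips_to_zip.items()},
    )


def save_cache(obj, path):
    """Write an object to a pickle cache file."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_cache(path):
    """Read an object back from a pickle cache file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Storing both dictionaries means neither direction has to be recomputed from the raw crosswalk on each lookup.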

4. LSE Service Region Data

Source

General Purpose

The data is used to look up the service region (list of ZIP codes) for each LSE in the USA.

5. County and State FIPS codes

Source

General Purpose

The data is used to look up the FIPS number of each county by name.

danielolsen commented 2 years ago

Location: PreREISE/gather/flexibilitydata/doe

Since there are many files that we're adding in #282, and the openei.org site doesn't list exactly which files are present within each of the zipped folders at https://data.openei.org/submissions/180, I think it would be useful to elaborate on which zipped folder(s) we're using and which file(s) within the folder(s) that we're adding to the doe folder.

Are there other files within #282 that come from sources other than that openei.org submission?

dongqi-wu commented 2 years ago

I have updated the Issue to reflect the required changes.

danielolsen commented 2 years ago

Looking at the HUD USPS ZIP Code Crosswalk Files, I see a few differences between the file available from huduser.gov and the file that's in #282: a different number of data rows, a different order of the data, and a different sheet name that seems to imply an older version of the data (COUNTY_ZIP_092021 in #282 vs. SQLT0005 in the file from huduser.gov, which seems to be more recent based on the filename). If this is just a data versioning issue, we should update the version of the file that we're using to the latest.

Regarding "Modifications to source file(s)": we're not modifying this at all, right? In #282 I only see us reading in this data, not making any changes.

For single-sheet data tables like this, CSV files are generally more convenient to work with.
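One way to check whether two versions of a table differ in content or only in row order, assuming both have been exported to CSV text. This is a sketch using only the standard library; `diff_rows` is a hypothetical helper, not something in #282.

```python
import csv
import io
from collections import Counter


def diff_rows(old_csv_text, new_csv_text):
    """Compare two CSV tables while ignoring row order.

    Returns (only_in_old, only_in_new) as Counters keyed by row tuples.
    """

    def row_counts(text):
        # Count each row so that duplicated rows are compared correctly.
        return Counter(tuple(row) for row in csv.reader(io.StringIO(text)))

    old, new = row_counts(old_csv_text), row_counts(new_csv_text)
    # Counter subtraction keeps only the rows with a positive surplus.
    return old - new, new - old


# Illustrative, made-up data: same rows reordered, plus one new row.
old_text = "zip,county\n00001,99001\n00002,99001\n"
new_text = "zip,county\n00002,99001\n00001,99001\n00003,99002\n"
removed, added = diff_rows(old_text, new_text)
```

Two empty results would mean the files hold the same rows in a different order; anything else points to a genuine version difference.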

dongqi-wu commented 2 years ago

  1. It seems the data gets updated quarterly. Ideally, the cache files and the EIAID-to-bus mapping should also be updated quarterly using the newest data.
  2. Right. The raw data is only used to produce the cache .pkl files.
  3. Do you think it is better to convert the xlsx files to CSV first? The read_excel() function in pandas works much like read_csv(), so it did not make much difference.

danielolsen commented 2 years ago

  1. We don't necessarily need to be using the latest version at all times, since I don't expect the underlying data to change much, but we should at least be consistent between the files that are in #282 and the places that we say we got the files from.
  2. If we aren't modifying this data, then we can just say "Modifications to source file(s): None" (for this one and the rest as well).
  3. CSVs are preferable. Pandas has similar functions for loading both, but for other uses .xlsx files are less convenient: e.g. examining the difference between two versions in Git, viewing the files in the browser on GitHub, etc.

dongqi-wu commented 2 years ago

I see. So I should 1) update the USPS data link so that it points to exactly the same file, and 2) change the modifications to "None" for everything except the xlsx file, which should be converted to CSV first?

danielolsen commented 2 years ago

Since we know about the newer file now, let's use that one in #282 (as a CSV). Then we can update the modifications to say 'converted from xlsx to CSV' or something like that.

merrielle commented 1 year ago

closed with #282