Closed dongqi-wu closed 1 year ago
Location: PreREISE/gather/flexibilitydata/doe
Since there are many files that we're adding in #282, and the openei.org site doesn't list exactly which files are present within each of the zipped folders at https://data.openei.org/submissions/180, I think it would be useful to elaborate on which zipped folder(s) we're using and which file(s) within the folder(s) we're adding to the doe folder.
Are there other files within #282 that come from sources other than that openei.org submission?
I have updated the Issue to reflect the required changes.
Looking at the HUD USPS ZIP Code Crosswalk Files, I see a few differences between the file available from huduser.gov and the file that's in #282: a different number of data rows, a different order of the data, and a different sheet name that seems to imply an older version of the data (COUNTY_ZIP_092021 in #282 vs. SQLT0005 in the file from huduser.gov, which seems to be more recent based on the filename). If this is just a data versioning issue, we should update the version of the file that we're using to the latest.
Regarding "Modifications to source files(s)": we're not modifying this at all, right? In #282 I only see us reading in this data, not making any changes.
For single-sheet data tables like this, CSV files are generally more convenient to work with.
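As a rough illustration of the comparison described above, here is a minimal sketch that checks two versions of a crosswalk table for row-level differences while ignoring row order. It assumes both versions have already been exported to CSV; the column names (`zip`, `county`) and the sample rows are placeholders for illustration, not the actual HUD schema or data.

```python
import csv
import io


def load_rows(csv_text, key_columns):
    """Return the set of key tuples found in a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {tuple(row[col] for col in key_columns) for row in reader}


def compare_versions(old_text, new_text, key_columns=("zip", "county")):
    """Summarize row-level differences between two versions, ignoring row order."""
    old_rows = load_rows(old_text, key_columns)
    new_rows = load_rows(new_text, key_columns)
    return {
        "only_in_old": sorted(old_rows - new_rows),
        "only_in_new": sorted(new_rows - old_rows),
        "shared": len(old_rows & new_rows),
    }


# Hypothetical sample data: the same pairs in a different order, plus one new row.
old_csv = "zip,county\n98101,53033\n97201,41051\n"
new_csv = "zip,county\n97201,41051\n98101,53033\n98004,53033\n"
summary = compare_versions(old_csv, new_csv)
```

Using sets makes the check insensitive to row order, so a reordered but otherwise identical file reports no differences.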
1. It seems the data gets updated quarterly. Ideally, the cache files and the EIAID-to-bus mapping should also be updated quarterly using the newest data.
2. Right. The raw data is only used to produce the cache .pkl files.
3. Do you think it is better to convert the xlsx files to csv first? The read_excel() function in pandas works much like read_csv(), so it did not make much difference.
- We don't necessarily need to be using the latest at all times, since I don't expect the underlying data to change too much, but we should at least be consistent between the files that are in #282 and the places that we say we got the files from.
- If we aren't modifying this data, then we can just say "Modifications to source files(s): None." (for this one and the rest as well).
- CSVs are preferable. Pandas has similar functions for loading both, but for other uses .xlsx files are less convenient: e.g. examining the difference between two versions in Git, viewing the files in the browser on GitHub, etc.
I see. So I should 1) edit the USPS data link to point to exactly the same file, and 2) change the modifications to "None" for everything except the xlsx file, which should be converted to CSV first?
Since we know about the newer file now, let's use that one in #282 (as a CSV). Then we can update the modifications to say 'converted from xlsx to CSV' or something like that.
closed with #282
1. DOE Flexibility Data
Source
Destination
General Purpose
The data is used to calculate a demand flexibility number for buses in the synthetic grid model by matching their geographical location to their corresponding LSE specified in the flexibility data.
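A minimal sketch of how such a matching could work, assuming each bus has already been assigned a ZIP code and each LSE comes with a list of served ZIP codes. The function name, the input dictionaries, the averaging rule for ZIP codes served by several LSEs, and all sample values are assumptions for illustration; they are not taken from #282.

```python
def flexibility_for_buses(bus_zip, lse_zips, lse_flexibility):
    """Assign each bus the mean flexibility of the LSEs serving its ZIP code."""
    # Invert the LSE -> ZIP list into a ZIP -> LSE list.
    zip_to_lses = {}
    for lse, zips in lse_zips.items():
        for zip_code in zips:
            zip_to_lses.setdefault(zip_code, []).append(lse)

    result = {}
    for bus, zip_code in bus_zip.items():
        values = [
            lse_flexibility[lse]
            for lse in zip_to_lses.get(zip_code, [])
            if lse in lse_flexibility
        ]
        # Buses whose ZIP code is not served by any known LSE get None.
        result[bus] = sum(values) / len(values) if values else None
    return result


# Hypothetical inputs for illustration only.
bus_zip = {1: "98101", 2: "97201", 3: "00000"}
lse_zips = {"LSE_A": ["98101", "98004"], "LSE_B": ["97201", "98101"]}
lse_flexibility = {"LSE_A": 0.25, "LSE_B": 0.75}

bus_flexibility = flexibility_for_buses(bus_zip, lse_zips, lse_flexibility)
```

Returning None for unmatched buses keeps missing coverage visible instead of silently defaulting to zero.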
2. USA Census Bureau Population Data
Source
Destination
General Purpose
The data is used to determine the weight share of ZIP codes and FIPS regions based on their population.
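One plausible reading of "weight share" is each ZIP code's fraction of the population of the FIPS region it belongs to. The sketch below follows that reading; the function name, the one-county-per-ZIP simplification, and the population figures are assumptions for illustration.

```python
from collections import defaultdict


def zip_weights_by_fips(zip_population, zip_to_fips):
    """Return {fips: {zip: share}} where shares within a FIPS region sum to 1."""
    # Total population per FIPS region.
    totals = defaultdict(float)
    for zip_code, population in zip_population.items():
        totals[zip_to_fips[zip_code]] += population

    # Each ZIP code's share of its region's total.
    weights = defaultdict(dict)
    for zip_code, population in zip_population.items():
        fips = zip_to_fips[zip_code]
        weights[fips][zip_code] = population / totals[fips] if totals[fips] else 0.0
    return dict(weights)


# Hypothetical population figures for illustration only.
zip_population = {"98101": 3000, "98004": 1000, "97201": 500}
zip_to_fips = {"98101": "53033", "98004": "53033", "97201": "41051"}

weights = zip_weights_by_fips(zip_population, zip_to_fips)
```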
3. FIPS and ZIP Conversion Data
Source
Destination
General Purpose
The data is used to compute a bi-directional mapping between ZIP and FIPS regions, which is stored as a cache file.
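A minimal sketch of building such a two-way mapping and caching it as a .pkl file. The function names, the dictionary keys (`zip2fips`, `fips2zip`), and the cache filename are hypothetical; only the use of pickle cache files comes from the discussion above.

```python
import os
import pickle
import tempfile
from collections import defaultdict


def build_bidirectional_maps(pairs):
    """Build ZIP -> [FIPS] and FIPS -> [ZIP] from (zip, fips) pairs."""
    zip_to_fips = defaultdict(list)
    fips_to_zip = defaultdict(list)
    for zip_code, fips in pairs:
        # A ZIP code can span several counties, so both sides map to lists.
        if fips not in zip_to_fips[zip_code]:
            zip_to_fips[zip_code].append(fips)
        if zip_code not in fips_to_zip[fips]:
            fips_to_zip[fips].append(zip_code)
    return dict(zip_to_fips), dict(fips_to_zip)


def save_cache(obj, path):
    """Write the mapping to a pickle cache file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_cache(path):
    """Read the mapping back from a pickle cache file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Illustrative sample pairs; a real run would read them from the crosswalk file.
pairs = [("98101", "53033"), ("98004", "53033"), ("97201", "41051")]
zip_to_fips, fips_to_zip = build_bidirectional_maps(pairs)

cache_path = os.path.join(tempfile.mkdtemp(), "zip_fips_cache.pkl")
save_cache({"zip2fips": zip_to_fips, "fips2zip": fips_to_zip}, cache_path)
restored = load_cache(cache_path)
```

Regenerating the cache whenever the source file is updated keeps the stored mapping consistent with the documented source version.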
4. LSE Service Region Data
Source
Destination
General Purpose
The data is used to look up the service region (list of ZIP codes) for each LSE in the USA.
5. County and State FIPS codes
Source
Destination
General Purpose
The data is used to look up the FIPS number of each county by name.
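A minimal sketch of a name-based lookup. Because county names repeat across states, this version keys on the state as well as the county name; that choice, the name normalization, the function names, and the sample records are assumptions for illustration rather than the approach used in #282.

```python
def normalize(name):
    """Lower-case a county name and drop a trailing 'County' or 'Parish'."""
    name = name.strip().lower()
    for suffix in (" county", " parish"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name


def build_fips_lookup(records):
    """Build {(state, normalized county name): fips} from (state, county, fips) rows."""
    return {
        (state.upper(), normalize(county)): fips
        for state, county, fips in records
    }


def lookup_fips(lookup, state, county):
    """Return the FIPS code for a county, or None if it is not in the table."""
    return lookup.get((state.upper(), normalize(county)))


# Illustrative sample records.
records = [("WA", "King County", "53033"), ("OR", "Multnomah County", "41051")]
lookup = build_fips_lookup(records)

king = lookup_fips(lookup, "wa", "King")
missing = lookup_fips(lookup, "WA", "Nowhere County")
```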