USEPA / standardizedinventories

Standardized Release and Waste Inventories
MIT License

eGRID 2020 column name change #153

Closed dt-woods closed 4 months ago

dt-woods commented 5 months ago

Stewi's getInventoryFacilities method for "eGRID" returns a data frame whose column names come from row 1 of the "PLNT16" or "PLNT20" worksheet of the corresponding downloaded Excel workbook (e.g., 'egrid2016.xlsx' and 'eGRID2020_Data_v2.xlsx'); however, these names are not consistent between 2016 and 2020. Row 2 of each worksheet holds a keyword for the column, which appears to be consistent across years. The inconsistency makes multi-year datasets harder to work with (i.e., I have to check for multiple column names rather than make a single check against the keyword).
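For illustration, the multi-name check described above might look like the following minimal sketch. The helper `find_column` and the constant `FUEL_CATEGORY_NAMES` are hypothetical names introduced here and are not part of stewi; the two descriptive names are the ones quoted later in this issue.

```python
# Descriptive names that carry the PLFUELCT keyword, as quoted in this issue.
FUEL_CATEGORY_NAMES = (
    "Plant primary coal/oil/gas/ other fossil fuel category",  # eGRID 2016
    "Plant primary fuel category",  # eGRID 2020
)


def find_column(columns, candidates):
    """Return the first candidate name present in `columns`, or None."""
    available = set(columns)
    for name in candidates:
        if name in available:
            return name
    return None


# Abbreviated column lists; any iterable of names works, e.g. df.columns.
cols_2016 = [
    "FacilityID",
    "Plant primary fuel",
    "Plant primary coal/oil/gas/ other fossil fuel category",
]
cols_2020 = ["FacilityID", "Plant primary fuel"]

print(find_column(cols_2016, FUEL_CATEGORY_NAMES))
print(find_column(cols_2020, FUEL_CATEGORY_NAMES))  # None: column is absent
```

Every new naming variant has to be appended to the candidate tuple by hand, which is the maintenance burden a single keyword check would avoid.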

When I test Stewi's getInventoryFacilities method for "eGRID" 2020, the result appears to be missing the primary fuel category.

Reproducible example:

>>> import stewi
>>> df_2016 = stewi.getInventoryFacilities("eGRID", 2016)
>>> df_2020 = stewi.getInventoryFacilities("eGRID", 2020)
>>> df_2016.columns
Index(['FacilityID', 'FacilityName', 'Address', 'City', 'State', 'Zip',
       'Latitude', 'Longitude', 'County', 'NAICS', 'SIC', 'UrbanRural',
       'Plant operator name', 'Balancing Authority Name',
       'Balancing Authority Code', 'NERC region acronym',
       'eGRID subregion acronym', 'Plant primary fuel',
       'Plant primary coal/oil/gas/ other fossil fuel category',  # Fuel category
       'Plant coal generation percent (resource mix)',
       'Plant oil generation percent (resource mix)',
       'Plant gas generation percent (resource mix)',
       'Plant nuclear generation percent (resource mix)',
       'Plant hydro generation percent (resource mix)',
       'Plant biomass generation percent (resource mix)',
       'Plant wind generation percent (resource mix)',
       'Plant solar generation percent (resource mix)',
       'Plant geothermal generation percent (resource mix)',
       'Plant other fossil generation percent (resource mix)',
       'Plant other unknown / purchased fuel generation percent (resource mix)'],
      dtype='object')
>>> df_2020.columns
Index(['FacilityID', 'FacilityName', 'Address', 'City', 'State', 'Zip',
       'Latitude', 'Longitude', 'County', 'NAICS', 'SIC', 'UrbanRural',
       'Plant operator name', 'Balancing Authority Name',
       'Balancing Authority Code', 'NERC region acronym',
       'eGRID subregion acronym', 'Plant primary fuel',  # Fuel category missing!!!
       'Plant coal generation percent (resource mix)',
       'Plant oil generation percent (resource mix)',
       'Plant gas generation percent (resource mix)',
       'Plant nuclear generation percent (resource mix)',
       'Plant hydro generation percent (resource mix)',
       'Plant biomass generation percent (resource mix)',
       'Plant wind generation percent (resource mix)',
       'Plant solar generation percent (resource mix)',
       'Plant geothermal generation percent (resource mix)',
       'Plant other fossil generation percent (resource mix)',
       'Plant other unknown / purchased fuel generation percent (resource mix)'],
      dtype='object')
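A quicker way to spot the gap than reading the two listings side by side is an ordered set difference on the column names. This is only a sketch: `missing_columns` is a hypothetical helper, and the lists below are abbreviated stand-ins for `df_2016.columns` and `df_2020.columns`.

```python
def missing_columns(reference, other):
    """List names in `reference` that are absent from `other`, keeping order."""
    present = set(other)
    return [name for name in reference if name not in present]


# Abbreviated stand-ins for df_2016.columns and df_2020.columns.
cols_2016 = [
    "FacilityID",
    "Plant primary fuel",
    "Plant primary coal/oil/gas/ other fossil fuel category",
    "Plant coal generation percent (resource mix)",
]
cols_2020 = [
    "FacilityID",
    "Plant primary fuel",
    "Plant coal generation percent (resource mix)",
]

print(missing_columns(cols_2016, cols_2020))  # names only in 2016
print(missing_columns(cols_2020, cols_2016))  # names only in 2020
```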

In the eGRID 2016 data file, column W, "Plant primary coal/oil/gas/ other fossil fuel category", carries the "PLFUELCT" keyword I am looking for.

[Image: egrid2016_data]

In 2020, the column has moved to column Y and is renamed "Plant primary fuel category" (note the keyword is still "PLFUELCT") and, for some reason, is not included in the data frame.

[Image: egrid2020_data]

Would it be possible to include the PLFUELCT column in all eGRID facility datasets?

bl-young commented 5 months ago

Thanks, good spot. I will take a closer look as soon as I can, but I expect we can get this updated.

bl-young commented 5 months ago

For some reason we were missing that column from the 2020 and 2021 files, perhaps because the name changed; I'm not quite sure. I will add that we use the file I edited in 39a49bb to ensure consistent field names across years, a need we recognized a few years back when we started adding more years. It is not the cleanest system, but it should ensure the same column names across data years for the facilities file (this specific issue notwithstanding).
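As a rough illustration of the general idea (not the actual file edited in 39a49bb, whose format is not shown in this thread), a per-year lookup can translate each year's source column name into one shared name. `FIELD_NAMES` and `standardize` are hypothetical, and using the keyword as the shared name is an assumption made only for this sketch.

```python
# Hypothetical per-year lookup: source column name -> shared output name.
# The source names are the ones quoted in this issue; using the PLFUELCT
# keyword as the shared name is an assumption of this sketch.
FIELD_NAMES = {
    2016: {"Plant primary coal/oil/gas/ other fossil fuel category": "PLFUELCT"},
    2020: {"Plant primary fuel category": "PLFUELCT"},
}


def standardize(columns, year):
    """Rename known columns for `year`; pass unknown names through unchanged."""
    mapping = FIELD_NAMES.get(year, {})
    return [mapping.get(name, name) for name in columns]


raw_2016 = ["Plant primary fuel",
            "Plant primary coal/oil/gas/ other fossil fuel category"]
raw_2020 = ["Plant primary fuel", "Plant primary fuel category"]

print(standardize(raw_2016, 2016))
print(standardize(raw_2020, 2020))
```

With a lookup like this, a renamed source column that has no entry for its year passes through under its new name instead of the shared one, so downstream code selecting the shared name would not find it. That is one plausible way a column could go missing after a rename.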

bl-young commented 5 months ago

I've confirmed that column has been added for 2020 and 2021. I will pull into master but may not be able to update data commons immediately w/ a new release for egrid data (v1.1.3).

bl-young commented 5 months ago

thank you for reporting @dt-woods!

dt-woods commented 5 months ago

I appreciate the quick turnaround, Ben, and look forward to the release of the new revision.

bl-young commented 2 weeks ago

@dt-woods the new processed egrid files for 2020 and 2021 are now up in data commons for v1.1.3