catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

EIA 923 Potential Mismatch? #615

Closed briannacote closed 4 years ago

briannacote commented 4 years ago

Hi there,

I was wondering if I could reach out for some insight around 923 data pulled from an SQLite database created from the ETL process.

I recently setup an SQLite database with 860 and 923 data on a new server. I used years 2011-2018 to populate the data. I noticed some differences in the data I pulled from 923 compared to my previous setup on a different machine. This other machine still has an SQLite database filled with the same years for the 860 and 923 data that I created earlier this year or the end of last year. To note, I have not yet found any differences in the 860 data I pulled on both machines. These discrepancies seem to be just in the 923 data.

As an example, EIA plant ID 4, Walter Bouldin Dam, showed something different from before in the fuel_consumed_mmbtu and fuel_consumed_for_electricity_mmbtu columns of the generation_fuel_eia923 table. Here's a small, quick SQL snippet I created to pull data:

'''statement = ''' SELECT *
    from generation_fuel_eia923 
    where report_date >= '2018-01-01'
   and report_date <= '2018-12-01'
   and plant_id_eia=4
;
 '''

test = pd.read_sql(statement, pudl_engine)
print(test)

The values for the fuel_consumed_mmbtu and fuel_consumed_for_electricity_mmbtu columns for that year for this site (data reported monthly) do not match what comes directly from the EIA 923 data. My previous setup did and does since I have not changed the database since at least the start of the year. Furthermore, in general, I see these discrepancies predominantly for renewable assets (Wind, Solar and Hydro). But there are definitely some other technologies this also affects. I also see discrepancies in net_generation_mwh as well for some plants as well. The example I gave for plant_id_eia = 4 was not one of them.

In general, I see a few spots where this discrepancy could come from: -I somehow setup the databases differently on the two machines on my end. I realize the above example attaches to my pudl_engine which will be different from one you have. -Something changed in the tables that I'm not accounting for in my SQL statement. -Something changed on your end?

Any insight would be helpful, thank you! Brianna

briannacote commented 4 years ago

Please let me know if I can share additional information that would be helpful here!

zaneselvans commented 4 years ago

Hey Brianna, additional information that would be helpful:

briannacote commented 4 years ago

On my old machine, catalystcoop.pudl was installed via pip catalystcoop.pudl 0.3.1

On the new machine, catalystcoop.pudl was installed via conda catalystcoop.pudl 0.3.2 py_0

I downloaded fresh data for both from: pudl_data --sources eia923 eia860 --years 2011 2012 2013 2014 2015 2016 2017 2018

I did not use any of the archived Zenedo data archives.

My original setup matched the data from the spreadsheets directly from the EIA website for 923. My new setup shows the discrepancies I mentioned above.

The discrepancies are numerical differences. I still see the same number of plants from each data pull. At plant level these are generally smaller discrepancies but they add up over the US fleet. As an example, in most cases the new fuel_consumed_mmbtu values I see are less than previous.

Forgive the messiness of the attached screenshot. But this gives an idea the differences I am observing. These values sum up the 12 months in 2018 the data for each plant (based on prime mover and fuel type).

Screen Shot 2020-05-26 at 2 35 30 PM

Again the "Old" in this screenshot is from my old setup that matched exactly what I saw from the raw 923 excel files.

I hope this helps. Let me know what else I can answer.

briannacote commented 4 years ago

To help further what I'm seeing attached is a screenshot that shows the EIA 923 raw files against a data pull from my new PUDL setup for EIA Plant Id 4.

Here's the quick code I used to pull the data in this screenshot from the new database I created:

import pudl

#### General User Settings
YEAR = 2018
# This can be done for either Existing or Proposed sites. 
instance = "Existing"

#### Set up connection to a local SQLite database
pudl_settings = pudl.workspace.setup.get_defaults()
print("Attaching to database:",pudl_settings["pudl_db"])
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])

#### 923 Data Pull from SQLite DB 
statement_923 = ''' SELECT *
    from generation_fuel_eia923 
    where report_date >= '%s-01-01'
   and report_date <= '%s-12-01'
   and plant_id_eia=4; 
 ''' % (YEAR, YEAR)

print("Pulling 923 data for", YEAR)
df_923 = pd.read_sql(statement_923, pudl_engine)
print(df_923)

Screen Shot 2020-05-26 at 2 45 02 PM

zaneselvans commented 4 years ago

When you are comparing the data in your database against the data in the spreadsheet, are you comparing both databases against the same spreadsheet? Or are you comparing each database against the particular EIA spreadsheets that were used as the raw inputs for the creation of that database?

If you're comparing both databases against the spreadsheets that were downloaded and used to populate your old database, my guess is that what you're seeing is the result of revisions in the original EIA data. They frequently go back and alter it without any kind of notification! This is one of the reasons that we started creating the data releases and including the raw data that goes into them alongside the processed outputs -- otherwise it's impossible to ensure that you'll get the same results, since there's no versioning or archiving of the original spreadsheets being done by EIA.

If you haven't already, can you check the values in your new database against the values that you find in the spreadsheets that went into creating it? And/or check the values in the ye olde spreadsheets against the new spreadsheets?

briannacote commented 4 years ago

Ahhh, indeed you are correct on this! Good grief. I was worried it was something regarding setup. This answer is frustrating for an entirely different reason :P

Thanks for the help here. It's good to know. I appreciate the quick response! Hope all is well.

zaneselvans commented 4 years ago

Yeah, it's bonkers that there's no kind of version control on their data! @ptvirgo has been working on a system that will allow us to automate the archiving of these original datasets along with enough metadata to make them programmatically accessible on Zenodo. It's up and running on the Zenodo sandbox server, and the next version of PUDL will use these archives as its input, instead of pulling directly from EIA, so that we can always be sure that a given version of PUDL is pulling a particular version of the raw data. Otherwise we can't help but occasionally end up in a situation where they change the structure of the data, instead of just the values, causing a released version of the software to no longer work when pulling the original data (which is no longer the data that the software was designed to consume).

briannacote commented 4 years ago

That's awesome! And a really cool way to attack that issue. It's definitely not an easy issue. I look forward to getting setup on the newer version once that's available.