catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Create EIA entity tables #135

Closed cmgosnell closed 5 years ago

cmgosnell commented 6 years ago

As part of the transform process, once all of the dataframes for the data tables have been created and cleaned, we can create the entity tables by compiling all instances of each entity. The entities are (at least) plants, generators, boilers, and utilities. These tables collect the entity-level information that can then be pulled out of the more data-like tables (this especially needs to be done for 860).
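A minimal sketch of what that harvesting could look like, assuming a couple of already-cleaned dataframes that share a plant_id_eia column (the dataframe names and contents here are illustrative, not the actual PUDL API):

import pandas as pd

# Illustrative cleaned data tables; in PUDL these would come out of the
# EIA 860/923 transform steps.
gens = pd.DataFrame({"plant_id_eia": [1, 1, 2], "plant_name": ["A", "A", "B"]})
fuel = pd.DataFrame({"plant_id_eia": [2, 3], "plant_name": ["B", "C"]})

# Compile every instance of the entity across the data tables, then
# deduplicate down to one row per plant for the entity table.
plants_entity = (
    pd.concat([gens, fuel], sort=False)
    .drop_duplicates(subset="plant_id_eia")
    .reset_index(drop=True)
)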

cmgosnell commented 5 years ago

I'm thinking about removing the tables from 923 that are more like entity tables (generators_eia923 and boilers_eia923). The content won't be lost, because these tables are generated from other tables anyway, and this will make the ingest a little faster since it won't be normalizing things twice.

I'm also thinking about removing the prime_mover_code_eia923 table and just using the constants prime_movers_eia923 dictionary to enumerate the possible records. Any objections or issues I'm not seeing?
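For concreteness, enumerating from the constants dictionary might look like this (the dictionary contents below are an illustrative stand-in for the real prime_movers_eia923 mapping):

# Illustrative stand-in for the prime_movers_eia923 constants dictionary.
prime_movers_eia923 = {
    "ST": "Steam Turbine",
    "GT": "Combustion (Gas) Turbine",
    "HY": "Hydraulic Turbine",
}

# The allowed codes come straight from the dictionary keys, with no need
# to round-trip them through a database table.
valid_prime_mover_codes = set(prime_movers_eia923)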

zaneselvans commented 5 years ago

I think moving those entity-like pieces of info into the entity tables early on totally makes sense. I almost turned the prime mover codes / descriptions into a list of descriptive codes like with many of the other little glue tables, and I think that's the right thing to do, but I wasn't in a place to be mucking with the transform step at that point, so I only did the simplest re-org of those tables.

cmgosnell commented 5 years ago

I'm glad we're on the same page... because I've mostly already finished with this!

On making this work for the outputs, I'm thinking it might be nice to either a) consolidate all of the EIA output functions into one module, or b) create another EIA module. I'm leaning towards the former because I've removed so much of the content from many of the 860 tables in this harvesting process, including several full tables.

karldw commented 5 years ago

I've come across a minor issue with the location in the plants_entity_eia table: the (lon, lat) for Clayville in New Jersey (plant ID 58235) is (-39.424444, 39.424444). The actual location is (-75.0201, 39.4245). There could be other plants with issues like this; I only found this one because the point doesn't have a timezone in timezonefinder.

I'm not sure what the easiest way would be to add tests for the location. Maybe verifying that the lat/lon is in the state we expect?
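For context, here's roughly the timezonefinder check that surfaced this (a minimal sketch; note that recent timezonefinder releases may return an ocean zone like 'Etc/GMT+3' instead of None for offshore points):

from timezonefinder import TimezoneFinder

tf = TimezoneFinder()
# The erroneous coordinates land in the Atlantic, so no land timezone:
print(tf.timezone_at(lng=-39.424444, lat=39.424444))  # None
# The corrected coordinates resolve normally:
print(tf.timezone_at(lng=-75.0201, lat=39.4245))  # America/New_York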

gschivley commented 5 years ago

You could certainly check each plant to see if they are in the state in question (sample code). The next step would be assigning a timezone based on the state/county, which would require a pre-populated list of timezones for each state/county.

At that point, do we just assume that state/county info for each power plant is correct and merge the timezone in using the pre-populated list?
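One way to sketch that state check with geopandas and a Census cartographic boundary file (the shapefile path and the STUSPS column come from the Census file; the plants dataframe here is illustrative, and older geopandas versions use op= instead of predicate=):

import geopandas as gpd
import pandas as pd

plants = pd.DataFrame({
    "plant_id_eia": [58235],
    "state": ["NJ"],
    "longitude": [-39.424444],
    "latitude": [39.424444],
})
points = gpd.GeoDataFrame(
    plants,
    geometry=gpd.points_from_xy(plants.longitude, plants.latitude),
    crs="EPSG:4326",
)
# State boundaries, e.g. the Census cb_2018_us_state_20m shapefile.
states = gpd.read_file("cb_2018_us_state_20m.shp").to_crs("EPSG:4326")

# Flag plants whose point does not fall inside their reported state:
joined = gpd.sjoin(points, states[["STUSPS", "geometry"]], how="left", predicate="within")
bad_locations = joined[joined["STUSPS"] != joined["state"]]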

zaneselvans commented 5 years ago

@karldw That seems like maybe a data entry error on the part of the utility. Is it like that for all of the years, or just one? @cmgosnell is close to finished with the normalization of EIA 860, and I think her process ensures uniform values in those fields...

karldw commented 5 years ago

@zaneselvans, definitely a data entry error. Looking through the Excel sheets, it seems like it was reported erroneously in 2012 and correctly after that.

@cmgosnell, I think this is a small bug in the consistency normalization. Here's some demo code:

import pandas as pd
import pudl
# Testing note: this df needs at least 4 rows for the 70% consistency cutoff to matter
df = pd.DataFrame({
    "plant_id_eia": [1, 1, 1, 1], 
    "report_date": [2011, 2012, 2013, 2014], 
    "latitude": [2, -1, 2, 2]
})
clean_df = df.head(0)
plants_df = pd.DataFrame({"plant_id_eia": [1]})
pudl.transform.eia._lat_long(df, clean_df, plants_df, "latitude")
# This one works like it should:
#>    plant_id_eia  latitude latitude_consistent  occurances  report_date
#> 0             1         2                True         4.0         2011

# But this one doesn't, only changing the order of latitude:
df = pd.DataFrame({
    "plant_id_eia": [1, 1, 1, 1], 
    "report_date": [2011, 2012, 2013, 2014], 
    "latitude": [-1, 2, 2, 2]
})
pudl.transform.eia._lat_long(df, clean_df, plants_df, "latitude")
#>    plant_id_eia  latitude latitude_consistent  occurances  report_date
#> 0             1        -1                   1         4.0         2011

And then, because Python evaluates 1 == True as true (bool is a subclass of int), latitude_consistent gets treated as a true value and -1 is then treated as the correct, consistent latitude.
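A quick illustration of that equivalence (plain Python behavior, nothing PUDL-specific):

# bool is a subclass of int, so True compares equal to 1, and any nonzero
# integer is truthy:
print(1 == True)              # True
print(isinstance(True, int))  # True
print(bool(-1))               # True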

@cmgosnell and @zaneselvans, I can turn the code above into a test if you'd like. Should it go in etl_tests.py or somewhere else?

@gschivley, I made a mapping of states to timezones as part of my PR, but I'm not certain that was a good idea, since many states have multiple timezones. (A few counties do too.) I think it's worth getting the latitude/longitude right for a bunch of applications, not just timezones, and if we have lat/long, using timezonefinder avoids these edge cases. Thanks for the demo code!