catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

iso_rto_code not populated in PudlTabl dataframes #301

Closed gschivley closed 5 years ago

gschivley commented 5 years ago

I noticed that the iso_rto_code column only contains None values in MCOE. Could it trace up to the plants_eia860 table definition? That table doesn't have iso_rto_code in it and that's where the values are in the SQL database. This is a whole part of the codebase that I'm very unfamiliar with though. @cmgosnell?

cmgosnell commented 5 years ago

Hmm. Thanks for finding this. I moved the iso_rto_code to plants_entity_eia through the EIA normalization process (Issue #135). So in theory the iso_rto_code should be populated in the plants_entity_eia table but it looks like the plants_entity_eia table is being merged into plants_eia860 in the output function (line 104 of pudl/outputs/eia860).... Is that the part of the PudlTabl object that you are looking at? I'm mucking with some of the guts of the ETL process right now so I don't have an working db to poke at the outputs right now.

gschivley commented 5 years ago

It might help if you can briefly explain how the entity tables work, and if entity tables are supposed to correlate with database tables of the same name. The SQL table plants_entity_eia is just a list of plant IDs. Does the plants_entity_eia PudlTabl object use those EIA plant IDs to pull in columns from other tables?

The SQL table plants_eia860 has columns iso_rto (varchar with values Y and N) and iso_rto_code (also varchar), neither of which are included in the PudlTabl object PlantEIA860. The PudlTabl object PlantEntityEIA has String Columns iso_rto_code and 'iso_rto_name`. These column names are also used in constants.py and in the datapackage json definition.

If I had to guess, the issue might be that the PudlTabl object PlantEIA860 doesn't pull either iso_rto_code or iso_rto from the SQL table. On top of that, there is no column iso_rto_name in the SQL database even though it's used in several places.

cmgosnell commented 5 years ago

Sure! So when we originally pulled in 860 it was sort of rushed and we didn't really change the structure of the original spreadsheets so we had a lot of duplicate values in multiple tables. We also had our hand matched pudl ids as the backbone of the plant and utility tables

The entity tables first we're just the ID's so we could not have our hand mapped pudl_id tables (like plants_eia) be the canonical place for all of the eia entity ids. So I just found all of the tables with plant_id_eia for instance and stored them in the entity tables with new FK relationships.

Then we "harvested" all of the static and annual columns to add into the entity tables and some of the 860 tables. That all happens in pudl/transform/eia but mostly through the harvesting process. I could explain more about how that process is works but mostly it is finding the most correct value for a entity or entity/year and planting it in the entity table or back into the annual 860 tables like plants_eia860... and then removing whatever those columns in the dictionary in constants from their native tables.

Have you re-initialized the datable in a while? From this description it sounds like you might have an old version of the database trying to interact with the updated output object. Because the plants_eia860 table should not have the iso_rto column. I could be super wrong about that, but it's an easy guess.

gschivley commented 5 years ago

I merged in the master branch yesterday, deleted all data files and the SQL database, then followed the getting started docs to download data and rebuild the database. So it should be current.

cmgosnell commented 5 years ago

haha yes that sounds current to me! It's weird that your db has SQL table plants_eia860 has columns iso_rto andiso_rto_code. If you look at the models, theiso_rto_nameandiso_rto_coee` are in plant_entity_eia and not in plants_eia860.

I will say that I wouldn't be surprised to find that I was sloppy in updating the PudlTabl object to reflect the normalization changes, especially for columns like the iso/rto columns that aren't actually used in the any of the calcs.

I am updating my db right now so hopefully I can poke at the tables/output functions to see which space is causing the error. I expect it's something in the output function, but I have been wrong before!

zaneselvans commented 5 years ago

I suspect that the plants_utils_* functions in the output module need to be updated to reflect the new homes of all the static and annual entity values, since IIRC, that is where these values get merged in with the data tables that have annual/monthly records.

cmgosnell commented 5 years ago

@gschivley which PudlTabl method are you using that isn’t pulling in the iso_rto_code?

gschivley commented 5 years ago

It shows up in pudl_out.gens_eia860(). I discovered it in the pudl_output_to_csv.ipynb example notebook when looking at mcoe and worked back up the chain some.

cmgosnell commented 5 years ago

just to clarify, are you saying that the codes do show up in pudl_out.gens_eia860() or they do not? I'm seeing codes from gens_eia860 and from pudl_out.mcoe()

gschivley commented 5 years ago

None is the only unique value for iso_rto_code in pudl_out.mcoe and pudl_out.gens_eia860.

zaneselvans commented 5 years ago

Hmm... when I do:

pudl_out = pudl.output.pudltabl.PudlTabl()
gens_eia860_df = pudl_out.gens_eia860()
gens_eia860_df.iso_rto_code.unique()

I get:

array([None, 'MISO', 'CAISO', 'SPP', 'NYISO', 'ERCOT', 'PJM', 'OTHER',
       'ISONE', 'NBSO'], dtype=object)
cmgosnell commented 5 years ago

For me...

pudl_out = pudl.output.pudltabl.PudlTabl(freq="MS") 
mcoe_df = pudl_out.mcoe()
mcoe_df.iso_rto_code.unique()

also results in:

array([None, 'SPP', 'ERCOT', 'OTHER', 'CAISO', 'MISO', 'PJM', 'ISONE',
       'NYISO'], dtype=object)

:-/

gschivley commented 5 years ago

What years of data did you load into the database? I only brought in 2017 to keep the size down. I'm running the same code and get:

array([None], dtype=object)

If you guys have multiple years I'll try adding more later and see if that somehow fixes the problem.

zaneselvans commented 5 years ago

Oooh, @cmgosnell interesting. Why are they different? What is NBSO and why doesn't it show up in the mcoe output?!

@gschivley I have all of the available years loaded. This is certainly the source of the discrepancy. At some point the iso_rto information stopped being reported, and they switched to reporting balancing authority, which you can find in e.g. gens_eia860.balancing_authority_code and gens_eia860.balancing_authority_name The codes I see for the full dataset are:

array(['SOCO', 'EPE', 'MISO', 'CISO', 'TVA', 'AEC', 'PJM', 'NPPD', None,
       'WACM', 'NYIS', 'SRP', 'SECI', 'AZPS', 'TEPC', 'CSWS', 'SC', 'SEC',
       'BANC', 'WALC', 'PACE', 'TID', 'GRDA', 'SPA', 'JEA', 'WR', 'PACW',
       'ERCO', 'IID', 'LDWP', 'BPAT', 'PSCO', 'WAUW', 'ISNE', 'AVA',
       'FMPP', 'PSEI', 'FPL', 'SCL', 'TEC', 'GVL', 'HST', 'NSB', 'TAL',
       'SCEG', 'SEPA', 'OKGE', 'IPCO', 'EEI', 'GCPD', 'FPC', 'OVEC',
       'CPLE', 'AECI', 'EDE', 'KCPL', 'WFEC', 'KACY', 'LGEE', 'NBSO',
       'SWPP', 'MPS', 'INDN', 'SPRM', 'NWMT', 'OPPD', 'LES', 'NEVP',
       'SPS', 'PNM', 'DUK', 'PGE', 'CHPD', 'DOPD', 'TPWR', 'WAUE', 'YAD',
       'DEAA', 'GRMA', 'CSTO', 'HGMA', 'GWA', 'WWA', 'HECO', 'IS'],
      dtype=object)

@cmgosnell Given that the ISO/RTO associations should be awfully constant over time, I wonder if it would be good to propagate the ISO/RTO codes/names throughout all the available years for generators that do have codes for at least one of their historical records? I'm sure there are other fields for which this kind of discontinuity exists, and that we might improve the usability of the DB by filling them in where appropriate.

cmgosnell commented 5 years ago

New Brunswick!

And we basically did do that. The iso_rto_code is static field for generators, so when any years w/ codes are imported it will show up, but if only years w/ no data are pulled in then this will happen where it just looks empty... We could either automatically pull in another year (which would take more time) or label it to make it clear what's happening.

zaneselvans commented 5 years ago

Ahhh, okay. So all of this is just flowing from the fact that only a single year was being loaded by @gschivley. Ugh. It's really just not going to be practical for us to debug every possible combination of years and data sources. I think having the standardized datapackage outputs published will help avoid this kind of confusion.

cmgosnell commented 5 years ago

🙌 can't wait!

gschivley commented 5 years ago

My SQL database has a column iso_rto_code in the plants_eia860 table with non-null values, but the report dates are from 2011. Which is weird, because I don't have 860 data from 2011 in my data folder. I'll look into this more later, but that's why I seem to have seen the values in SQL and was confused as to why they didn't show up in PudlTabl.

gschivley commented 5 years ago

I'm an idiot and was viewing a remote version of the pudl SQL database that I created at least a year ago rather than the local one that I created last week. Closing this issue.