catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Some null fields in the EIA860 data #325

Closed briannacote closed 4 years ago

briannacote commented 5 years ago

Describe the bug There are some null fields in a few areas of the EIA860 data. I reinitialized my data this morning. I believe I do have the latest git repository on my machine as well (downloaded last night). I tracked down a few examples of what I am seeing.

To Reproduce Please see the screen shots attached. As one example, minimum load is null for one unit of Plant EIA ID 54. For another plant, 6483 (EIA code), the state/county are null.

This was not the case previously. In the raw EIA 860 data this information is not null.

Thanks for the help! Let me know if you need more information from me or you believe it is a potential setup issue on our side.

Best, Bri

MissingMinLoad MissingStateAndCounty
briannacote commented 5 years ago

For the Minimum Load it looks like the nulls are showing up where EIA860 reports 0MW if that is anything helpful.

briannacote commented 5 years ago

Hello there, I'm just following up on what I've noticed here. Here's another example of where I see this happening:

SELECT --sum(g.capacity_mw) g.plant_id_eia, g.generator_id, g.technology_description, p.plant_name, p.county, p.state, p.latitude, p.longitude, g.capacity_mw, g.nameplate_power_factor, g.summer_capacity_mw, g.winter_capacity_mw, g.minimum_load_mw, pm.operating_date, g.retirement_date, g.time_cold_shutdown_full_load_code, pm.prime_mover_code, g.energy_source_code_1 FROM generators_eia860 g inner join plants_entity_eia p on g.plant_id_eia =p.plant_id_eia right join generators_entity_eia pm on (g.plant_id_eia =pm.plant_id_eia and g.generator_id =pm.generator_id) where g.report_date = '2017-01-01' --and f.report_date = '2017-01-01' --and g.energy_source_code_1 in('BIT', 'RC','SGC','SUB','LIG') and g.plant_id_eia = '621' --and pm.prime_mover_code = 'ST' --and f.prime_mover_code in ('ST') --and f.fuel_type ='BIT' and p.state not in ('AK', 'HI') and g.operational_status_code in ('existing','OP','OS','SB','OA') order by g.plant_id_eia, g.capacity_mw;

I had to repopulate my database recently. I wasn't seeing this before then though. Maybe I should re-initialize?

Screen Shot 2019-07-01 at 10 04 47 AM

Thanks for the help!

cmgosnell commented 5 years ago

Hey @briannacote! Sorry for the slow response on this!! Most of these harvested fields are not perfect. We choose to pull the most consistent record as reported across all the EIA tables and years, but we also required a "strictness" level of 70%. That means at least 70% of the records must be the same for us to use that value. So if values for an entity haven't been reported 70% consistently, then it will show up as a null value. I built in the ability to add special cases for columns where we want to apply a different method to, but the only ones we added was for latitude and longitude because they are by far the dirtiest.

Overall this results in harvested records for upwards of 98% when all of the years are being ingested, and higher rates for single years of data. Every column but the lat/long columns must produce 95% of acceptable values otherwise the initialization would fail due to assertions in the code. Because lat/long is much more messy we set this constraint to 92% for those columns.

We decided that for now this was generally acceptable based on the low number of null fields, but has this been a blocking issue for you? I think the simplest way to get around this would be to select a few columns that are causing you trouble and run them through some special case harvesting process. The absolute easiest would be to run them through the same selection process but with a lower strictness rate would decrease the number of null fields, but I'm not sure if that is actually a good idea. The other most obvious option is to choose the most recently reported record if there isn't a consistently reported value. Again, that wouldn't be perfect by any means but it would certainly reduce user confusion.

joshdr83 commented 5 years ago

That makes sense. I like the idea of using the last record if it does not pass the strictness test, perhaps with a flag that indicates that? Does the test include years for when the plant was in the "proposed" dataset? I think that dataset often doesn't include lat/long or much location information... Thanks!

cmgosnell commented 5 years ago

I know we compile the proposed and retired tabs all into one table, so I think so.. I don't remember exactly. Although the harvesting selection process does ignore null values, so I don't think that's the problem. When I was building out the harvesting process I spent a while trying to determine why the lat/long was so messy and it seemed like there were just truly tons of inconsistent records without any major pattern.

I'm also more and more liking the take the most recent record method as the fall back. It will almost certainly be incorrect some of the time, but it will match what is reported and that will cause much much less confusion.

cmgosnell commented 4 years ago

subsumed within Issue #446