catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Debug data issues in 2009-2010 EIA860 ETL #484

Closed by zaneselvans 4 years ago

zaneselvans commented 4 years ago

ETL almost works for EIA 860, but the transform step is failing. Several things probably need to be fixed. The ones we know about:

zaneselvans commented 4 years ago

@swinter2011 were there any other data related issues that you encountered when you were testing the 2009-2010 EIA860 ETL? I feel like you mentioned them somewhere but... I don't know where that is.

zaneselvans commented 4 years ago

Discovered that in the process of enforcing uniform types on the columns in the dataframes, we also ended up inadvertently converting some NaN values into the string "nan", since that's what you get when you do str(np.nan). I patched a hack into pudl.transform.eia._occurances in which those "nan" strings are turned back into true NaN values before dropna() is called. Pandas 1.0.0 will address a lot of these issues, with dedicated String, Boolean, and Integer column datatypes, all of which use the pandas.NA value to indicate missing data.
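
For anyone hitting the same thing, here is a minimal sketch of the problem and the workaround described above. It is not the actual PUDL code: the column values and the restore_nan helper are made up for illustration, and it assumes pandas 1.0 or later for the "string" dtype at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical column of state codes with one missing value (not real EIA 860 data).
raw = pd.Series(["NV", np.nan, "CA"], dtype=object)

# Applying str() element-wise turns the missing value into the text "nan",
# so pandas no longer treats it as missing.
as_text = raw.map(str)
n_missing_after_cast = int(as_text.isna().sum())


def restore_nan(series: pd.Series) -> pd.Series:
    """Hypothetical helper: turn literal "nan" strings back into real NaN."""
    return series.where(series != "nan", np.nan)


# With the NaN values restored, dropna() removes the missing row again.
cleaned = restore_nan(as_text)
remaining = cleaned.dropna().tolist()

# The dedicated string dtype keeps the value missing (as pandas.NA) on conversion.
as_string_dtype = raw.astype("string")
n_missing_string_dtype = int(as_string_dtype.isna().sum())
```

Without the restore step, dropna() on as_text keeps all three rows, including the bogus "nan" one. Note that this kind of replacement would also clobber any legitimate value that happens to be the literal text "nan".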

zaneselvans commented 4 years ago

This appears to be done now -- the ETL completes successfully. We still need to update the entity mappings to include the 2009-2010 entities from EIA 860, especially plants. See #529