Closed bendnorman closed 8 months ago
@TrentonBush, notebooks/37-bdn-explore-gridstatus-data.ipynb
is a good starting place for interacting with the archived gridstatus data.
The archives are somewhat up to date. The combination of flaky ISO APIs and gridstatus basic pandas cleaning becoming out of date make the archiving a little flaky :/
@TrentonBush I did some basic cleaning of the grid status data. A few of thoughts:
While trying to write transformed results to a .parquet file, I ran into an issue of mixed types in the output.:
d = df.convert_dtypes().dtypes
obj_cols = d[d.eq('object')].index
for c in obj_cols:
types = gs_dfs_dec17[c].apply(type).unique()
print(f"{c}:\t\t{types}\n")
output:
proposed_completion_date: [<class 'datetime.datetime'> <class 'pandas._libs.missing.NAType'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.nattype.NaTType'>]
queue_date: [<class 'datetime.datetime'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.nattype.NaTType'>]
queue_id: [<class 'str'> <class 'int'>]
summer_capacity_mw: [<class 'float'> <class 'pandas._libs.missing.NAType'> <class 'str'>]
winter_capacity_mw: [<class 'float'> <class 'pandas._libs.missing.NAType'> <class 'str'>]
withdrawal_comment: [<class 'pandas._libs.missing.NAType'>]
Just pushed a change that should correct the mixed dtype issue.
Ok! I realized by filtering by "Status" == "Active"
I was removing a lot of projects that LBNL considered "active" and "nearly certain". Now I have some logic in each ISO cleaning function to filter out in service and withdrawn projects. I also normalized the data into projects and capacities. @TrentonBush your normalize_multicolumns_to_rows
function was life saver thank you!
With these changes the overall capacity differences look pretty good for each ISO. I think the variability can be explained by the 12 month delay for SPP and ISONE and the 4 month delay for all other ISOs. There are a couple of remaining issues that I uncovered using notebook 42:
OK I finally added the deduplication and fixed the output tests. I think this is ready to go?
Yes! Let's merge it in.
This PR creates a function for caching and extracting grid status interconnection queue data.
I also upgraded: