Closed: zaneselvans closed this issue 10 months ago
I think @nelsonauner is working on this.
Yeah - I am!
I've put it into our current sprint so we know it's a thing to pay attention to. Let us know if you need anything or are feeling stuck!
Sounds good! I tried to assign it to myself but didn't have permissions
Okay, I moved the detailed notes from PR #2320 into this issue, and also merged dev into the pandas-2.0 branch so it's up to date.
@cmgosnell A couple of the issues above are coming from this statement in pudl.helpers.generate_rolling_avg(), where it's trying to take the mean() of all columns in the groupby() even though many of them aren't numeric. Do you have any recollection of what's supposed to be going on here? Honestly it seems kinda weird that this ever worked.
```python
# merge the date range and the groups together
# to get the backbone/complete date range/groups
bones = (
    date_range.merge(groups)
    .drop(columns="tmp")  # drop the temp column
    .merge(df, on=group_cols + ["report_date"])
    .set_index(group_cols + ["report_date"])
    .groupby(by=group_cols + ["report_date"])
    # BUG: This mean() is operating on all columns, but they aren't all numeric
    # and some of the numeric columns are IDs... which doesn't seem right. With
    # pandas 2 it fails when trying to average strings and categoricals.
    .mean()
)
```
Edit: turns out we just needed to use mean(numeric_only=True) to preserve the prior behavior.
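As a quick illustration of the fix (a toy frame with hypothetical column names standing in for the real backbone dataframe):

```python
import pandas as pd

# Hypothetical mini-frame: a string column sits alongside the numeric
# column we actually want averaged, as in the merged "bones" dataframe.
df = pd.DataFrame(
    {
        "plant_id": [1, 1, 2, 2],
        "report_date": pd.to_datetime(["2020-01-01"] * 2 + ["2020-02-01"] * 2),
        "fuel_type": ["coal", "coal", "gas", "gas"],  # non-numeric
        "capacity_mw": [10.0, 20.0, 30.0, 50.0],
    }
)

# Under pandas 2, a bare .mean() raises a TypeError because it tries to
# average the string column too; numeric_only=True silently drops it.
avg = (
    df.groupby(["plant_id", "report_date"])
    .mean(numeric_only=True)
    .reset_index()
)
print(avg["capacity_mw"].tolist())  # [15.0, 40.0]
```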
@aesharpe AFAIK the one remaining (known) blocker on pandas 2.0 is coming from this block of code in the small plants table:
```python
header_groups = df.groupby(
    [
        "utility_id_ferc1",
        "report_year",
        (df["row_type"] == "header").cumsum(),
    ]
)
```
Which is producing:

ValueError: 'utility_id_ferc1' is both an index level and a column label, which is ambiguous.

Not sure why, and I wasn't clear on what's going on with the use of the cumsum() as one of the groupby columns. I haven't dug into it yet.
The cumsum() is basically creating a groupby index based on where there are identified header rows. The final groupby objects will be groups of utilities, years, and header groups within those. For example:
| utility_id_ferc1 | report_year | row_type |
|---|---|---|
| 1 | 2020 | header |
| 1 | 2020 | NA |
| 1 | 2020 | NA |
| 1 | 2020 | header |
| 1 | 2020 | NA |

Becomes the following groups:

| utility_id_ferc1 | report_year | row_type |
|---|---|---|
| 1 | 2020 | header |
| 1 | 2020 | NA |
| 1 | 2020 | NA |

and

| utility_id_ferc1 | report_year | row_type |
|---|---|---|
| 1 | 2020 | header |
| 1 | 2020 | NA |
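The grouping trick above can be sketched directly (a toy frame mirroring the example tables; the real transform carries many more columns):

```python
import pandas as pd

# Hypothetical frame: two header rows, each introducing a block of data rows.
df = pd.DataFrame(
    {
        "utility_id_ferc1": [1, 1, 1, 1, 1],
        "report_year": [2020] * 5,
        "row_type": ["header", None, None, "header", None],
    }
)

# Each "header" row bumps the cumulative sum, so every row up to (but not
# including) the next header shares one group number.
header_id = (df["row_type"] == "header").cumsum()
print(header_id.tolist())  # [1, 1, 1, 2, 2]

# Mixing column labels with a computed Series as groupby keys.
groups = df.groupby(["utility_id_ferc1", "report_year", header_id])
print(groups.size().tolist())  # [3, 2]
```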
My guess with the ValueError is that this index labeling thing is getting in the way. I'm not sure how much index labeling was a part of prior pandas versions, but it seems like it has something to do with the utility_id_ferc1 column being used as both an index and a column, or switching back and forth. Would probably have to read more about index labels before fixing.
Ah, it turns out the problem was simpler. For some reason utility_id_ferc1 and report_year were both showing up as the index of the dataframe and as columns. reset_index(drop=True) got rid of the conflict.
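A minimal reproduction of both the ambiguity error and the reset_index(drop=True) fix (hypothetical values, not the real FERC data):

```python
import pandas as pd

# Hypothetical frame where "utility_id_ferc1" and "report_year" appear both
# as index levels and as columns, mimicking the state that caused the error.
df = pd.DataFrame(
    {"utility_id_ferc1": [1, 1, 2], "report_year": [2020, 2020, 2020]}
)
df.index = pd.MultiIndex.from_frame(df[["utility_id_ferc1", "report_year"]])

# Grouping by the label is now ambiguous: index level or column?
try:
    df.groupby(["utility_id_ferc1", "report_year"]).size()
    ambiguous = False
except ValueError as err:
    ambiguous = True
    print(err)  # 'utility_id_ferc1' is both an index level and a column label...

# Dropping the redundant index removes the conflict.
sizes = (
    df.reset_index(drop=True)
    .groupby(["utility_id_ferc1", "report_year"])
    .size()
)
print(sizes.tolist())  # [2, 1]
```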
Work out dependency issues and update PUDL to work with pandas 2.0. Major improvements include the option to back dataframes with Arrow arrays, providing much richer data types and perfect compatibility with PyArrow-generated Parquet file outputs.
Known Issues

The following errors were generated when attempting to run the ETL on the pandas-2.0 branch in PR #2320.

Mixed date formats
Addressed by using pd.to_datetime(format="mixed"). Affects test/unit/transform/classes_test.py.
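For reference, the new pandas 2 behavior looks like this (hypothetical date strings):

```python
import pandas as pd

# A column mixing date formats, as found in some raw inputs.
dates = pd.Series(["2020-01-15", "03/20/2020"])

# pandas 2 no longer silently infers a separate format per element;
# passing format="mixed" restores per-element parsing.
parsed = pd.to_datetime(dates, format="mixed")
print(parsed.dt.month.tolist())  # [1, 3]
```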
Stack Trace

```
____________ ERROR collecting test/unit/transform/classes_test.py ____________
test/unit/transform/classes_test.py:354: in
```

Issue turning old DBF data into Excel file objects for extraction
AttributeError: 'XlsxWriter' object has no attribute 'save'
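For context, pandas 2.0 removed ExcelWriter.save(); calling close() (or using the writer as a context manager) now finalizes the workbook. A sketch of that change, not necessarily the fix PUDL adopted:

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# ExcelWriter.save() was deprecated in pandas 1.5 and removed in 2.0;
# close() writes and finalizes the workbook instead.
buffer = io.BytesIO()
writer = pd.ExcelWriter(buffer, engine="xlsxwriter")
df.to_excel(writer, index=False)
writer.close()  # was writer.save() before pandas 2.0

print(len(buffer.getvalue()) > 0)  # True
```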
Stack Trace
```
AttributeError: 'XlsxWriter' object has no attribute 'save'
Stack Trace:
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
    yield
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_utils/__init__.py", line 472, in iterate_with_context
    next_output = next(iterator)
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/compute_generator.py", line 122, in _coerce_solid_compute_fn_to_iterator
    result = invoke_compute_fn(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/compute_generator.py", line 116, in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
  File "/Users/zane/code/catalyst/pudl/src/pudl/extract/eia860.py", line 105, in extract_eia860
    eia860_raw_dfs = Extractor(ds).extract(year=eia_settings.eia860.years)
  File "/Users/zane/code/catalyst/pudl/src/pudl/extract/excel.py", line 255, in extract
    self.load_excel_file(page, **partition),
  File "/Users/zane/code/catalyst/pudl/src/pudl/extract/excel.py", line 338, in load_excel_file
    excel_file = pudl.helpers.convert_df_to_excel_file(df, index=False)
  File "/Users/zane/code/catalyst/pudl/src/pudl/helpers.py", line 1562, in convert_df_to_excel_file
    writer.save()

The above exception occurred during handling of the following exception:

KeyError: "No resources found for eia860: {'name': 'GENY01.dbf'}"
Stack Trace:
  File "/Users/zane/code/catalyst/pudl/src/pudl/extract/excel.py", line 324, in load_excel_file
    res = self.ds.get_unique_resource(
  File "/Users/zane/code/catalyst/pudl/src/pudl/workspace/datastore.py", line 383, in get_unique_resource
    raise KeyError(f"No resources found for {dataset}: {filters}")

The above exception occurred during handling of the following exception:

StopIteration
Stack Trace:
  File "/Users/zane/code/catalyst/pudl/src/pudl/workspace/datastore.py", line 381, in get_unique_resource
    _, content = next(res)
```

df.replace() fails with null input

Fixed with a workaround, and reported as a pandas bug. df.replace() within fix_eia_na() fails with ValueError: cannot call `vectorize` on size 0 inputs in the following assets:
demand_response_eia861
energy_efficiency_eia861
sales_eia861
non_net_metering_eia861
service_territory_eia861
reliability_eia861
utility_data_eia861
advanced_metering_infrastructure_eia861
clean_balancing_authority_eia861
distribution_systems_eia861
operational_data_eia861
net_metering_eia861
mergers_eia861
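One possible shape for the workaround, guarding the replace against empty inputs (fix_na() here is a hypothetical stand-in for fix_eia_na(); the actual PUDL workaround may differ):

```python
import pandas as pd

def fix_na(df: pd.DataFrame) -> pd.DataFrame:
    """Replace sentinel strings with NA, skipping empty frames.

    Empty inputs trip a np.vectorize bug in pandas 2.0's regex
    replace path, so we return them unchanged.
    """
    if df.empty:
        return df
    return df.replace(to_replace=[r"^\.$", r"^\s*$"], value=pd.NA, regex=True)

# An empty frame passes through untouched instead of raising.
empty = pd.DataFrame({"x": pd.Series([], dtype="string")})
print(fix_na(empty).shape)  # (0, 1)

# A non-empty frame gets its "." sentinel nulled out.
df = pd.DataFrame({"x": ["1", "."]})
print(fix_na(df)["x"].isna().tolist())  # [False, True]
```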
Stack Trace
```
ValueError: cannot call `vectorize` on size 0 inputs unless `otypes` is set
Stack Trace:
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
    yield
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_utils/__init__.py", line 472, in iterate_with_context
    next_output = next(iterator)
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/compute_generator.py", line 122, in _coerce_solid_compute_fn_to_iterator
    result = invoke_compute_fn(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/compute_generator.py", line 116, in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
  File "/Users/zane/code/catalyst/pudl/src/pudl/transform/eia861.py", line 1213, in demand_response_eia861
    raw_dr = _pre_process(raw_demand_response_eia861)
  File "/Users/zane/code/catalyst/pudl/src/pudl/transform/eia861.py", line 544, in _pre_process
    fix_eia_na(df)
  File "/Users/zane/code/catalyst/pudl/src/pudl/helpers.py", line 969, in fix_eia_na
    return df.replace(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/frame.py", line 5575, in replace
    return super().replace(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/generic.py", line 7346, in replace
    new_data = self._mgr.replace_list(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 493, in replace_list
    bm = self.apply(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 349, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/internals/blocks.py", line 770, in replace_list
    for i, ((src, dest), mask) in enumerate(zip(pairs, masks)):
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/internals/blocks.py", line 748, in
```

Ambiguous column and row label in plants_small_ferc1

In the plants_small_ferc1 transform we're getting ValueError: 'utility_id_ferc1' is both an index level and a column label, which is ambiguous.
Stack Trace
```
ValueError: 'utility_id_ferc1' is both an index level and a column label, which is ambiguous.
Stack Trace:
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
    yield
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_utils/__init__.py", line 472, in iterate_with_context
    next_output = next(iterator)
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/compute_generator.py", line 122, in _coerce_solid_compute_fn_to_iterator
    result = invoke_compute_fn(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/compute_generator.py", line 116, in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
  File "/Users/zane/code/catalyst/pudl/src/pudl/transform/ferc1.py", line 3785, in ferc1_transform_asset
    df = transformer.transform(
  File "/Users/zane/code/catalyst/pudl/src/pudl/transform/classes.py", line 1189, in transform
    .pipe(self.transform_main)
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/generic.py", line 5918, in pipe
    return common.pipe(self, func, *args, **kwargs)
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/common.py", line 518, in pipe
    return func(obj, *args, **kwargs)
  File "/Users/zane/code/catalyst/pudl/src/pudl/transform/classes.py", line 1022, in _wrapper
    df = func(self, *args, **kwargs)
  File "/Users/zane/code/catalyst/pudl/src/pudl/transform/ferc1.py", line 2165, in transform_main
    .pipe(self.prep_header_fuel_and_plant_types)
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/generic.py", line 5918, in pipe
    return common.pipe(self, func, *args, **kwargs)
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/common.py", line 518, in pipe
    return func(obj, *args, **kwargs)
  File "/Users/zane/code/catalyst/pudl/src/pudl/transform/ferc1.py", line 2778, in prep_header_fuel_and_plant_types
    header_groups = df.groupby(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/frame.py", line 8241, in groupby
    return DataFrameGroupBy(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 930, in __init__
    grouper, exclusions, obj = get_grouper(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/groupby/grouper.py", line 975, in get_grouper
    obj._check_label_or_level_ambiguity(gpr, axis=axis)
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/generic.py", line 1734, in _check_label_or_level_ambiguity
    raise ValueError(msg)
```

Census DP1 SQLAlchemy syntax
Stack Trace
```
.env_tox/lib/python3.11/site-packages/pudl/output/censusdp1tract.py:72: in census_layer
    return get_layer(layer, dp1_engine)
.env_tox/lib/python3.11/site-packages/pudl/output/censusdp1tract.py:43: in get_layer
    df = pd.read_sql(
.env_tox/lib/python3.11/site-packages/pandas/io/sql.py:663: in read_sql
    return pandas_sql.read_query(
.env_tox/lib/python3.11/site-packages/pandas/io/sql.py:1738: in read_query
    result = self.execute(sql, params)
.env_tox/lib/python3.11/site-packages/pandas/io/sql.py:1562: in execute
    return self.con.exec_driver_sql(sql, *args)
.env_tox/lib/python3.11/site-packages/sqlalchemy/engine/base.py:1768: in exec_driver_sql
    args_10style, kwargs_10style = _distill_params_20(parameters)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
params = ['county_2010census_dp1']

    def _distill_params_20(params):
        if params is None:
            return _no_tuple, _no_kw
        elif isinstance(params, list):
            # collections_abc.MutableSequence): # avoid abc.__instancecheck__
            if params and not isinstance(
                params[0], (collections_abc.Mapping, tuple)
            ):
>               raise exc.ArgumentError(
                    "List argument must consist only of tuples or dictionaries"
                )
E   sqlalchemy.exc.ArgumentError: List argument must consist only of tuples or dictionaries
```
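A hedged sketch of the usual remedy for this SQLAlchemy error: bind parameters through sqlalchemy.text() rather than passing a bare list of scalars (the table and column names here are illustrative, not the real Census DP1 schema):

```python
import pandas as pd
import sqlalchemy as sa

# In-memory SQLite database standing in for the Census DP1 database.
engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE layers (name TEXT)"))
    conn.execute(sa.text("INSERT INTO layers VALUES ('county_2010census_dp1')"))

# Newer SQLAlchemy rejects a plain list of scalar params passed through
# exec_driver_sql(); a text() clause with named bind parameters avoids it.
df = pd.read_sql(
    sa.text("SELECT * FROM layers WHERE name = :name"),
    engine,
    params={"name": "county_2010census_dp1"},
)
print(df["name"].tolist())  # ['county_2010census_dp1']
```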