catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Update PUDL to be compatible with pandas 2.0 #2394

Closed zaneselvans closed 10 months ago

zaneselvans commented 1 year ago

Work out dependency issues and update PUDL to work with pandas 2.0. Major improvements include the option to back dataframes with Arrow arrays, providing much richer data types and perfect compatibility with PyArrow generated Parquet file outputs.

- [x] Update `ferc-xbrl-extractor` to work with / allow pandas 2.0
- [x] Update `recordlinkage` to a version that works with pandas 2.0.
- [x] Extracting `eia860` (but **not** `eia923` or `eia861`) we get `AttributeError: 'XlsxWriter' object has no attribute 'save'` Seems to be related to streaming the DBF data as an Excel "file" in 2001-2003.
- [x] Failure in `df.replace()` within `fix_eia_na()` with `raise ValueError('cannot call `vectorize` on size 0 inputs '`
- [x] Census DP1 seems to have SQLAlchemy syntax issues
- [x] Replace deprecated `df.append()` with `pd.concat()` in `boiler_emissions_control_equipment_assn`
- [x] Convert `None` to numeic NA in `utility_plant_summary_ferc1.calculated_amount` when reconciling calcs
- [x] `clean_hourly_demand_matrix_ferc714` `pd.option_context("display.max_colwidth", -1)` need positive integer?
- [x] `fipsified_respondents_ferc714` failed merge on `object` and `datetime64[s]` columns
- [x] `income_statement_ferc1` divide by zero: had None rather than numeric NA in calculated amounts.
- [x] gen fuel by gen esc yearly: can't cast dtype('<M8[ns]') to dtype('<M8[s]')
- [x] gen fuel by gen esc monthly: cannot reindex on axis with duplicate labels. `plant_year` index in `distribute_annually_reported_data_to_months_if_annual()` is apparently non-unique.
- [x] `denorm_plants_steam_ferc1` `generate_rolling_avg` applies `mean()` to non-numeric columns.
- [x] `denorm_fuel_receipts_costs_eia923` `generate_rolling_avg` applies `mean()` to Categorical column.
- [x] in `plants_small_ferc1` transform: `ValueError: 'utility_id_ferc1' is both an index level and a column label, which is ambiguous.`
- [x] Re-run full ETL with all fixes applied.
- [x] Run full integration tests
- [x] Run data validations locally on the full DB.
- [x] Fix mixed-date format column in tests if data validations pass without exercising that behavior.

Known Issues

The following errors were generated when attempting to run the ETL on the pandas-2.0 branch in PR #2320.

Mixed date formats

Stack Trace ``` ____________________________________________________________________________________________ ERROR collecting test/unit/transform/classes_test.py ____________________________________________________________________________________________ test/unit/transform/classes_test.py:354: in ).astype( ../../../mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/generic.py:6297: in astype res_col = col.astype(dtype=cdt, copy=copy, errors=errors) ../../../mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/generic.py:6316: in astype new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors) ../../../mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/internals/managers.py:448: in astype return self.apply( ../../../mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/internals/managers.py:349: in apply applied = getattr(b, f)(**kwargs) ../../../mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/internals/blocks.py:511: in astype new_values = astype_array_safe(values, dtype, copy=copy, errors=errors) ../../../mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/dtypes/astype.py:242: in astype_array_safe new_values = astype_array(values, dtype, copy=copy) ../../../mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/dtypes/astype.py:187: in astype_array values = _astype_nansafe(values, dtype, copy=copy) ../../../mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/dtypes/astype.py:114: in _astype_nansafe dti = to_datetime(arr.ravel()) ../../../mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/tools/datetimes.py:1079: in to_datetime result = convert_listlike(argc, format) ../../../mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/tools/datetimes.py:450: in _convert_listlike_datetimes return _array_strptime_with_fallback(arg, name, utc, format, exact, errors) ../../../mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/tools/datetimes.py:481: in _array_strptime_with_fallback result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors, utc=utc) pandas/_libs/tslibs/strptime.pyx:530: in pandas._libs.tslibs.strptime.array_strptime ??? pandas/_libs/tslibs/strptime.pyx:351: in pandas._libs.tslibs.strptime.array_strptime ??? E ValueError: time data "2020-01-02" doesn't match format "%Y-%m-%dT%H:%M:%S.%f", at position 1. You might want to try: E - passing `format` if your strings have a consistent format; E - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format; E - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.: Error while type casting for column 'report_date' ```

Issue turning old DBF data into Excel file objects for extraction

Stack Trace ``` AttributeError: 'XlsxWriter' object has no attribute 'save' Stack Trace: File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary yield File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_utils/__init__.py", line 472, in iterate_with_context next_output = next(iterator) ^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/compute_generator.py", line 122, in _coerce_solid_compute_fn_to_iterator result = invoke_compute_fn( ^^^^^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/compute_generator.py", line 116, in invoke_compute_fn return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/code/catalyst/pudl/src/pudl/extract/eia860.py", line 105, in extract_eia860 eia860_raw_dfs = Extractor(ds).extract(year=eia_settings.eia860.years) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/code/catalyst/pudl/src/pudl/extract/excel.py", line 255, in extract self.load_excel_file(page, **partition), ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/code/catalyst/pudl/src/pudl/extract/excel.py", line 338, in load_excel_file excel_file = pudl.helpers.convert_df_to_excel_file(df, index=False) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/code/catalyst/pudl/src/pudl/helpers.py", line 1562, in convert_df_to_excel_file writer.save() ^^^^^^^^^^^ The above exception occurred during handling of the following exception: KeyError: "No resources found for eia860: {'name': 'GENY01.dbf'}" Stack Trace: File "/Users/zane/code/catalyst/pudl/src/pudl/extract/excel.py", line 324, in load_excel_file res = self.ds.get_unique_resource( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/code/catalyst/pudl/src/pudl/workspace/datastore.py", line 383, in get_unique_resource raise KeyError(f"No resources found for {dataset}: {filters}") The above exception occurred during handling of the following exception: StopIteration Stack Trace: File "/Users/zane/code/catalyst/pudl/src/pudl/workspace/datastore.py", line 381, in get_unique_resource _, content = next(res) ^^^^^^^^^ ```

df.replace() fails with null input

Fixed with a workaround, and reported as a pandas bug

Stack Trace ``` ValueError: cannot call `vectorize` on size 0 inputs unless `otypes` is set Stack Trace: File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary yield File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_utils/__init__.py", line 472, in iterate_with_context next_output = next(iterator) ^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/compute_generator.py", line 122, in _coerce_solid_compute_fn_to_iterator result = invoke_compute_fn( ^^^^^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/compute_generator.py", line 116, in invoke_compute_fn return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass) ^^^^^^^^^^^^^^^^^^ File "/Users/zane/code/catalyst/pudl/src/pudl/transform/eia861.py", line 1213, in demand_response_eia861 raw_dr = _pre_process(raw_demand_response_eia861) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/code/catalyst/pudl/src/pudl/transform/eia861.py", line 544, in _pre_process fix_eia_na(df) File "/Users/zane/code/catalyst/pudl/src/pudl/helpers.py", line 969, in fix_eia_na return df.replace( ^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/frame.py", line 5575, in replace return super().replace( ^^^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/generic.py", line 7346, in replace new_data = self._mgr.replace_list( ^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 493, in replace_list bm = self.apply( ^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 349, in apply applied = getattr(b, f)(**kwargs) ^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/internals/blocks.py", line 770, in replace_list for i, ((src, dest), mask) in enumerate(zip(pairs, masks)): File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/internals/blocks.py", line 748, in compare_or_regex_search( File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/array_algos/replace.py", line 96, in compare_or_regex_search result = op(a) ^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/numpy/lib/function_base.py", line 2329, in __call__ return self._vectorize_call(func=func, args=vargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/numpy/lib/function_base.py", line 2407, in _vectorize_call ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/numpy/lib/function_base.py", line 2363, in _get_ufunc_and_otypes raise ValueError('cannot call `vectorize` on size 0 inputs ' ```

Ambiguous column and row label in plants_small_ferc1

Stack Trace ``` ValueError: 'utility_id_ferc1' is both an index level and a column label, which is ambiguous. Stack Trace: File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary yield File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_utils/__init__.py", line 472, in iterate_with_context next_output = next(iterator) ^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/compute_generator.py", line 122, in _coerce_solid_compute_fn_to_iterator result = invoke_compute_fn( ^^^^^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/dagster/_core/execution/plan/compute_generator.py", line 116, in invoke_compute_fn return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass) ^^^^^^^^^^^^^^^^^^ File "/Users/zane/code/catalyst/pudl/src/pudl/transform/ferc1.py", line 3785, in ferc1_transform_asset df = transformer.transform( ^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/code/catalyst/pudl/src/pudl/transform/classes.py", line 1189, in transform .pipe(self.transform_main) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/generic.py", line 5918, in pipe return common.pipe(self, func, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/common.py", line 518, in pipe return func(obj, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/code/catalyst/pudl/src/pudl/transform/classes.py", line 1022, in _wrapper df = func(self, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/code/catalyst/pudl/src/pudl/transform/ferc1.py", line 2165, in transform_main .pipe(self.prep_header_fuel_and_plant_types) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/generic.py", line 5918, in pipe return common.pipe(self, func, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/common.py", line 518, in pipe return func(obj, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/code/catalyst/pudl/src/pudl/transform/ferc1.py", line 2778, in prep_header_fuel_and_plant_types header_groups = df.groupby( ^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/frame.py", line 8241, in groupby return DataFrameGroupBy( ^^^^^^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 930, in __init__ grouper, exclusions, obj = get_grouper( ^^^^^^^^^^^^ File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/groupby/grouper.py", line 975, in get_grouper obj._check_label_or_level_ambiguity(gpr, axis=axis) File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/generic.py", line 1734, in _check_label_or_level_ambiguity raise ValueError(msg) ```

Census DP1 SQLAlchemy syntax

Stack Trace ``` .env_tox/lib/python3.11/site-packages/pudl/output/censusdp1tract.py:72: in census_layer return get_layer(layer, dp1_engine) .env_tox/lib/python3.11/site-packages/pudl/output/censusdp1tract.py:43: in get_layer df = pd.read_sql( .env_tox/lib/python3.11/site-packages/pandas/io/sql.py:663: in read_sql return pandas_sql.read_query( .env_tox/lib/python3.11/site-packages/pandas/io/sql.py:1738: in read_query result = self.execute(sql, params) .env_tox/lib/python3.11/site-packages/pandas/io/sql.py:1562: in execute return self.con.exec_driver_sql(sql, *args) .env_tox/lib/python3.11/site-packages/sqlalchemy/engine/base.py:1768: in exec_driver_sql args_10style, kwargs_10style = _distill_params_20(parameters) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ params = ['county_2010census_dp1'] def _distill_params_20(params): if params is None: return _no_tuple, _no_kw elif isinstance(params, list): # collections_abc.MutableSequence): # avoid abc.__instancecheck__ if params and not isinstance( params[0], (collections_abc.Mapping, tuple) ): > raise exc.ArgumentError( "List argument must consist only of tuples or dictionaries" ) E sqlalchemy.exc.ArgumentError: List argument must consist only of tuples or dictionaries ```
zaneselvans commented 1 year ago

I think @nelsonauner is working on this.

nelsonauner commented 1 year ago

Yeah - I am!

zaneselvans commented 1 year ago

I've put it into our current sprint so we know it's a thing to pay attention to. Let us know if you need anything or are feeling stuck!

nelsonauner commented 1 year ago

Sounds good! I tried to assign it to myself but didn't have permissions

zaneselvans commented 1 year ago

Okay I moved the detailed notes from the PR #2320 into this issue, and also merged dev into the pandas-2.0 branch so it's up to date.

zaneselvans commented 10 months ago

@cmgosnell A couple of the issues above are coming from this statement in pudl.helpers.generate_rolling_avg() where it's trying to take the mean() of all columns in the groupby() even though many of them aren't numeric. Do you have any recollection of what's supposed to be going on here? Honestly it seems kinda weird that this ever worked.

    # merge the date range and the groups together
    # to get the backbone/complete date range/groups
    bones = (
        date_range.merge(groups)
        .drop(columns="tmp")  # drop the temp column
        .merge(df, on=group_cols + ["report_date"])
        .set_index(group_cols + ["report_date"])
        .groupby(by=group_cols + ["report_date"])
        # BUG: This mean() is operating on all columns, but they aren't all numeric
        # and some of the numeric columns are IDs... which doesn't seem right. With
        # pandas 2 it fails when trying to average strings and categoricals.
        .mean()
    )

edit: turns out we just needed to use mean(numeric_only=True) to ensure the prior behavior.

zaneselvans commented 10 months ago

@aesharpe AFAIK the one remaining (known) blocker on pandas 2.0 is coming from this block of code in the small plants table:`

        header_groups = df.groupby(
            [
                "utility_id_ferc1",
                "report_year",
                (df["row_type"] == "header").cumsum(),
            ]
        )

Which is producing:

 ValueError: 'utility_id_ferc1' is both an index level and a column label, which is ambiguous.

Not sure why, and I wasn't clear on what's going on with the use of the cumsum() as one of the groupby columns. I haven't dug into it yet.

aesharpe commented 10 months ago

@aesharpe AFAIK the one remaining (known) blocker on pandas 2.0 is coming from this block of code in the small plants table:`

        header_groups = df.groupby(
            [
                "utility_id_ferc1",
                "report_year",
                (df["row_type"] == "header").cumsum(),
            ]
        )

Which is producing:

 ValueError: 'utility_id_ferc1' is both an index level and a column label, which is ambiguous.

Not sure why, and I wasn't clear on what's going on with the use of the cumsum() as one of the groupby columns. I haven't dug into it yet.

The cumsum() is basically creating a groupby index based on where there are identified header rows. The final groupby objects will be groups of utilities, years, and header groups within those. For example:

utility_id_ferc1 report_year row_type
1 2020 header
1 2020 NA
1 2020 NA
1 2020 header
1 2020 NA

Becomes the following groups:

utility_id_ferc1 report_year row_type
1 2020 header
1 2020 NA
1 2020 NA

and

utility_id_ferc1 report_year row_type
1 2020 header
1 2020 NA

My guess with the ValueError is that this index labeling thing is getting in the way. I'm not sure how much index labeling was a part of the prior pandas, but it seems like the it has something to do with the utility_id_ferc1 columns being used as an index and a column or switching back and forth. Would probably have to read more about index labels before fixing.

zaneselvans commented 10 months ago

Ah it turns out the problem was simpler. For some reason utility_id_ferc1 and report_year were both showing up as the index of the dataframe, and as columns. reset_index(drop=True) got rid of the conflict.