catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Debug issues with pandas 1.3.x #1070

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

Several test / build failures appear when using pandas v1.3.0, e.g. #1056 (having to do with the way errors are reported in data package validation) and #1057 (some kind of pyarrow import error while running the docs build). Pinning to v1.2.x avoids the issues, but we should figure out why stuff is suddenly breaking.

We should also really set up our requirements in a form that Dependabot can read, so that we catch these issues as soon as they arise. See #986.

Known pandas 1.3.x issues:

zaneselvans commented 2 years ago

Parsing datapackage validation errors

The proximate problem here (reported in #1056) has to do with changes in the way data package validation errors are reported. This was easy to hack around by parsing the JSON slightly differently.

Assigning standard data types to the BGA table

But the error reports were being triggered by a real error -- it turns out we weren't imposing standardized data types on the BGA table, which is created in the harvesting process, after data types have been imposed on the data tables. After entity harvesting & resolution we impose types on the entity tables, but we weren't standardizing the BGA table, and something about how pandas manages data types apparently changed between 1.2.5 and 1.3.2, with the consequence that the plant_id_eia values in the BGA table were showing up as floats. This was easy to fix by applying standard data types to the BGA table at the end of its transform / creation function. However, there appear to be (at least) two additional problems now.
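
For illustration, a minimal sketch of this kind of fix. The dtype map here is hypothetical -- pudl derives the canonical types from its table metadata -- but the principle is just to re-impose them as the last step of the transform:

    import pandas as pd

    # Hypothetical dtype map; pudl derives the real canonical types from its
    # table metadata. Nullable Int64 keeps IDs from decaying into floats.
    BGA_DTYPES = {
        "plant_id_eia": "Int64",
        "generator_id": "string",
        "boiler_id": "string",
    }

    def standardize_bga_dtypes(bga_df: pd.DataFrame) -> pd.DataFrame:
        """Apply canonical dtypes as the last step of the BGA transform."""
        return bga_df.astype(BGA_DTYPES)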

Import error during docs build

There's some kind of version parsing issue coming up when Sphinx attempts to generate the documentation. Not sure what's going on there. It seems to involve some compatibility check between pandas and pyarrow. Here's the stack trace:

Warning, treated as error:
autodoc: failed to import module 'pudl'; the following exception was raised:
Traceback (most recent call last):
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/sphinx/ext/autodoc/importer.py", line 70, in import_module
    return importlib.import_module(modname)
  File "/home/zane/miniconda3/envs/pudl-dev/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/pudl/__init__.py", line 8, in <module>
    import pudl.analysis.allocate_net_gen
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/pudl/analysis/allocate_net_gen.py", line 95, in <module>
    import pandas as pd
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat import (
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/pandas/compat/__init__.py", line 23, in <module>
    from pandas.compat.pyarrow import (
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/pandas/compat/pyarrow.py", line 9, in <module>
    _palv = Version(_pa_version)
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/pandas/util/version/__init__.py", line 339, in __init__
    match = self._regex.search(version)
TypeError: expected string or bytes-like object

NERC Region inconsistency

Then I tried to run a full ETL on all the years and got a consistency error in the entity harvesting / resolution process, using exactly the same data and settings as we used for the v0.4.0 release. So I'm guessing this is also somehow linked to changes in pandas, and potentially how it manages data types... The failed assertion:

AssertionError: Harvesting of nerc_region is too inconsistent at 0.601.

I wonder how many additional issues will come up...

zaneselvans commented 2 years ago

The issue with the Sphinx doc build appears to be due to the fact that the pyarrow module is mocked for building on ReadTheDocs, but pandas checks whether the installed version of pyarrow is compatible, and to do that it needs to be able to look up the version of the package, which it can't do because it's mocked. Commenting pyarrow out of the list of mocked packages makes the error go away locally, but then the docs build will still fail on RTD because the snappy library isn't installed there.
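
A minimal sketch of the failure mode (the regex here is a stand-in, not pandas' actual version pattern):

    import re
    from unittest import mock

    # pandas effectively runs a regex over pyarrow.__version__ to parse it.
    # When pyarrow is mocked, that "version" is a Mock object rather than a
    # string, and re raises the same TypeError seen in the docs build:
    version_regex = re.compile(r"\d+(\.\d+)*")  # stand-in pattern
    mocked_version = mock.Mock(name="pyarrow.__version__")
    try:
        version_regex.search(mocked_version)
    except TypeError as err:
        print(err)  # expected string or bytes-like object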

zaneselvans commented 2 years ago

RTD now lets you specify apt packages to install within the Docker container that builds the docs, so I updated our .readthedocs.yaml file to install libsnappy-dev. Now we don't need to mock any packages, and this all works.
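
The relevant .readthedocs.yaml stanza looks something like this (a sketch, not necessarily our exact file):

    version: 2

    build:
      apt_packages:
        - libsnappy-dev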

zaneselvans commented 2 years ago

It seems like the problem with the NERC regions being inconsistent had to do with the fact that nerc_region is a categorical value, and when you df.groupby() on a categorical, you have to set observed=True, otherwise every possible value of the categorical shows up in every group, generally with a bunch of Null values. Changing these groupby() calls in pudl.transform.eia._occurrence_consistency() seems to have fixed the problem:

    occur = (
        col_df
        .groupby(by=cols_to_consit, observed=True)
        .agg({'table': "count"})
        .reset_index()
        .rename(columns={'table': 'entity_occurences'})
    )

And a bit later:

    col_df = col_df.merge(occur, on=cols_to_consit)

    # determine how many instances of each of the records in col exist
    consist_df = (
        col_df
        .groupby(by=cols_to_consit + [col], observed=True)
        .agg({'table': 'count'})
        .reset_index()
        .rename(columns={'table': 'record_occurences'})
    )

However, it's mysterious why this would have been working previously...
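
A quick demonstration of the observed=True behavior on categoricals (toy data, not our actual tables):

    import pandas as pd

    df = pd.DataFrame({
        "nerc_region": pd.Categorical(
            ["MRO", "SERC"], categories=["MRO", "SERC", "WECC"]
        ),
        "table": ["plants_eia860", "plants_eia860"],
    })

    # With the default observed=False, every category appears in the output,
    # including WECC, which never occurs in the data (it gets a count of 0):
    print(df.groupby("nerc_region").agg({"table": "count"}))

    # With observed=True, only the categories actually present are grouped:
    print(df.groupby("nerc_region", observed=True).agg({"table": "count"}))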

zaneselvans commented 2 years ago

Once we get beyond the harvesting process, something goes wrong in the assignment of boiler-generator associations. In pudl.transform.eia._boiler_generator_assn() we're getting an assertion error:

AssertionError: Inconsistent inter-annual BGA assignment!

Looking at the warnings that are getting printed out above that, I see that in the midst of the BGA process, the plant_id_eia values are being treated as float values, which is wrong. Something is up in there.

2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=1004.0, unit_id_pudl=3, unit_id_eia=['1' 'G108' 'CT1']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=1904.0, unit_id_pudl=3, unit_id_eia=['HBR0' 'BDS0']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=1927.0, unit_id_pudl=2, unit_id_eia=['RIV0' 'HBR0']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=4040.0, unit_id_pudl=1, unit_id_eia=['PWG1' 'PWG2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=7242.0, unit_id_pudl=1, unit_id_eia=['CC1' 'CC2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=7757.0, unit_id_pudl=1, unit_id_eia=['CC1' 'CC2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=10725.0, unit_id_pudl=1, unit_id_eia=['F801' 'F802']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=50973.0, unit_id_pudl=1, unit_id_eia=['BLK1' 'BLK2' 'BLK3']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=55153.0, unit_id_pudl=1, unit_id_eia=['STG1' 'STG2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=55309.0, unit_id_pudl=1, unit_id_eia=['SMR1' 'SMR2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=55502.0, unit_id_pudl=1, unit_id_eia=['CC1' 'G801' 'CC2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=55701.0, unit_id_pudl=1, unit_id_eia=['CC1' 'G961']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=56041.0, unit_id_pudl=1, unit_id_eia=['NGS' 'MGS']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=56309.0, unit_id_pudl=1, unit_id_eia=['G401' 'G402']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=56350.0, unit_id_pudl=1, unit_id_eia=['BLK1' '115']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=56350.0, unit_id_pudl=2, unit_id_eia=['BLK2' '116']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=56998.0, unit_id_pudl=1, unit_id_eia=['43' 'PB4']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=56998.0, unit_id_pudl=2, unit_id_eia=['53' 'PB5']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=57666.0, unit_id_pudl=1, unit_id_eia=['1' '2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=57794.0, unit_id_pudl=1, unit_id_eia=['CC01' 'CC02']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=60786.0, unit_id_pudl=1, unit_id_eia=['4343' '4141']

There are a bunch of calls to astype() in pudl.transform.eia._boiler_generator_assn(), so it seemed likely that this is somehow a data type issue. However, cleaning those up to use the canonical types of the columns didn't fix the problem.

I note that a bunch of the unit_id_pudl values are getting set to zero, for the ones where there's no real Unit ID. That seems weird, given that in the current database, and in the current bga_eia860() output, there are no unit_id_pudl values that are zero. Shouldn't they be NA instead? I guess they're all getting dropped in the filter that happens at the end of the BGA function:

if not debug:
    bga_out = (
        bga_out[
            ~bga_out.missing_from_923
            & ~bga_out.plant_w_bad_generator
            & ~bga_out.unmapped_but_in_923
            & ~bga_out.unmapped
        ]
    )

zaneselvans commented 2 years ago

At this point my suspicion is that there's a drop_duplicates() that isn't behaving as expected because a dataframe has duplicate records with the same values encoded in different data types, like string vs. object, or dates as strings vs. dates as datetime objects, which would lead to multiple copies of the same boiler-generator pairs getting edges assigned between them in the connected subcomponent analysis. What are some ways we could test for this in debugging?

Where's the right place to look at this? Probably immediately before the NetworkX stuff happens.
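
One way to sniff for this -- a hypothetical debugging helper, with illustrative column names:

    import pandas as pd

    def dtype_dupe_check(df: pd.DataFrame, key_cols: list[str]) -> None:
        """Check whether duplicate keys are hiding behind mixed representations.

        If normalizing plant_id_eia to a canonical nullable integer changes
        the duplicate count, then rows that looked distinct differed only in
        representation, e.g. the strings "1004" vs "1004.0".
        """
        before = df.duplicated(subset=key_cols).sum()
        coerced = df.assign(
            plant_id_eia=pd.to_numeric(df["plant_id_eia"]).astype("Int64")
        )
        after = coerced.duplicated(subset=key_cols).sum()
        print(f"duplicate keys before coercion: {before}, after: {after}")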

Yes, it looks like the issue was that plant_id_eia was getting treated as a float in some places and an int in others, and when it was converted into a string for use in the boiler + generator IDs for consumption by NX, those representations resulted in different strings. God, we have to standardize all of the typing further upstream and stop with this perpetual futzing.
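
The failure mode in miniature, with toy values matching the warnings above:

    import pandas as pd

    ids_as_float = pd.Series([1004.0, 1904.0])  # plant_id_eia after a lossy step
    ids_as_int = pd.Series([1004, 1904])        # the same IDs, typed correctly

    # Stringifying for NetworkX node labels yields different labels for what
    # should be the same plants, so the graph grows spurious extra nodes:
    print(ids_as_float.astype(str).tolist())  # ['1004.0', '1904.0']
    print(ids_as_int.astype(str).tolist())    # ['1004', '1904']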

zaneselvans commented 2 years ago

Now we can get all the way through producing data package outputs, but datapackage validation fails, and returns no table-level errors. WTF? It turns out there's a default limit of 1000MB of memory usage in frictionless.validate_package(), which our validation is exceeding on some of the larger tables. Need to bump that up. The CSVs for all the data total about 500MB, and in the worst case the data can take up 20x that much memory as pandas dataframes, so 8-10GB should be enough.

Unfortunately, after increasing the memory limit and the number of rows that are validated... we get a KeyError from within @ezwelty's goodtables_pandas library. On a freshly output datapackage including ferc1, eia860, and eia923 data, I run:

report = goodtables.validate(
    str(datapkg_json),
    limit_memory="8GB",
    query={
        "limitRows": 100_000,
    }
)

and the stack trace that comes back:

~/miniconda3/envs/pudl-dev/lib/python3.9/site-packages/goodtables_pandas/validate.py in validate(source, source_type, return_tables, **options)
    131         # Parse table
    132         report["tables"][i]["scope"] += ["type-error"]
--> 133         result = parse_table(result, schema=resource.get("schema", {}))
    134         if isinstance(result, list):
    135             report["tables"][i]["errors"] += result

~/miniconda3/envs/pudl-dev/lib/python3.9/site-packages/goodtables_pandas/parse.py in parse_table(df, schema)
     29     errors = []
     30     for field in schema.get("fields", []):
---> 31         result = parse_field(df[field["name"]], **field)
     32         if isinstance(result, ValueTypeError):
     33             # HACK: Add field name to parsing error

~/miniconda3/envs/pudl-dev/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3453             if self.columns.nlevels > 1:
   3454                 return self._getitem_multilevel(key)
-> 3455             indexer = self.columns.get_loc(key)
   3456             if is_integer(indexer):
   3457                 indexer = [indexer]

~/miniconda3/envs/pudl-dev/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'record_id'

However, this doesn't happen if I only try to check 10,000 rows. The fact that it's 🤮ing on record_id means that it's having an issue with one of the FERC tables. The bit at the end about the isna() stuff makes me suspect that it's related to some of the changes in how nullable data types are treated / propagated in pandas 1.3.x.

Another odd thing I see is that with a smaller number of rows being checked, the (error-free) report that comes back has a suspiciously large number of perfectly round powers of 2 in the bytes field for the resources being validated: 8,192, 16,384, 32,768, 65,536...

zaneselvans commented 2 years ago

I think I may have gotten overly paranoid here when messing around with the goodtables_pandas validation parameters -- Ethan says the library runs with limit_rows=1 in order to check some headers and other non-data information using the frictionless.validate() function, and then does its own internal pandas-based validation of all of the data. So I think I saw that only a single row was getting checked, and upped the explicit number of rows, which was then getting passed through to frictionless.validate() and causing memory problems over there. Just calling goodtables_pandas.validate(datapkg_json) produces no errors, takes about 1 minute on a full ferc1 + eia output, and according to @ezwelty this should check all of the data.

The discrepancy between what's happening under the hood and what's getting reported in the return value is confusing though.