Closed zaneselvans closed 2 years ago
The proximate problem here (reported in #1057) has to do with changes in the way data package validation errors are reported. This was easy to hack around by parsing the JSON slightly differently.
But the error reports were being triggered by a real error -- it turns out that we weren't imposing standardized data types on the BGA table, which is created in the harvesting process, after data types have been imposed on the data tables. After entity harvesting & resolution we impose types on the entity tables, but we weren't standardizing the BGA table, and something about how pandas manages data types apparently changed between 1.2.5 and 1.3.2, with the consequence that the `plant_id_eia` values in the BGA table were showing up as floats. This was easy to fix by applying standard data types to the BGA table at the end of its transform / creation function. However, there appear to be (at least) two additional problems now.
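For illustration, a minimal sketch of what "applying standard data types" can look like. The `CANONICAL_DTYPES` map and the `apply_canonical_dtypes()` helper here are hypothetical stand-ins, not the project's actual type-enforcement machinery:

```python
import pandas as pd

# Hypothetical canonical type map -- a stand-in for the project's real
# standardized column types, not the actual PUDL definitions.
CANONICAL_DTYPES = {
    "plant_id_eia": "Int64",   # nullable integer, so NAs don't force float
    "boiler_id": "string",
    "generator_id": "string",
}

def apply_canonical_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Cast every known column in df to its canonical dtype."""
    known = {col: dt for col, dt in CANONICAL_DTYPES.items() if col in df.columns}
    return df.astype(known)

# A BGA-like frame where plant_id_eia has drifted to float:
bga = pd.DataFrame({"plant_id_eia": [1004.0, 57666.0], "boiler_id": ["1", "2"]})
bga = apply_canonical_dtypes(bga)
assert str(bga.plant_id_eia.dtype) == "Int64"
```

Running this at the end of the table's transform / creation function means downstream consumers never see the float-typed IDs.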
There's some kind of version parsing issue coming up when Sphinx attempts to generate the documentation. Not sure what's going on there. It seems to involve some compatibility check between pandas and pyarrow. Here's the stack trace:
```
Warning, treated as error:
autodoc: failed to import module 'pudl'; the following exception was raised:
Traceback (most recent call last):
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/sphinx/ext/autodoc/importer.py", line 70, in import_module
    return importlib.import_module(modname)
  File "/home/zane/miniconda3/envs/pudl-dev/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/pudl/__init__.py", line 8, in <module>
    import pudl.analysis.allocate_net_gen
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/pudl/analysis/allocate_net_gen.py", line 95, in <module>
    import pandas as pd
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat import (
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/pandas/compat/__init__.py", line 23, in <module>
    from pandas.compat.pyarrow import (
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/pandas/compat/pyarrow.py", line 9, in <module>
    _palv = Version(_pa_version)
  File "/home/zane/code/catalyst/pudl/.env_pudl/lib/python3.9/site-packages/pandas/util/version/__init__.py", line 339, in __init__
    match = self._regex.search(version)
TypeError: expected string or bytes-like object
```
Then I tried to run a full ETL on all the years, and got a consistency error in the entity harvesting / resolution process, using exactly the same data and settings as we used for the v0.4.0 release, so I'm guessing this is also somehow linked to changes in pandas and potentially how it manages data types... The failed assertion:
```
AssertionError: Harvesting of nerc_region is too inconsistent at 0.601.
```
I wonder how many additional issues will come up...
The issue with the Sphinx doc build appears to be due to the fact that the `pyarrow` module is mocked for building on ReadTheDocs, but pandas checks whether the installed version of `pyarrow` is compatible, and to do that it needs to be able to look up the package's version, which it can't do because the package is mocked. Commenting `pyarrow` out of the list of mocked packages makes the error go away locally, but then the docs build will fail on RTD because the `snappy` library isn't installed there.
RTD now lets you specify apt packages to install within the Docker container that builds the docs, so I updated our `.readthedocs.yaml` file to install `libsnappy-dev`, and now we don't need to mock any packages, so this all works.
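For reference, a minimal sketch of the relevant `.readthedocs.yaml` stanza (the `build.apt_packages` key comes from the RTD v2 config schema; the rest of our config is omitted here):

```yaml
version: 2
build:
  apt_packages:
    - libsnappy-dev
```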
It seems like the problem with the NERC regions being inconsistent had to do with the fact that `nerc_region` is a categorical value, and when you `df.groupby()` on a categorical you have to set `observed=True`, otherwise every possible value of the categorical shows up in every group, generally with a bunch of null values. Changing these `groupby()` calls in `pudl.transform.eia._occurrence_consistency()` seems to have fixed the problem:
```python
occur = (
    col_df
    .groupby(by=cols_to_consit, observed=True)
    .agg({'table': "count"})
    .reset_index()
    .rename(columns={'table': 'entity_occurences'})
)
```
And a bit later:

```python
.merge(occur, on=cols_to_consit)

# determine how many instances of each of the records in col exist
consist_df = (
    col_df
    .groupby(by=cols_to_consit + [col], observed=True)
    .agg({'table': 'count'})
    .reset_index()
    .rename(columns={'table': 'record_occurences'})
)
```
However, it's mysterious why this would have been working previously...
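A toy illustration of the `observed` behavior (the column names here only mimic the real tables):

```python
import pandas as pd

# Categorical column whose category set includes a region ("WECC")
# that never actually appears in the data.
df = pd.DataFrame({
    "plant_id_eia": [1, 1, 2],
    "nerc_region": pd.Categorical(
        ["MRO", "MRO", "SERC"], categories=["MRO", "SERC", "WECC"]
    ),
    "table": ["plants_eia860", "plants_entity_eia", "plants_eia860"],
})

# observed=False: every category shows up in every group, padding the
# result with zero-count rows that skew any occurrence-based ratio.
padded = df.groupby("nerc_region", observed=False).agg({"table": "count"})
assert len(padded) == 3   # MRO, SERC, and the never-observed WECC

# observed=True: only category values actually present in the data.
observed = df.groupby("nerc_region", observed=True).agg({"table": "count"})
assert len(observed) == 2  # just MRO and SERC
```

The zero-count `WECC` row in the padded result is exactly the kind of phantom record that would drag down a consistency ratio like the 0.601 reported above.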
Once we get beyond the harvesting process, something goes wrong in the assignment of boiler-generator associations. In `pudl.transform.eia._boiler_generator_assn()` we're getting an assertion error:
```
AssertionError: Inconsistent inter-annual BGA assignment!
```
Looking at the warnings that are getting printed out above that, I see that in the midst of the BGA process the `plant_id_eia` values are being treated as `float` values, which is wrong. Something is up in there.
```
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=1004.0, unit_id_pudl=3, unit_id_eia=['1' 'G108' 'CT1']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=1904.0, unit_id_pudl=3, unit_id_eia=['HBR0' 'BDS0']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=1927.0, unit_id_pudl=2, unit_id_eia=['RIV0' 'HBR0']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=4040.0, unit_id_pudl=1, unit_id_eia=['PWG1' 'PWG2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=7242.0, unit_id_pudl=1, unit_id_eia=['CC1' 'CC2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=7757.0, unit_id_pudl=1, unit_id_eia=['CC1' 'CC2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=10725.0, unit_id_pudl=1, unit_id_eia=['F801' 'F802']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=50973.0, unit_id_pudl=1, unit_id_eia=['BLK1' 'BLK2' 'BLK3']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=55153.0, unit_id_pudl=1, unit_id_eia=['STG1' 'STG2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=55309.0, unit_id_pudl=1, unit_id_eia=['SMR1' 'SMR2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=55502.0, unit_id_pudl=1, unit_id_eia=['CC1' 'G801' 'CC2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=55701.0, unit_id_pudl=1, unit_id_eia=['CC1' 'G961']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=56041.0, unit_id_pudl=1, unit_id_eia=['NGS' 'MGS']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=56309.0, unit_id_pudl=1, unit_id_eia=['G401' 'G402']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=56350.0, unit_id_pudl=1, unit_id_eia=['BLK1' '115']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=56350.0, unit_id_pudl=2, unit_id_eia=['BLK2' '116']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=56998.0, unit_id_pudl=1, unit_id_eia=['43' 'PB4']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=56998.0, unit_id_pudl=2, unit_id_eia=['53' 'PB5']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=57666.0, unit_id_pudl=1, unit_id_eia=['1' '2']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=57794.0, unit_id_pudl=1, unit_id_eia=['CC01' 'CC02']
2021-08-23 15:43:40 [ WARNING] pudl.transform.eia:784 Multiple EIA unit codes:plant_id_eia=60786.0, unit_id_pudl=1, unit_id_eia=['4343' '4141']
```
There are a bunch of calls to `astype()` in `pudl.transform.eia._boiler_generator_assn()`, so it seemed likely that this is a data type issue somehow. Cleaning those up to use the canonical types of the columns didn't fix the problem.
I note that a bunch of the `unit_id_pudl` values are getting set to zero for the records where there's no real unit ID. That seems weird, given that in the current database, and in the current `bga_eia860()` output, there are no `unit_id_pudl` values that are zero. Shouldn't they be NA instead? I guess they're all getting dropped in the filter that happens at the end of the BGA function:
```python
if not debug:
    bga_out = (
        bga_out[
            ~bga_out.missing_from_923
            & ~bga_out.plant_w_bad_generator
            & ~bga_out.unmapped_but_in_923
            & ~bga_out.unmapped
        ]
    )
```
At this point my suspicion is that there's a `drop_duplicates()` that isn't behaving as expected because a dataframe has duplicate records with the same values encoded in different data types, like `string` vs. `object`, or dates as strings vs. dates as datetime objects, which would lead to multiple copies of the same boiler-generator pairs getting edges assigned between them in the connected subcomponent analysis. What are some ways we could test for this in debugging?

- `plant_id_eia`: float vs. int
- `boiler_id`: string vs. object
- `generator_id`: string vs. object
- `report_date`: datetime64 vs. string or object

Where's the right place to look at this? Probably immediately before the NetworkX stuff happens.
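One hedged way to test for this: normalize the key columns to canonical types and see whether `drop_duplicates()` / `duplicated()` then finds duplicates it missed on the raw frame. The data below is a toy example whose column names just mimic the real tables:

```python
import pandas as pd

# Toy frame where the same plant/generator pair appears twice, once with
# plant_id_eia rendered as "1004" and once as the float-ish "1004.0".
bga = pd.DataFrame({
    "plant_id_eia": ["1004", "1004.0", "1005"],
    "generator_id": ["CC1", "CC1", "CC2"],
})

raw_dupes = bga.duplicated().sum()           # 0 -- the strings don't match

# Normalize to a canonical nullable integer type, then re-check.
normalized = bga.assign(
    plant_id_eia=pd.to_numeric(bga.plant_id_eia).astype("Int64")
)
norm_dupes = normalized.duplicated().sum()   # 1 -- the hidden duplicate

# If normalization exposes extra duplicates, mixed representations were
# hiding identical records from drop_duplicates().
assert raw_dupes == 0 and norm_dupes == 1
```

Running a comparison like this on each candidate column right before the NetworkX step would pinpoint which column's representation is drifting.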
Yes, it looks like the issue was that `plant_id_eia` was getting treated as a float in some places and an int in others, and when it was converted into a string for use in the boiler + generator IDs for consumption by NetworkX, those representations resulted in different strings. God, we have to standardize all of the typing further upstream and stop with this perpetual futzing.
Now we can get all the way through producing data package outputs, but data package validation fails and returns no table-level errors. WTF? It turns out there's a default limit of 1000MB of memory usage in `frictionless.validate_package()`, which our validation is exceeding on some of the larger tables. Need to bump that up. The CSVs for all the data are about 500MB, and in the worst case that can be 20x as much in memory as pandas dataframes, so 8-10GB should be enough.
Unfortunately, after increasing the memory limit and the number of rows that are validated... we get a `KeyError` from within @ezwelty's goodtables_pandas library. On a freshly output data package including `ferc1`, `eia860`, and `eia923` data, I run:
```python
report = goodtables.validate(
    str(datapkg_json),
    limit_memory="8GB",
    query={
        "limitRows": 100_000,
    }
)
```
and the stack trace that comes back:
```
~/miniconda3/envs/pudl-dev/lib/python3.9/site-packages/goodtables_pandas/validate.py in validate(source, source_type, return_tables, **options)
    131     # Parse table
    132     report["tables"][i]["scope"] += ["type-error"]
--> 133     result = parse_table(result, schema=resource.get("schema", {}))
    134     if isinstance(result, list):
    135         report["tables"][i]["errors"] += result

~/miniconda3/envs/pudl-dev/lib/python3.9/site-packages/goodtables_pandas/parse.py in parse_table(df, schema)
     29     errors = []
     30     for field in schema.get("fields", []):
---> 31         result = parse_field(df[field["name"]], **field)
     32         if isinstance(result, ValueTypeError):
     33             # HACK: Add field name to parsing error

~/miniconda3/envs/pudl-dev/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3453         if self.columns.nlevels > 1:
   3454             return self._getitem_multilevel(key)
-> 3455         indexer = self.columns.get_loc(key)
   3456         if is_integer(indexer):
   3457             indexer = [indexer]

~/miniconda3/envs/pudl-dev/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'record_id'
```
However this doesn't happen if I only try and check 10,000 rows. The fact that it's 🤮ing on `record_id` means that it's having an issue with one of the FERC tables. The bit at the end about the `isna()` stuff makes me suspect that it's related to some of the changes in how nullable data types are treated / propagated in pandas 1.3.x.
Another odd thing I see is that with a smaller number of rows being checked, the report that comes back (error free) seems to have a suspiciously large number of perfectly round powers of 2 in the `bytes` field for the resources being validated: 8,192, 16,384, 32,768, 65,536...
I think I may have gotten overly paranoid here when messing around with the `goodtables_pandas` validation parameters -- Ethan says the library runs with `limit_rows=1` in order to check headers and other non-data information using the `frictionless.validate()` function, and then does its own internal pandas-based validation of all of the data. So I think I saw that only a single row was getting checked and upped the explicit number of rows, which was then getting passed through to `frictionless.validate()` and causing memory problems over there. Just calling `goodtables_pandas.validate(datapkg_json)` produces no errors, takes about 1 minute on a full ferc1 + eia output, and according to @ezwelty this should check all of the data.
The discrepancy between what's happening under the hood and what's getting reported in the return value is confusing though.
Several test / build failures appear when using pandas v1.3.0, e.g. #1056 (having to do with the way errors are reported in data package validation) and #1057 (some kind of pyarrow import error while running the docs build). Pinning to v1.2.x avoids the issues, but we should figure out why stuff is suddenly breaking.
We should also really set up our requirements in a form that the dependabot can read, so that we catch these issues as soon as they arise. See #986.
Known pandas 1.3.x issues:

- `plant_id_eia` shows up as float in the `boiler_generator_assn_eia860` table prior to output.
- Harvesting of `nerc_region` is suddenly found to be too inconsistent.
- Data package validation returns `{}` due to high memory usage.
- `KeyError` from within `goodtables_pandas` during validation.