catalyst-cooperative / ferc-xbrl-extractor

A tool for converting FERC filings published in XBRL into SQLite databases
MIT License

Test official arelle package #36

Closed (zschira closed 1 year ago)

zschira commented 1 year ago

Arelle has begun officially packaging the library for PyPI here. It would probably make sense to switch to this version, assuming it doesn't break anything, so that we can get updates and improvements.

zaneselvans commented 1 year ago

@zschira Thinking about the locale issue that forced us to fork Arelle in the first place, I was wondering if we had tried resetting the locale within the XBRL extraction script itself? Like, is there a way that we can undo the change that Arelle makes once we're finished using Arelle? It doesn't seem to cause a problem in the XBRL extraction -- just downstream in other scripts that need to deal with character encoding, which apparently varies by locale.
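
For concreteness, here is a minimal sketch of what "undoing the change" could look like using only the standard library `locale` module. The `preserve_locale` name is hypothetical, and the commented-out Arelle call is a placeholder for whatever code actually triggers the locale change. This has not been tested against Arelle, and it only restores the locale of the current process, so it may not help if the effect really does leak elsewhere.

```python
import locale
from contextlib import contextmanager


@contextmanager
def preserve_locale(category=locale.LC_ALL):
    """Restore the process locale for `category` when the block exits.

    Hypothetical helper, not part of Arelle or ferc-xbrl-extractor.
    """
    # Calling setlocale() with no second argument only queries the
    # current setting; it does not change anything.
    saved = locale.setlocale(category)
    try:
        yield
    finally:
        # Put back whatever was active before the block ran.
        locale.setlocale(category, saved)


with preserve_locale():
    # Placeholder: run the Arelle-based extraction here.
    # Anything it does to the locale is rolled back afterwards.
    locale.setlocale(locale.LC_ALL, "C")
```

Note that `locale.setlocale` is documented as not thread-safe, so a wrapper like this would presumably need to sit around the whole extraction rather than inside worker threads.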

There's a mention of fixing some locale issues in Arelle in this issue about Python 3.11 compatibility and I wonder if they are related to our problem. I didn't see any other issues discussing locale bugs though.

Unfortunately the same locale issue that caused us to fork the Arelle repo is still present in the current released version (v2.2.2). It doesn't cause any problems until quite a ways downstream -- during the EIA923 ETL when we are adding FIPS codes to the coal mine table.

It's wild to me that this effect persists not just within a given Python process, but across processes in the OS environment. So bad.

src/pudl/etl.py:515: in etl
    sqlite_dfs.update(_etl_eia(datasets["eia"], ds_kwargs))
src/pudl/etl.py:103: in _etl_eia
    eia923_transformed_dfs = pudl.transform.eia923.transform(
src/pudl/transform/eia923.py:1225: in transform
    eia923_transform_functions[table](eia923_raw_dfs, eia923_transformed_dfs)
src/pudl/transform/eia923.py:996: in coalmine
    cmi_df = _coalmine_cleanup(cmi_df)
src/pudl/transform/eia923.py:457: in _coalmine_cleanup
    cmi_df.assign(
../../../mambaforge/envs/pudl-dev/lib/python3.10/site-packages/pandas/core/generic.py:5512: in pipe
    return com.pipe(self, func, *args, **kwargs)
../../../mambaforge/envs/pudl-dev/lib/python3.10/site-packages/pandas/core/common.py:497: in pipe
    return func(obj, *args, **kwargs)
src/pudl/helpers.py:190: in add_fips_ids
    af = addfips.AddFIPS(vintage=vintage)
../../../mambaforge/envs/pudl-dev/lib/python3.10/site-packages/addfips/addfips.py:66: in __init__
    self._counties = self._load_county_data(vintage)
../../../mambaforge/envs/pudl-dev/lib/python3.10/site-packages/addfips/addfips.py:85: in _load_county_data
    for row in csv.DictReader(f):
../../../mambaforge/envs/pudl-dev/lib/python3.10/csv.py:111: in __next__
    row = next(self.reader)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <encodings.ascii.IncrementalDecoder object at 0x1696218a0>
input = b'on County\n28,153,Wayne County\n28,155,Webster County\n28,157,Wilkinson County\n28,159,Winston County\n28,161,Yalobu...nesee County\n36,039,Greene County\n36,041,Hamilton County\n36,043,Herkimer County\n36,045,Jefferson County\n36,047,Ki'
final = False

    def decode(self, input, final=False):
>       return codecs.ascii_decode(input, self.errors)[0]
E       UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7113: ordinal not in range(128)

../../../mambaforge/envs/pudl-dev/lib/python3.10/encodings/ascii.py:26: UnicodeDecodeError
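
To illustrate the failure mode in the traceback above: byte `0xc3` is the lead byte of a two-byte UTF-8 sequence (for example the "ñ" in a county name), so a file object that falls back to the ASCII codec cannot decode it. The sketch below reproduces that with an in-memory CSV row. The sample row and the `read_rows` helper are illustrative assumptions and are not taken from the addfips data files or code.

```python
import csv
import io
import locale

# Illustrative UTF-8 encoded CSV row containing a non-ASCII character.
SAMPLE = "35,013,Doña Ana County\n".encode("utf-8")


def read_rows(raw: bytes, encoding: str) -> list[list[str]]:
    """Decode raw CSV bytes with an explicit encoding and parse the rows."""
    with io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="") as f:
        return list(csv.reader(f))


# On Python 3.10, open() without an encoding argument uses this value,
# which depends on the active locale (unless UTF-8 mode is enabled).
print("default text encoding:", locale.getpreferredencoding(False))

try:
    read_rows(SAMPLE, "ascii")
except UnicodeDecodeError as err:
    # Same class of error as in the traceback above.
    print("ascii failed:", err)

# Passing the encoding explicitly sidesteps the locale entirely.
rows = read_rows(SAMPLE, "utf-8")
print(rows)
```

If the locale change cannot be avoided, having the downstream reader pass `encoding="utf-8"` explicitly would be one way to make it independent of the locale, though that would require a change in the library doing the reading.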