catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
458 stars 105 forks source link

Non-unique merge keys in FERC1 assets with 2021-2022 data #2899

Closed jdangerx closed 9 months ago

jdangerx commented 9 months ago
pandas.errors.MergeError: Merge keys are not unique in left dataset; not a one-to-one merge

  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 948, in calculate_values_from_components
    calculated_df = pd.merge(
jdangerx commented 9 months ago

OK, gonna dive into electric_operating_expenses_ferc1 and hope any lessons generalize.

We're seeings lots of duplicates in the data when we're trying to assign calculated values to reported values.

dupes = data[data.duplicated(subset=data_idx, keep=False)].sort_values(data_idx)
dupes.info()
<class 'pandas.core.frame.DataFrame'>
Index: 17442 entries, 29960 to 75103

A lot of these are just straight-up full duplicates. Which we should find the source of, but we could also just drop for now. Post-drop we only have 170, which is 1% of before.

A bunch of these are just 'off-by-one reports' but some are not:

utility_id_ferc1_xbrl  report_year                                  xbrl_factoid  dollar_value
C000030         2021           administrative_and_general_expenses      433663.0
C000030         2021           administrative_and_general_expenses   254796713.0
C000030         2021  administrative_and_general_operation_expense      433663.0
C000030         2021  administrative_and_general_operation_expense   244364200.0

C000030 is PJM - and if we look in the XBRL for these dollar values, we see...

dazhong-catalyst@computer ~/p/d/2/ferc1> ag '254796713'
PJM_Interconnection,_L.L.C._form1_Q4_1649716496.xbrl
5666:    <ferc:AdministrativeAndGeneralExpenses id="f-456" contextRef="cr-00002" unitRef="u-02" decimals="0">254796713</ferc:AdministrativeAndGeneralExpenses>

dazhong-catalyst@computer ~/p/d/2/ferc1 [1]> ag 244364200
PJM_Interconnection,_L.L.C._form1_Q4_1649716496.xbrl
5664:    <ferc:AdministrativeAndGeneralOperationExpense id="f-454" contextRef="cr-00002" unitRef="u-02" decimals="0">244364200</ferc:AdministrativeAndGeneralOperationExpense>

dazhong-catalyst@computer ~/p/d/2/ferc1> ag '433663' 
PJM_Interconnection,_L.L.C._form1_Q4_1649716127.xbrl
1467:    <ferc:AdministrativeAndGeneralOperationExpense id="f-196" contextRef="cr-00002" unitRef="u-02" decimals="0">433663</ferc:AdministrativeAndGeneralOperationExpense>

1468:    <ferc:AdministrativeAndGeneralExpenses id="f-197" contextRef="cr-00002" unitRef="u-02" decimals="0">433663</ferc:AdministrativeAndGeneralExpenses>

So it looks like there were two filings - in the first one, on Mon Apr 11 18:28:47 EDT 2022, both these values were reported as 433663. Then, in the second one, on Mon Apr 11 18:34:56 EDT 2022, both these values were reported with much greater values.

Questions:

  1. Why didn't the extractor collapse these two values into the later value?
jdangerx commented 9 months ago

The extractor didn't collapse those two values because it only dedupes within a single year of data!

The 2022 report from PJM Interconnection has values for 2021 and 2022 - and the 2021 value matches the newer one from the 2021 report.

```xml C000030 2021-01-01 2021-12-31 C000030 2022-01-01 2022-12-31 254796713 259766554 ```

But these records don't get deduped because we only dedupe across all the instances in a single archive. So the deduping has to happen across all of the instances across all years.

Additionally, both 2021 reports share a ReportDate, which due to https://github.com/catalyst-cooperative/pudl/issues/2822 means that the deduplication is ambiguous - and we actually appear to be grabbing the wrong value from 2021. So we get the lower 2021 value from the 2021 filings, and the higher 2021 and 2022 values from the 2022 filing.

I think we need to do two things:

  1. Actually address https://github.com/catalyst-cooperative/pudl/issues/2822 - this requires including the actual report publication date in our process.
  2. Handle multiple instance sources in get_instances so that we can parse multiple years of data at once. a. do this in a way that doesn't stop us from using different taxonomies per-instance, in the future - the 2023 data will be using a different taxonomy.
jdangerx commented 9 months ago

Output of discussion with @zaneselvans and @zschira:

We want to keep all the reported data in the extracted SQLite databases, and just do the deduplication PUDL-side. To do that:

jdangerx commented 9 months ago

Adding the publication date from rssfeed means that people who want to use ferc-xbrl-extractor outside of our FERC XBRL archives would need to construct an rssfeed file that exactly matches our undocumented format.

I went down a bit of a rabbit hole trying to figure out the least annoying way to make that work best for those users.

  1. Do they exist? It sort of seems like our XBRL archives are the most convenient way to get the FERC XBRL filings anyway - we've done all the RSS feed parsing & downloading, etc.

  2. Regardless, I think the nicest thing to do is change the script to use an --instance-config option that takes in a yaml file like this:

    workers: 3
    batch_size: 10
    clobber: True
    
    sqlite_out: ferc1.sqlite
    metadata_out: metadata.json
    datapackage_out: datapackage.json
    
    taxonomy_location: bar.zip
    taxonomy_relpath: taxonomy/....
    
    filing_location: foo.zip
    filings:
     - relpath: abc.xbrl
       publication_date: 2023-05-11T01:23:45.678Z
     - ...

    Which gets parsed into memory as some sort of config object.

    Then our existing flags could just be used as shorthand to set that config.

    I guess that seems pretty straightforward, so it's probably worth doing.

jdangerx commented 9 months ago

Parts 1 and 2 of the above list are being fixed here: https://github.com/catalyst-cooperative/ferc-xbrl-extractor/pull/151

zaneselvans commented 9 months ago

The Pydantic settings models make it super easy to do this kind of thing!