Closed jdangerx closed 9 months ago
OK, gonna dive into electric_operating_expenses_ferc1
and hope any lessons generalize.
We're seeings lots of duplicates in the data when we're trying to assign calculated values to reported values.
dupes = data[data.duplicated(subset=data_idx, keep=False)].sort_values(data_idx)
dupes.info()
<class 'pandas.core.frame.DataFrame'>
Index: 17442 entries, 29960 to 75103
A lot of these are just straight-up full duplicates. Which we should find the source of, but we could also just drop for now. Post-drop we only have 170, which is 1% of before.
A bunch of these are just 'off-by-one reports' but some are not:
utility_id_ferc1_xbrl report_year xbrl_factoid dollar_value
C000030 2021 administrative_and_general_expenses 433663.0
C000030 2021 administrative_and_general_expenses 254796713.0
C000030 2021 administrative_and_general_operation_expense 433663.0
C000030 2021 administrative_and_general_operation_expense 244364200.0
C000030 is PJM - and if we look in the XBRL for these dollar values, we see...
dazhong-catalyst@computer ~/p/d/2/ferc1> ag '254796713'
PJM_Interconnection,_L.L.C._form1_Q4_1649716496.xbrl
5666: <ferc:AdministrativeAndGeneralExpenses id="f-456" contextRef="cr-00002" unitRef="u-02" decimals="0">254796713</ferc:AdministrativeAndGeneralExpenses>
dazhong-catalyst@computer ~/p/d/2/ferc1 [1]> ag 244364200
PJM_Interconnection,_L.L.C._form1_Q4_1649716496.xbrl
5664: <ferc:AdministrativeAndGeneralOperationExpense id="f-454" contextRef="cr-00002" unitRef="u-02" decimals="0">244364200</ferc:AdministrativeAndGeneralOperationExpense>
dazhong-catalyst@computer ~/p/d/2/ferc1> ag '433663'
PJM_Interconnection,_L.L.C._form1_Q4_1649716127.xbrl
1467: <ferc:AdministrativeAndGeneralOperationExpense id="f-196" contextRef="cr-00002" unitRef="u-02" decimals="0">433663</ferc:AdministrativeAndGeneralOperationExpense>
1468: <ferc:AdministrativeAndGeneralExpenses id="f-197" contextRef="cr-00002" unitRef="u-02" decimals="0">433663</ferc:AdministrativeAndGeneralExpenses>
So it looks like there were two filings - in the first one, on Mon Apr 11 18:28:47 EDT 2022, both these values were reported as 433663. Then, in the second one, on Mon Apr 11 18:34:56 EDT 2022, both these values were reported with much greater values.
The extractor didn't collapse those two values because it only dedupes within a single year of data!
The 2022 report from PJM Interconnection has values for 2021 and 2022 - and the 2021 value matches the newer one from the 2021 report.
But these records don't get deduped because we only dedupe across all the instances in a single archive. So the deduping has to happen across all of the instances across all years.
Additionally, both 2021 reports share a ReportDate
, which due to https://github.com/catalyst-cooperative/pudl/issues/2822 means that the deduplication is ambiguous - and we actually appear to be grabbing the wrong value from 2021. So we get the lower 2021 value from the 2021 filings, and the higher 2021 and 2022 values from the 2022 filing.
I think we need to do two things:
get_instances
so that we can parse multiple years of data at once.
a. do this in a way that doesn't stop us from using different taxonomies per-instance, in the future - the 2023 data will be using a different taxonomy.Output of discussion with @zaneselvans and @zschira:
We want to keep all the reported data in the extracted SQLite databases, and just do the deduplication PUDL-side. To do that:
Adding the publication date from rssfeed
means that people who want to use ferc-xbrl-extractor
outside of our FERC XBRL archives would need to construct an rssfeed
file that exactly matches our undocumented format.
I went down a bit of a rabbit hole trying to figure out the least annoying way to make that work best for those users.
Do they exist? It sort of seems like our XBRL archives are the most convenient way to get the FERC XBRL filings anyway - we've done all the RSS feed parsing & downloading, etc.
Regardless, I think the nicest thing to do is change the script to use an --instance-config
option that takes in a yaml file like this:
workers: 3
batch_size: 10
clobber: True
sqlite_out: ferc1.sqlite
metadata_out: metadata.json
datapackage_out: datapackage.json
taxonomy_location: bar.zip
taxonomy_relpath: taxonomy/....
filing_location: foo.zip
filings:
- relpath: abc.xbrl
publication_date: 2023-05-11T01:23:45.678Z
- ...
Which gets parsed into memory as some sort of config object.
Then our existing flags could just be used as shorthand to set that config.
I guess that seems pretty straightforward, so it's probably worth doing.
Parts 1 and 2 of the above list are being fixed here: https://github.com/catalyst-cooperative/ferc-xbrl-extractor/pull/151
The Pydantic settings models make it super easy to do this kind of thing!