catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
478 stars 110 forks source link

FERC Form 1 Statement of Income Contains Holes #2755

Closed ncrowley13 closed 1 year ago

ncrowley13 commented 1 year ago

Describe the bug

Upon obtaining the .csv FERC Form 1 files derived from XBRL data using PUDL, we find that some fields have missing data that should not be missing. For example, the file "statement_of_income_114_duration.csv" has missing data in the field "net_interest_charges," but when we check the PDF FERC Form 1 document, the field is not missing. We do not know the full extent of this issue, but has impacted our work.

Bug Severity

How badly is this bug affecting you? -High/Medium: we basically can't use the data from PUDL if we don't trust that it's accurate. Our alternative right now is manually entering data from PDF files for data since the XBRL shift (2021 and 2022).

To Reproduce

I don't need to provide code--it's simply an issue inside the CSV files in PUDL.

Expected behavior

I expected there to be data in the field "net_interest_income," but there is not.

Software Environment?

-Windows

Additional context

(none).

zaneselvans commented 1 year ago

Hey @ncrowley13 thanks for bringing this to our attention.

Can you give us some examples of individual utilities that are displaying this issue?

Also where are you getting the individual utility filing PDFs that you're comparing the data against?

Looking at the processed FERC 1 data in the PUDL DB, I am also seeing only a very small number of records for this table and income type. See this query

SELECT * FROM income_statement_ferc1
WHERE
  "income_type" = "net_interest_charges"
  AND "report_year" >= 2021
ORDER BY
  utility_id_ferc1,
  report_year,
  utility_type,
  income_type

This does seem to correspond to the data that's available in the XBRL derived raw database. Query here

SELECT
  entity_id,
  start_date,
  end_date,
  utility_type_axis,
  net_interest_charges
FROM
  statement_of_income_114_duration
WHERE
  net_interest_charges not null
  AND end_date = "2021-12-31"

Which means that if there's reported data that's being lost, it must be getting lost earlier in the process.

It's also possible that there were additional FERC Form 1 filings that came in after we archived the 2021 data from their RSS feed. We're working on automating that archiving process now so that it can happen every month rather then just when we do our big annual updates.

@zschira built the system that converts the XBRL filings to SQLite so he'd probably be best positioned to debug an issue in that conversion process if one exists!

ncrowley13 commented 1 year ago

Thanks for the response. As you suggest, most companies in the data have this issue--and it seems like the issue exists for a lot of fields, at least on the Statement of Income. In reality, every utility has interest charges, but it appears only a handful are showing up in your data with interest charges.

FERC Form 1 PDFs can be obtained for any company and virtually any year at https://elibrary.ferc.gov/eLibrary/search, though they are now being published in HTML format. If you find Alabama Power's 2021 Form 1 (chosen because it's alphabetically first in my list of utilities), and locate the statement of income, you'll see there's a value in the field for Net Interest Charges even though it does not appear in the PUDL csv. I'm not sure if this is what you are asking for. I would be very impressed if you did all this work on the data but never actually looked at any actual utility filings! image

Could this be an issue for basically every utility reporting their data? Or somehow on the FERC end?

zaneselvans commented 1 year ago

We frequently refer to the blank forms since they've got explanations about what the data is but we've never really looked at the PDF filings (at least, not since the first time we compiled this data manually for Xcel Energy in CO back in like, 2013 which is partly what inspired us to do it more programmatically). We've mostly just referred to the data in the old DBF databases since that's what was being used to generate the old PDFs with FERC's Form Viewer application.

I am pretty positive that we've got Alabama Power data from the Form 1 for 2021, and it would be very strange if they had made a filing, but only part of it was showing up in the RSS feed. Unless the frequent filing amendments that happen aren't complete re-filings, and only include the subset of the schedules which are being updated. My god that would be a nightmare.

What I think we'll probably do is:

ncrowley13 commented 1 year ago

Thanks for the quick response. I will mention that we grabbed all the 2021 XBRL data from ferc.gov and performed our own version of what the folks at Catalyst Coop are doing, and we found the same issue. So, either we're both making the same errors, or there's a problem happening further up the chain, as you suggest.

If the problem is happening further up the chain--ie if FERC has a problem with their website or every company filing is messed up somehow, does Catalyst Coop have any contact with FERC staff to point out this problem?

[General complain zone: I probably sound like an old fogey, but FERC's switch to XBRL from its old Foxpro data base seems like a major mistake so far. It's unbelievably more difficult to get the data compared to the old days 2+ years ago. I used to be able to get every single field for every company and every year into a STATA file in a single morning.]

zaneselvans commented 1 year ago

Ugh, it sounds like it's probably an upstream problem with FERC's data or publishing mechanism. We have some contacts there and if we can identify the issue specifically hopefully we can get someone to fix it, or explain how we're supposed to be getting the data from the RSS feed such that this problem doesn't happen.

They really had to get off of DBF because the file format was no longer supported (Microsoft EOL was in like 2009), but I agree XBRL was not a great choice when it could have just been a modern database. I think they kind of got swindled by the XBRL people, who have a "non-profit" that helps agencies switch to using XBRL, and then coincidentally also sells subscriptions to their own relational database version of the data that the agencies publish in XBRL for an exorbitant fee.

We've been working on the XBRL integration continuously for the last 1.5 years. It's truly a mess.

ncrowley13 commented 1 year ago

Yeah, we used to joke about the FoxPro database in my office, but little did I know how good I had it.

I really appreciate that Catalyst Coop is working on this for the rest of us. The acquisition of "free" data shouldn't require weeks of work and computer science background. I'm just trying to do so basic industry analysis over here!

zaneselvans commented 1 year ago

Welp, Alabama Power does appear to have Net Interest Charges based on the eLibrary version.

image

And there plenty of other value reported for Alabama Power in 2021 in the income statement table. This is a narrow gap.

SELECT
  *
FROM
  denorm_income_statement_ferc1
WHERE
  "utility_name_ferc1" LIKE "%ALABAMA POWER%"
  AND "report_year" = 2021
ORDER BY
  utility_id_ferc1,
  report_year,
  utility_type,
  income_type

Query results

zschira commented 1 year ago

I've been diving into the raw XBRL filings, and my findings are perplexing. I've looked at the Alabama Power Co. filing, and the missing values for Net Interest Charges do show up there, however, the Context they are associated with is missing the UtilityType Dimension.

Raw values:

<ferc:NetInterestCharges
id="ad3VybDovL2RvY3MudjEvZG9jOjViNmM5MTU5Zjc4MjRlZGE4Mjg2Y2M5MTIxN2Q1ZWJhL3RhYmxlOjJjN2I0ZjNhNmYyMTRkODBiMGEyNTE2NmQ0ZGFhMjFmL3RhYmxlY2VsbDoyYzdiNGYzYTZmMjE0ZDgwYjBhMjUxNjZkNGRhYTIxZjAwMDAwMDAwMDYwMDAwMDAwMDgzOTIyMzM3MjAzNjg1NDcxMTY4Ng_cell-10-G84-70-c"
contextRef="i829d007c69904628b0a3c13b5468ad19_D20210101-20211231" decimals="0"
unitRef="i7b5280ae4bc442e496db1529a6c4004e">
    339557616
</ferc:NetInterestCharges>

<ferc:NetInterestCharges
id="ad3VybDovL2RvY3MudjEvZG9jOjViNmM5MTU5Zjc4MjRlZGE4Mjg2Y2M5MTIxN2Q1ZWJhL3RhYmxlOjJjN2I0ZjNhNmYyMTRkODBiMGEyNTE2NmQ0ZGFhMjFmL3RhYmxlY2VsbDoyYzdiNGYzYTZmMjE0ZDgwYjBhMjUxNjZkNGRhYTIxZjAwMDAwMDAwMDcwMDAwMDAwMDgzOTIyMzM3MjAzNjg1NDcxMTY4Ng_cell-10-H84-70-d"
contextRef="i67b6c47d29d441139c7d0bd1f2667fea_D20200101-20201231" decimals="0"
unitRef="i7b5280ae4bc442e496db1529a6c4004e">
    338212980
</ferc:NetInterestCharges>

Corresponding contexts:

  <xbrli:context id="i829d007c69904628b0a3c13b5468ad19_D20210101-20211231">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.ferc.gov/CID">C001552</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2021-01-01</xbrli:startDate>
      <xbrli:endDate>2021-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>

  <xbrli:context id="i67b6c47d29d441139c7d0bd1f2667fea_D20200101-20201231">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.ferc.gov/CID">C001552</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2020-01-01</xbrli:startDate>
      <xbrli:endDate>2020-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>

This means when our extractor is parsing the XBRL filing, it has no way to know that these values belong in the table. This is because a Concept name like NetInterestCharges can show up in multiple tables (this one specifically does not, but many do), so without the Dimension it is ambiguous where a value should end up in the extracted DB.

To add to the weirdness, I've also spot checked the DepreciationExpense values, which do show up in our DB. These values for some reason have duplicates. One with the same ambiguity issue as NetInterestCharges, and one that contains the UtilityType Dimension:

<ferc:DepreciationExpense id="ad3VybDovL2RvY3MudjEvZG9jOjViNmM5MTU5Zjc4MjRlZGE4Mjg2Y2M5MTIxN2Q1ZWJhL3RhYmxlOjJjN2I0ZjNhNmYyMTRkODBiMGEyNTE2NmQ0ZGFhMjFmL3RhYmxlY2VsbDoyYzdiNGYzYTZmMjE0ZDgwYjBhMjUxNjZkNGRhYTIxZjAwMDAwMDAwMDYwMDAwMDAwMDE3OTIyMzM3MjAzNjg1NDcxMTY4Ng_cell-10-G18-6-c"
contextRef="i829d007c69904628b0a3c13b5468ad19_D20210101-20211231" decimals="0"
unitRef="i7b5280ae4bc442e496db1529a6c4004e">
    724504894
</ferc:DepreciationExpense>

<ferc:DepreciationExpense
id="ad3VybDovL2RvY3MudjEvZG9jOjViNmM5MTU5Zjc4MjRlZGE4Mjg2Y2M5MTIxN2Q1ZWJhL3RhYmxlOjJjN2I0ZjNhNmYyMTRkODBiMGEyNTE2NmQ0ZGFhMjFmL3RhYmxlY2VsbDoyYzdiNGYzYTZmMjE0ZDgwYjBhMjUxNjZkNGRhYTIxZjAwMDAwMDAwMTAwMDAwMDAwMDE3OTIyMzM3MjAzNjg1NDcxMTY4Ng_cell-10-K18-6-g"
contextRef="ie4dc47e8bc1643a28e7550820e8d3f93_D20210101-20211231" decimals="0"
unitRef="i7b5280ae4bc442e496db1529a6c4004e">
    724504894
</ferc:DepreciationExpense>

<ferc:DepreciationExpense
id="ad3VybDovL2RvY3MudjEvZG9jOjViNmM5MTU5Zjc4MjRlZGE4Mjg2Y2M5MTIxN2Q1ZWJhL3RhYmxlOjJjN2I0ZjNhNmYyMTRkODBiMGEyNTE2NmQ0ZGFhMjFmL3RhYmxlY2VsbDoyYzdiNGYzYTZmMjE0ZDgwYjBhMjUxNjZkNGRhYTIxZjAwMDAwMDAwMDcwMDAwMDAwMDE3OTIyMzM3MjAzNjg1NDcxMTY4Ng_cell-10-H18-6-d"
contextRef="i67b6c47d29d441139c7d0bd1f2667fea_D20200101-20201231" decimals="0" unitRef="i7b5280ae4bc442e496db1529a6c4004e">
    693129897
</ferc:DepreciationExpense>

<ferc:DepreciationExpense
id="ad3VybDovL2RvY3MudjEvZG9jOjViNmM5MTU5Zjc4MjRlZGE4Mjg2Y2M5MTIxN2Q1ZWJhL3RhYmxlOjJjN2I0ZjNhNmYyMTRkODBiMGEyNTE2NmQ0ZGFhMjFmL3RhYmxlY2VsbDoyYzdiNGYzYTZmMjE0ZDgwYjBhMjUxNjZkNGRhYTIxZjAwMDAwMDAwMTEwMDAwMDAwMDE3OTIyMzM3MjAzNjg1NDcxMTY4Ng_cell-10-L18-6-h"
contextRef="i0ac1385c17ec45bea1f7be3c6e94840e_D20200101-20201231" decimals="0"
unitRef="i7b5280ae4bc442e496db1529a6c4004e">
    693129897
</ferc:DepreciationExpense>

Looking at the new contexts introduced here, we see that they do contain the Axis in question:

  <xbrli:context id="ie4dc47e8bc1643a28e7550820e8d3f93_D20210101-20211231">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.ferc.gov/CID">C001552</xbrli:identifier>
      <xbrli:segment>
        <xbrldi:explicitMember dimension="ferc:UtilityTypeAxis">ferc:ElectricUtilityMember</xbrldi:explicitMember>
      </xbrli:segment>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2021-01-01</xbrli:startDate>
      <xbrli:endDate>2021-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>

  <xbrli:context id="i0ac1385c17ec45bea1f7be3c6e94840e_D20200101-20201231">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.ferc.gov/CID">C001552</xbrli:identifier>
      <xbrli:segment>
        <xbrldi:explicitMember dimension="ferc:UtilityTypeAxis">ferc:ElectricUtilityMember</xbrldi:explicitMember>
      </xbrli:segment>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2020-01-01</xbrli:startDate>
      <xbrli:endDate>2020-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>

I'm going to send some questions to Ferc regarding this issue, in the past Robert Hudson has been quite responsive with XBRL related questions, so hopefully we'll hear back soon. I'll also continue to read more on the XBRL standard to see if there's some weird quirk that would allow us to handle these ambiguous values.

ncrowley13 commented 1 year ago

Thanks a lot. I have also sent a message to the generic FERC helpdesk, but have not heard back. I suspect Catalyst Coop will be better able to get at the heart of the issue, since you have a better handle on the mechanics of XBRL.

Please let me know when you learn, or if they initiate some effort to change things!

zschira commented 1 year ago

@ncrowley13 we'll make sure to keep you in the loop!

Also, we're working on identifying as many gaps in the data as possible to help debug/fix the problem, so if you can highlight any other obvious issues you've run into, it might help us get this figured out.

zschira commented 1 year ago

With help from @jdangerx I've done a deep dive into all of the reported facts in 2021 that don't end up anywhere in our DB. In all of the filings we've archived in 2021 (minus a couple that had formatting differences that made them difficult to use in the analysis), there were 2,196,782 facts reported. Of these, ~7% don't end up in our DB anywhere. Of that 7%, about a quarter were OrderNumber which is only used by FERC for formatting, and I've ignored them. Of the remaining facts, ~86% of them follow the exact same pattern as the NetInterestCharges that I investigated yesterday. By this I mean they do not have a UtilityType associated with them, but if they did, it would be unambiguous where they belong, and they would end up in a table. From looking at some of FERC's html renderings of submitted forms, it seems that they treat values like this as a total across utility types.

Given this information, I believe we could easily modify our extractor to handle these facts the same way FERC is, and this alone would account for the vast majority of these data holes. There would still be a couple thousand "lost facts" (less than 1% of the total submitted) that would need further investigation. My hunch is that many of these might be one off filing inconsistencies/errors, but I haven't had a chance to look at these deeply yet.

jdangerx commented 1 year ago

Great digging, @zschira ! I think we should set some thresholds for how many lost facts we should accept. Certainly the threshold is higher than 0%. I think the next steps are to spend a little more time investigating to see if there are other large categories of lost facts we can knock off - and if not, we could probably just set the threshold at 1% to detect any new large categories.

zaneselvans commented 1 year ago

Is it possible that any of the facts that do end up in the database already might still be missing some dimensional information? We're depending on having values for utility type, plant status, and plant function to be able to encode and reproduce calculations. If it is possible, is there any way that we could identify those records or correct them?

If a large majority of filings include all the expected information, I wonder if there's any way that we could identify outliers that don't match the prevailing pattern.

ncrowley13 commented 1 year ago

It sounds as though Catalyst can rectify a large portion of the data issues that currently exist in the Form 1. If so, that's great! At the risk of getting ahead of things, do you have a guess as to when the 2022 Form 1 data might be available through PUDL?

zaneselvans commented 1 year ago

@jdangerx is looking into the differences between the 2021 and 2022 taxonomies right now. It's the first time we will have been integrating more than 1 year of XBRL data together so I suspect there will be some growing pains. The raw extracted SQLite DB should happen within the next month hopefully. Getting all the new data processed and normalized will probably take longer.

zschira commented 1 year ago

Is it possible that any of the facts that do end up in the database already might still be missing some dimensional information

I believe all of the facts missing dimensional information will end up with these same ambiguity issues, so I don't think any of them will end up in the database.

If a large majority of filings include all the expected information, I wonder if there's any way that we could identify outliers that don't match the prevailing pattern.

So far it seems like a little under half of the 2021 filings have at least 1 missing fact. I'd guess there's some more patterns we can pick out there (probably some common filing software that leads to similar problems). @jdangerx and I started working on an integration test that will allow us keep track of the number of lost facts, and collect some stats on them.

campbellpryde commented 1 year ago

On the form 1 statement of income any rows after 30 are not broken down by the utility type. The form actually has this data blacked out. So the interest items from row 62 through 70 will never have the utility axis. The values are reported as an actual total across electric, gas and other utility types.

For the 2021 filing for Alabama Power available here: https://ecollection.ferc.gov/api/DownloadDocument/142219/3?filename=wk-20211231.xm_90098.html you can look at schedule 114-117 and the value for line 70 you will see it is blacked out, so no XBRL data is available for this. Under foxPro they may have been able to file data, but this was not expected.

For Alabama Power data before 2021 the data was converted from FoxPro to an XBRL format, but only the expected data was transformed. Be aware this earlier data in XBRL will not be as clean as more recent data as the fox pro data had limited error checking.

Any data above row 30 must have utility data. If you see any that don't I would be interested in seeing these.

cmgosnell commented 1 year ago

@campbellpryde thanks this is very helpful! Do you know of other instances of the xbrl tables having mixed or different axis columns? It seems like in the transition a lot of the pages in the form that have multiple schemas got broken out into multiple tables - which for the record has been really helpful.

For our purposes while transforming the data once its getting properly slurped into the sqlite db, this suggests we may need to selectively fill in utility type as total because we use utility type as a primary key for this table.

campbellpryde commented 1 year ago

If you know they only have one utility type, you could add the utility type to all the data that is reported as a total across all utilities.

Generally speaking an axis does not define a FERC schedule. A schedule can be comprised of no axis at all or in some case up to 3 axis. The values with no axis represent that total value across all the axis. So in the case of the income statement if you know they only report data as an electric utility then all facts with no axis at all are going to be applicable to the electric utility.

To get all the data in the income statement you should search for those concepts that appear in the statement, and return those facts with the utility type axis and no axis at all. https://xbrlview.ferc.gov/yeti/resources/yeti-gwt/Yeti.jsp#tax~(id~8*v~110)!con~(id~151444)!net~(a~143*l~35)!lang~(code~en)!path~(g~3849*p~0)!rg~(rg~4*p~1)

The underlying logic to all of this, is that a fact reported in a schedule will be defined exactly the same way across the entire form 1. This means the facts are independent of the schedule. One fact could appear in multiple schedules. This means that if you search for MaintenanceExpense, the same element is used in schedule 114 and 419.

Unfortunately, some facts that you would expect to be the same between the financials and the supporting schedules are not because of adjustments made by the utilities so the values do not match in all cases such as DepreciationExpense and DepreciationExpenseExcludingAdjustments. (In the vast majority the value of these 2 will be the same)

zschira commented 1 year ago

Hi @campbellpryde, thanks for the insight! Just to clarify, it sounds like we can assume that any time there is a fact with a missing axis, we should treat that fact as a total across the axis? And, does this imply that if there's a schedule with multiple axes and we see a fact missing 2 or more axes, that fact should be treated as a total across all of those axes? For example, for the schedule 262 Taxes Accrued, Prepaid and Charged During Year, Distribution of Taxes Charged, I've observed facts that are missing both the UtilityTypeAxis and the KindOfTaxAxis.

And, out of curiosity, is this behavior true across the board for XBRL, or is this a nuance of the FERC XBRL taxonomy?

Thanks again for the help!

campbellpryde commented 1 year ago

..we can assume that any time there is a fact with a missing axis, we should treat that fact as a total across the axis? -> YES And, does this imply that if there's a schedule with multiple axes and we see a fact missing 2 or more axes, that fact should be treated as a total across all of those axes? YES ..I've observed facts that are missing both the UtilityTypeAxis and the KindOfTaxAxis. -> This schedule has 3 axis. One axis breaks each of the rows down by the actual tax incurred. (Corresponds to each row that is not a total or subtotal) This is the TaxesAccruedPrepaidAndChargedAxis. Each of these has a subtotal for king of tax which uses the KindOfTaxAxis. If it only has the KindOfTaxAxis this means it is the subtotal value for that kind of tax. Those values with no axis is the total for each of the columns. Col I on this schedule uses the utility type axis. This is because these items should tie back to the income statement. I.e the subtotal for federal tax for col I should be the same as the row 15 on the income statement for col g, which also has the utility type axis.

Generally speaking an axis or a dimension is used in XBRL to disaggregate a report wide value down by a specific property or attribute.