Update `ferc1_to_sqlite` to consume both XBRL & DBF

zschira commented 2 years ago

Background

The first step in integrating XBRL data into PUDL is being able to generate the raw SQLite DB from archived XBRL filings. The ferc1_to_sqlite script can be updated to handle this while still producing the raw DB from the historical VFP-based (DBF) filings.

Design

There will be a separate settings object for managing XBRL data. This will allow the user to independently specify the years/tables they are interested in for both data formats. The script will then handle these accordingly, and produce both raw databases.
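As a rough sketch of what "separate settings per format" could look like (the class and field names here are hypothetical, not the actual PUDL settings classes, and `f1_steam` is only an illustrative table name):

```python
from dataclasses import dataclass, field


@dataclass
class FormatSettings:
    """Years and tables requested for one raw data format (hypothetical)."""

    years: list[int] = field(default_factory=list)
    tables: list[str] = field(default_factory=list)


@dataclass
class Ferc1ToSqliteSettings:
    """Top-level settings with one independent block per format (hypothetical)."""

    dbf: FormatSettings = field(default_factory=FormatSettings)
    xbrl: FormatSettings = field(default_factory=FormatSettings)


# Each format is configured on its own; leaving one block empty should
# simply mean "nothing requested" for that format.
settings = Ferc1ToSqliteSettings(
    dbf=FormatSettings(years=[2020], tables=["f1_steam"]),
    xbrl=FormatSettings(years=[2021]),
)
```

Keeping the two blocks structurally identical means the script can treat them the same way when deciding what to extract.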

Tasks

[x] Update Ferc1Settings object to specify output PUDL tables only
[x] Allow extraction of a subset of XBRL tables
[x] Gracefully allow extraction of XBRL inputs, DBF inputs, or both, depending on settings

cmgosnell commented 2 years ago

is this finished? i thiiink bc i used it :-)

zaneselvans commented 2 years ago

Some issues I've had trying to use this:

I have an old DBF-based ferc1.sqlite, and I want to create a new XBRL-based ferc1_xbrl.sqlite. I commented out the years in the DBF settings, expecting that only the XBRL conversion would run, but it complained that the old FERC 1 DB already exists and refused to clobber it, even though that DB shouldn't have been in the way of the new one.
After moving the DBF-based DB aside, I ran it again. Even though the settings file had no years for the DBF-based conversion, it still attempted to extract DBF data, and failed.
If I remove the DBF settings entirely from the YAML file, it still fails.
With 2020 selected for the DBF and 2021 for the XBRL, it seems to be running...

It seems like it should be possible to run just the DBF or just the XBRL conversions, and not have them interfere with each other since they create two separate databases, and read from two different sets of inputs. If there are only years (or tables) for one, then it should only do that conversion.
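Something like the following is the behavior I'd expect. This is only a minimal sketch: `plan_conversions` and its arguments are made up for illustration and are not existing PUDL functions, and the existing DBs are passed in as a set of file names to keep the example self-contained.

```python
# Output DB for each raw format, using the file names mentioned above.
OUTPUT_DBS = {"dbf": "ferc1.sqlite", "xbrl": "ferc1_xbrl.sqlite"}


def plan_conversions(dbf_years, xbrl_years, existing_dbs=(), clobber=False):
    """Return the formats to convert, skipping any with no years requested.

    The clobber check only looks at the outputs of the selected formats, so
    an existing DBF-based DB cannot block an XBRL-only run.
    """
    requested = {"dbf": dbf_years, "xbrl": xbrl_years}
    # An empty list or None both mean "do not run this conversion".
    selected = [fmt for fmt, years in requested.items() if years]
    for fmt in selected:
        if OUTPUT_DBS[fmt] in existing_dbs and not clobber:
            raise FileExistsError(
                f"{OUTPUT_DBS[fmt]} already exists; pass clobber=True to overwrite."
            )
    return selected
```

With this logic, `plan_conversions([], [2021], existing_dbs={"ferc1.sqlite"})` selects only the XBRL conversion and never touches the old DBF-based DB.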

zaneselvans commented 2 years ago

@zschira In the new settings file I see two separate sections for XBRL and DBF inputs, and then also two separate Settings objects internally, for XBRL and DBF sources. But the XBRL settings appear to be inverting what is specified in the DBF settings (and the EIA and other data source settings). In the DBF settings we're saying what outputs we want to get (in terms of the PUDL tables and years of data) rather than what inputs we want processed (in terms of the FERC 1 SQLite tables and years).

It seems like it would be simpler to adopt the desired outputs as what is specified, and store the logic as to which SQLite tables need to be processed to generate those outputs within the extract / transform module logic.

That way the user doesn't need to think about which data comes from which source, and we only have one kind of settings object to track. We'll already need logic that knows which source to go to for which years of data anyway.
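For example, the year routing could be as simple as the sketch below. The helper name is hypothetical, and the cutoff assumes 2021 is the first year filed in XBRL, which matches the 2020 / 2021 split I used above but should be checked against the archives.

```python
# Assumption: filings for 2021 and later are XBRL; earlier years are DBF.
FIRST_XBRL_YEAR = 2021


def split_years_by_source(years):
    """Split the requested output years into the raw source that provides each."""
    return {
        "dbf": sorted(y for y in years if y < FIRST_XBRL_YEAR),
        "xbrl": sorted(y for y in years if y >= FIRST_XBRL_YEAR),
    }
```

The user would then only list the years they want, and the extract step would decide which raw DB each year has to come from.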

Is there some other reason why the settings / params got flipped for XBRL that I'm not seeing?
