catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Update `ferc1_to_sqlite` to consume both XBRL & DBF #1668

Closed (zschira closed 2 years ago)

zschira commented 2 years ago

Background

The first step in integrating XBRL data into PUDL is generating the raw SQLite DB from archived XBRL filings. The `ferc1_to_sqlite` script can be updated to do this while still producing the raw DB from the historical VFP-based (Visual FoxPro DBF) filings.

Design

There will be a separate settings object for managing XBRL data. This will allow the user to independently specify the years/tables they are interested in for each data format. The script will then process each format according to its own settings and produce both raw databases.
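A minimal sketch of what two independent settings objects could look like, for illustration only. The class names, field names, and the XBRL table name are hypothetical and are not taken from the PUDL codebase.

```python
from dataclasses import dataclass, field


@dataclass
class DbfSettings:
    """Hypothetical settings for the DBF (Visual FoxPro) conversion."""

    years: list[int] = field(default_factory=list)
    tables: list[str] = field(default_factory=list)


@dataclass
class XbrlSettings:
    """Hypothetical settings for the XBRL conversion."""

    years: list[int] = field(default_factory=list)
    tables: list[str] = field(default_factory=list)


@dataclass
class Ferc1ToSqliteSettings:
    """Hypothetical container holding one settings object per data format."""

    dbf: DbfSettings = field(default_factory=DbfSettings)
    xbrl: XbrlSettings = field(default_factory=XbrlSettings)


# Years and tables are chosen independently for each format.
# The table names below are illustrative placeholders.
settings = Ferc1ToSqliteSettings(
    dbf=DbfSettings(years=[2019, 2020], tables=["f1_steam"]),
    xbrl=XbrlSettings(years=[2021], tables=["example_xbrl_table"]),
)
```

Keeping the two objects separate means a change to the XBRL selection cannot alter what the DBF conversion does, and vice versa.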

Tasks

cmgosnell commented 2 years ago

is this finished? i thiiink bc i used it :-)

zaneselvans commented 2 years ago

Some issues I've had trying to use this:

It seems like it should be possible to run just the DBF or just the XBRL conversions, and not have them interfere with each other since they create two separate databases, and read from two different sets of inputs. If there are only years (or tables) for one, then it should only do that conversion.
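The behavior described here could be sketched roughly as follows. The function name and the `"dbf"` / `"xbrl"` labels are hypothetical; this is not the actual `ferc1_to_sqlite` logic, only an illustration of skipping a conversion when no years are requested for it.

```python
def plan_conversions(dbf_years, xbrl_years):
    """Return which conversions to run, skipping any format with no years.

    Hypothetical helper: each format is considered on its own, so an empty
    selection for one format never blocks or triggers the other.
    """
    plan = []
    if dbf_years:
        plan.append("dbf")
    if xbrl_years:
        plan.append("xbrl")
    return plan


# Only DBF years were requested, so only the DBF conversion is planned.
dbf_only = plan_conversions(dbf_years=[2019, 2020], xbrl_years=[])
```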

zaneselvans commented 2 years ago

@zschira In the new settings file I see two separate sections for XBRL and DBF inputs, and then also two separate Settings objects internally, for XBRL and DBF sources. But the XBRL settings appear to invert what is specified in the DBF settings (and the EIA and other data source settings). In the DBF settings we're saying what outputs we want to get (in terms of the PUDL tables and years of data) rather than what inputs we want processed (in terms of the FERC 1 SQLite tables and years).

It seems like it would be simpler to adopt the desired outputs as what is specified, and store the logic as to which SQLite tables need to be processed to generate those outputs within the extract / transform module logic.

That way the user doesn't need to think about which data comes from which source, and we only have one kind of settings object to track. We'll already need to have logic that understands which source to go to for what years of data anyway.
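A rough sketch of that idea: the user specifies desired PUDL outputs, and the extract logic decides which source and raw tables are needed. Everything here is an assumption for illustration, including the 2021 cutover year, the function names, the table mapping, and the XBRL table name.

```python
# Assumption: filings for this year and later come from XBRL, earlier from DBF.
FIRST_XBRL_YEAR = 2021

# Hypothetical mapping from a PUDL output table to the raw tables it needs
# in each source. The XBRL table name is a placeholder.
SOURCE_TABLES = {
    "plants_steam_ferc1": {
        "dbf": ["f1_steam"],
        "xbrl": ["example_xbrl_steam_table"],
    },
}


def split_years_by_source(years):
    """Partition requested years into those served by DBF and by XBRL."""
    return {
        "dbf": sorted(y for y in years if y < FIRST_XBRL_YEAR),
        "xbrl": sorted(y for y in years if y >= FIRST_XBRL_YEAR),
    }


def resolve_inputs(pudl_tables, years):
    """Translate desired outputs into per-source years and raw tables.

    Sources with no requested years are omitted from the result.
    """
    resolved = {}
    for source, source_years in split_years_by_source(years).items():
        if not source_years:
            continue
        raw_tables = sorted(
            {raw for table in pudl_tables for raw in SOURCE_TABLES[table][source]}
        )
        resolved[source] = {"years": source_years, "tables": raw_tables}
    return resolved


# The user asks only for outputs; the split across sources is derived.
inputs = resolve_inputs(["plants_steam_ferc1"], [2020, 2021])
```

With this shape there is a single settings object expressed in terms of outputs, and the source-specific details live next to the extraction code.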

Is there some other reason why the settings / params got flipped for XBRL that I'm not seeing?

zaneselvans commented 2 years ago

See also #1830

@zschira if you want to subsume this issue inside that epic and close it that would be fine.