catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
465 stars 107 forks source link

Generalize Visual FoxPro / DBF extraction code and separate it from FERC Form 1 #1984

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

We have a bunch of Visual FoxPro / DBF to SQLite extraction / conversion code that lives in the pudl.extract.ferc1 module. Really this is more analogous to the XBRL extractor code, or the ExcelExtractor that we use for the EIA data, and should be split out into its own more generally applicable DBF extractor module.

The data this module should enable us to extract includes:

The forms 1, 2, 6, and 60 are already archived and should be available for programmatic access via pudl_datastore command. We do not anticipate any further updates to the existing DBF data that has been published. All new FERC data is being published using XBRL.

We use a modified version of the dbfread library to access this type of data.

@zschira created a scoping issue in #2335

Note that the old FERC EQR data could be structured very differently from the other form data, and it might be more appropriate to extract that data to CSV or Apache Parquet if it's just a couple of very long tables. This is a stretch goal and might really be an entirely separate project so... if the work doesn't apply directly, that's fine and it should be put off.

- [x] Pull the DBF to SQLite code out of [`pudl.extract.ferc1`](https://github.com/catalyst-cooperative/pudl/blob/main/src/pudl/extract/ferc1.py) into a generic `pudl.extract.dbf` module. (probably do this first get that change merged into `dev` before making a bunch of other changes)
- [x] Factor out the code that's of general utility to extracting all of the above forms.
- [x] Create a configuration or thin wrapper that adapts the generic code to work with specific form.
- [x] Use new system on Form 1
- [x] Use new system on Form 2 (see: #2564)
- [x] Use new system on Form 6 (see: #2595)
- [x] Integrate these extractors into the [`ferc_to_sqlite`](https://github.com/catalyst-cooperative/pudl/blob/main/src/pudl/convert/ferc_to_sqlite.py) script
- [x] Integrate these extractors into the `pudl.ferc_to_sqlite` DAG in Dagster.
- [x] Integrate the newly extracted SQLite DBs into our nightly builds and [Datasette deployment](https://data.catalyst.coop)
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2730
cmgosnell commented 1 year ago

I'd personally suggest separating the migration of the dbf -> sqlite conversion code from the pudl.extract.ferc1 module from the generalize-ing of the dbf extractor.

The former task is mostly copy/paste but will make working with the pudl.extract.ferc1 much more manageable. I'm also excited about the later task but know that form 2 is not on our immediate horizon.

zaneselvans commented 1 year ago

Sure, I agree. Happy to do that on a small independent PR once we get everything merged into XBRL integration.

zaneselvans commented 1 year ago

@zschira Not a lot here but related to #2335

zaneselvans commented 1 year ago

@rousik As I said to @zschira there's not a whole lot here, but this is the issue we were chatting about.

I wrote the original code and am happy to flesh this out more if that would be useful.

zaneselvans commented 1 year ago

@rousik asks:

I'm suspecting that we might want to do some cleanup of the FERC 2 table names, see https://github.com/catalyst-cooperative/pudl/blob/rousik-ferc2/src/pudl/package_data/ferc2/table_file_map.csv - e.g. f2_117_cmpinc_hedge seems to contain lot of excess useless information that just clutter the things. Thoughts?

My recollection is that the tables have names that are defined in the FoxPro DB independent of the filenames (the differed significantly from the filenames for FERC 1). Do the filenames just happen to match the names of the tables that are stored inside the database? Or are they being derived from the filenames? I don't think changing the table names in the translation from DBF to SQLite is a great idea, since if there's any documentation out there in the world explaining what's in the old DBs, it would apply pretty directly to the translated SQLite DB.

When we take the data from the translated DB and do more extensive transformations and combine it with the more recent XBRL data to create new PUDL DB tables, we'll give it a longer more readable table name.

For FERC 1 with the DBF to SQLite translation we tried to keep all of the data without discriminating, since otherwise nobody can access it, and we weren't 100% sure what would be useful and what wouldn't. However, early on there were two unusual tables that seemed to contain binary data (like they’d jammed PDFs or word docs into the DB or something) which also made up ~90% of all the overall bulk of the database, and we skipped them in the translation to SQLite. FERC later revised all the old data archives, removing all of that mysterious data!

rousik commented 1 year ago

@rousik asks:

I'm suspecting that we might want to do some cleanup of the FERC 2 table names, see https://github.com/catalyst-cooperative/pudl/blob/rousik-ferc2/src/pudl/package_data/ferc2/table_file_map.csv - e.g. f2_117_cmpinc_hedge seems to contain lot of excess useless information that just clutter the things. Thoughts?

My recollection is that the tables have names that are defined in the FoxPro DB independent of the filenames (the differed significantly from the filenames for FERC 1). Do the filenames just happen to match the names of the tables that are stored inside the database? Or are they being derived from the filenames? I don't think changing the table names in the translation from DBF to SQLite is a great idea, since if there's any documentation out there in the world explaining what's in the old DBs, it would apply pretty directly to the translated SQLite DB.

For FERC Form 2, it seems that the table names match filenames (tables are lower case, filenames all uppercase). I cxan leave as-is.

For FERC 1 with the DBF to SQLite translation we tried to keep all of the data without discriminating, since otherwise nobody can access it, and we weren't 100% sure what would be useful and what wouldn't. However, early on there were two unusual tables that seemed to contain binary data (like they’d jammed PDFs or word docs into the DB or something) which also made up ~90% of all the overall bulk of the database, and we skipped them in the translation to SQLite. FERC later revised all the old data archives, removing all of that mysterious data!

I haven't inspected the data itself, but I will take a look to see what's being exported to sqlite and if I can glean some meaning from those.

zaneselvans commented 1 year ago

In what sense are the contents of f2_117_cmpinc_hedge messy?

Really all we are trying to do in this step is translate the data into a modern format that's easy to access. We happen to be able to easily get all the annual DBs into a single multi-year DB which is nice, but all of the cleaning, reshaping, renaming etc. takes place when we extract from the FERC SQLite DB and try to integrate it into PUDL.

zaneselvans commented 1 year ago

@rousik Do you feel like the tasks in here map to stuff that actually needs to get done to get us to having the additional DBF data translated to SQLite for FERC 2, 6, & 60? Is there enough work in those sub-tasks that we want to break them out into their own issues? It seems like there will probably be significant quirks to manage in the individual datasets.

e-belfer commented 1 year ago

Moved two quality issues into #2748, closing this issue.