catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Archive a complete FERC Form 1 DB including all footnotes with PUDL v0.4.0 #1143

Closed: zaneselvans closed this issue 2 years ago

zaneselvans commented 2 years ago

We don't include a few very large binary-ish tables from the original ferc1 DB in the distributed PUDL data, since they increase the size of the DB by 10x, and are of very little use to anyone as is. But they should be archived in case someone does want to access and parse them.

Generate a full ferc1 DB including those strange tables, and archive it in our existing FERC 1 DB archive on Zenodo.

zaneselvans commented 2 years ago

I attempted to generate the full database, including the usually excluded tables which have historically contained a lot of binary data.

However, the database that was generated was small -- the same size as without these tables. I opened it up and found that while the tables were present and had records, the single column in each that had previously contained the binary blobs of data was 100% null. Seems weird.
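For anyone who wants to reproduce that check, here is a minimal sketch of how the per-column null fraction could be measured in the generated SQLite file. It uses only the standard library `sqlite3` module; the function name `null_fractions` is made up for illustration and is not part of PUDL.

```python
import sqlite3


def null_fractions(conn, table):
    """Return {column_name: fraction of rows that are NULL} for one table.

    Hypothetical helper for illustration; not a PUDL function.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    fractions = {}
    for col in columns:
        # COUNT(col) ignores NULLs, so total - COUNT(col) is the NULL count.
        non_null = conn.execute(
            f'SELECT COUNT("{col}") FROM "{table}"'
        ).fetchone()[0]
        fractions[col] = (total - non_null) / total if total else 0.0
    return fractions
```

Calling it with a connection from `sqlite3.connect(...)` and the name of one of the usually excluded tables would report `1.0` for any column that is entirely null, which is the symptom described above.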

I re-downloaded the ferc1-input-data.tgz from our v1.0.0 FERC 1 DB archive on Zenodo and found that it's about twice as large as the current one, even though the current one contains an additional year. Zipped, the old archive is 2.5GB, while the new one is only 1.2GB. So there's definitely less data in the newer files from FERC.
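To narrow down where the missing ~1.3GB went, one option is to compare the two tarballs member by member. This is only a sketch using the standard library `tarfile` module; the helper names and the 50% threshold are assumptions chosen for illustration, and it makes no assumption about what the members inside the archive are called.

```python
import tarfile


def member_sizes(tgz_path):
    """Map each regular file in a .tgz archive to its stored size in bytes."""
    with tarfile.open(tgz_path, "r:gz") as tar:
        return {m.name: m.size for m in tar.getmembers() if m.isfile()}


def shrunken_members(old_sizes, new_sizes, ratio=0.5):
    """List files present in both archives whose size fell below ratio * old size.

    The ratio is an arbitrary illustrative cutoff, not a known threshold.
    """
    return sorted(
        name
        for name, old in old_sizes.items()
        if name in new_sizes and old > 0 and new_sizes[name] < ratio * old
    )
```

Running `member_sizes` on the old and new downloads and passing both results to `shrunken_members` would show whether the shrinkage is spread across every year or concentrated in a few files.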

Another possibility is that in our tweaking of the catalyst-cooperative/dbfread library to make it capable of reading out of zip archives, we messed something up about how referenced data that's somehow "outside" of the DBF table is read and incorporated into the database.
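In DBF-based formats, memo fields keep their contents in a separate memo file next to the table (typically `.dbt` for dBase or `.fpt` for FoxPro). Assuming the binary columns here are memo fields, which is not confirmed in this thread, a first check would be whether each DBF in the zipped input still has a sibling memo file and how large it is. The sketch below uses only the standard library; `memo_report` is a hypothetical helper, not part of dbfread or PUDL.

```python
import zipfile
from pathlib import PurePosixPath

# Common memo-file extensions; assumed to cover the files in question.
MEMO_SUFFIXES = (".fpt", ".dbt")


def memo_report(zip_source):
    """Map each DBF member of a zip to its sibling memo file's size, or None.

    zip_source may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as zf:
        infos = zf.infolist()
    # Index member sizes by lowercased name so matching is case-insensitive.
    sizes = {info.filename.lower(): info.file_size for info in infos}
    report = {}
    for info in infos:
        path = PurePosixPath(info.filename)
        if path.suffix.lower() != ".dbf":
            continue
        stem = str(path.with_suffix("")).lower()
        report[info.filename] = next(
            (sizes[stem + s] for s in MEMO_SUFFIXES if stem + s in sizes),
            None,
        )
    return report
```

A `None` entry only means no memo file was found for that table, which is normal for tables without memo fields. If the memo files are present and large but the extracted columns are still null, that would point toward the zip-reading changes rather than the upstream data.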

Not sure what to do here. As it is, there's no reason to make a separate FERC 1 DB archive, since all the existing data is already archived in the PUDL data release v2.0.0. But we should figure out what happened to all this data, since it might be useful. Also, presumably the FERC Form 2 DB will have similar embedded binary data that we might want to rescue.

zaneselvans commented 2 years ago

Regardless of what the cause of this issue is, it looks like there's no point to doing a separate FERC Form 1 database release using PUDL v0.4.0. I will close this issue and create a new one for debugging what is going on with this now apparently missing data.