catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Load parquet files to a duckdb file #3739

Open bendnorman opened 4 months ago

bendnorman commented 4 months ago

Superset does not support loading data from SQLite, so we want to use DuckDB instead! DuckDB is well suited to our data because it's designed to handle local data warehouses. It's also a cheaper option for Superset than something like BQ, where we'd have to pay for query compute costs.

Success Criteria

### Tasks
- [x] Move script from devtools to package and add as entry point. Add to tests
- [x] Run `parquet_to_duckdb.py` script in `gcp_pudl_etl.sh`. Add lines to copy pudl.duckdb to `s3://pudl.catalyst.coop`, `gs://pudl.catalyst.coop` and `gs://superset.catalyst.coop`
- [x] Create a duckdb schema from our metadata classes
- [x] Load all the parquet files into the .duckdb file (see the sketch after this list)
- [ ] Figure out why foreign keys slow down or freeze the loading step. Is it related to the enum issue?
- [ ] Agree on solution to handling enums
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3812
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3811
- [ ] Bonus: Use SQL Expression Language instead of raw text to specify constraints
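
As a rough illustration of the loading step above (this is a sketch, not the actual `parquet_to_duckdb.py` script; the paths and the file-stem-as-table-name convention are assumptions, and per the task list the real approach builds the schema from our metadata classes first and then loads the parquet data into it):

```python
from pathlib import Path

import duckdb

parquet_dir = Path("parquet")  # hypothetical directory of PUDL parquet outputs
con = duckdb.connect("pudl.duckdb")

for parquet_file in sorted(parquet_dir.glob("*.parquet")):
    table = parquet_file.stem
    # read_parquet() is a built-in DuckDB table function. Creating tables
    # directly from the parquet schema like this skips the metadata-derived
    # schema, so the real script would instead create the tables up front
    # and INSERT into them to preserve keys and constraints.
    con.execute(
        f"CREATE TABLE {table} AS SELECT * FROM read_parquet('{parquet_file}')"
    )

con.close()
```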

bendnorman commented 4 months ago

DuckDB table names have a character limit of 63. We have four tables that exceed 63 characters:

- core_eiaaeo__yearly_projected_fuel_cost_in_electric_sector_by_type: 66
- core_eiaaeo__yearly_projected_generation_in_electric_sector_by_technology: 73
- core_eiaaeo__yearly_projected_generation_in_end_use_sectors_by_fuel_type: 72
- out_eia923__yearly_generation_fuel_by_generator_energy_source_owner: 67

We should rename these resources, enforce the resource name length constraint earlier in the code, and update our documentation.
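
One way to enforce that earlier (a minimal sketch; the 63-character limit comes from this thread, and the function itself is hypothetical rather than existing PUDL code):

```python
MAX_IDENTIFIER_LENGTH = 63  # DuckDB table name limit reported above


def check_resource_name_lengths(resource_names: list[str]) -> None:
    """Raise early if any resource name exceeds the identifier limit."""
    too_long = {
        name: len(name)
        for name in resource_names
        if len(name) > MAX_IDENTIFIER_LENGTH
    }
    if too_long:
        raise ValueError(
            f"Resource names longer than {MAX_IDENTIFIER_LENGTH} characters: {too_long}"
        )
```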

cmgosnell commented 4 months ago

Hmmm, long AEO names. @jdangerx made an AEO schema PR before migrating a lot of the AEO tables. I think the trouble here is that there are so many AEO tables, and so many of them contain the same pieces of information, just broken down by different attributes.

zaneselvans commented 4 months ago

On the topic of making a DuckDB schema with our metadata classes, I'd been thinking we either want to_sqlite() and to_duckdb() methods in place of the generic to_sql() we have now (which only creates SQLite schemas), or maybe a dialect="duckdb" / dialect="sqlite" argument to to_sql() that does the right thing, defaulting to dialect="sqlite" since that's the legacy behavior.
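
A rough sketch of the second option, assuming to_sql() keeps building a SQLAlchemy MetaData object; the signature details and the per-dialect branches are placeholders, not the actual PUDL implementation:

```python
from typing import Literal

import sqlalchemy as sa


def to_sql(self, dialect: Literal["sqlite", "duckdb"] = "sqlite") -> sa.MetaData:
    """Build a SQLAlchemy MetaData for the requested dialect.

    Defaults to "sqlite" to preserve the legacy to_sql() behavior.
    """
    if dialect not in ("sqlite", "duckdb"):
        raise ValueError(f"Unknown dialect: {dialect}")
    metadata = sa.MetaData()
    for resource in self.resources:  # placeholder iteration over PUDL resources
        if dialect == "sqlite":
            ...  # existing SQLite-specific types and constraints
        else:
            ...  # DuckDB-specific handling (enums, foreign keys, etc.)
    return metadata
```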

bendnorman commented 4 months ago

Agreed! That's what I'm working on right now. I've added a dialect="duckdb" option to to_sql(). For now, I'm just going to add some if statements to handle the different dialects, but there is probably a cleaner way to organize the metadata-to-SQL logic for multiple dialects. I might make a MetadataSQLConverter class or type.
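
For the record, that converter idea could look something like this; every class and method name here is illustrative, not existing code:

```python
import sqlalchemy as sa


class MetadataSQLConverter:
    """Shared metadata-to-SQL logic; dialect-specific parts are overridden."""

    def convert(self, package) -> sa.MetaData:
        metadata = sa.MetaData()
        for resource in package.resources:  # placeholder iteration
            self.add_table(metadata, resource)
        return metadata

    def add_table(self, metadata: sa.MetaData, resource) -> None:
        raise NotImplementedError


class SQLiteConverter(MetadataSQLConverter):
    def add_table(self, metadata: sa.MetaData, resource) -> None:
        ...  # current SQLite column types, CHECK constraints, etc.


class DuckDBConverter(MetadataSQLConverter):
    def add_table(self, metadata: sa.MetaData, resource) -> None:
        ...  # DuckDB-specific handling (enums, foreign keys, etc.)
```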

zaneselvans commented 4 months ago

It might also be possible to use SQLAlchemy for this -- if the checks, constraints, etc. can be stated using its generic API, they can then be output to the appropriate dialect. IIRC there was at least one SQLite-specific thing that we had to code manually though.
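
For example, a single generic Table definition can be compiled to DDL for different dialects; the DuckDB half assumes the third-party duckdb_engine package, which registers a "duckdb" SQLAlchemy dialect:

```python
import sqlalchemy as sa
from sqlalchemy.dialects import sqlite
from sqlalchemy.schema import CreateTable

metadata = sa.MetaData()
demo = sa.Table(  # toy table, not a real PUDL resource
    "demo",
    metadata,
    sa.Column("plant_id", sa.Integer, primary_key=True),
    sa.Column("fuel_type", sa.Enum("coal", "gas", "wind", name="fuel_type_enum")),
)

# Render dialect-specific DDL from the same generic definition.
print(CreateTable(demo).compile(dialect=sqlite.dialect()))

# Requires `pip install duckdb-engine`, which registers the "duckdb" dialect.
duckdb_dialect = sa.create_engine("duckdb:///:memory:").dialect
print(CreateTable(demo).compile(dialect=duckdb_dialect))
```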

jdangerx commented 3 months ago

In our inframundo meeting we decided that we can skip the "hard" ones for now and get back to them before we actually release to the public.

zaneselvans commented 2 months ago

Something weird is going on with how big the DuckDB file is. Parquet with snappy compression is expected to be about the same size as the compressed DuckDB file. In Parquet, PUDL only takes up like 1-2GB (minus CEMS), and the DuckDB file is like 13GB, which just seems totally wacked.

bendnorman commented 2 months ago

I think DuckDB uses different compression algorithms, so DuckDB files aren't expected to be as small as Parquet files: https://github.com/duckdb/duckdb/discussions/8162#discussioncomment-6371454

zaneselvans commented 2 months ago

A factor of 10 feels suspicious though. I searched around for comparisons of the DuckDB and Parquet compression ratios and even a couple of years ago it looked like DuckDB should be less than 2x as big as Parquet.

bendnorman commented 2 months ago

Hmm, I thought it could be that we're not specifying VARCHAR lengths, but the docs say that shouldn't matter.

It looks like many blocks in our out_eia__monthly_generators table are uncompressed:

D select compression, count(*) as count from pragma_storage_info('out_eia__monthly_generators') group by compression order by count desc;
┌──────────────┬───────┐
│ compression  │ count │
│   varchar    │ int64 │
├──────────────┼───────┤
│ Uncompressed │  4205 │
│ RLE          │  2714 │
│ Constant     │  1719 │
│ Dictionary   │  1281 │
│ FSST         │   722 │
│ ALPRD        │   182 │
│ BitPacking   │    51 │
│ ALP          │    35 │
└──────────────┴───────┘

Not sure why this is or if it's expected.

Another idea: Maybe our indexes are taking up a lot of space?
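
A quick way to poke at that (a sketch using DuckDB's built-in metadata functions; worth double-checking against the docs for the DuckDB version we're on). DuckDB backs primary key and unique constraints with ART indexes that are stored in the file, so wide composite keys could plausibly account for part of the gap:

```python
import duckdb

con = duckdb.connect("pudl.duckdb", read_only=True)

# Overall block usage for the whole file.
print(con.execute("PRAGMA database_size").fetchall())

# List the indexes that exist (primary keys show up here too). DuckDB doesn't
# report per-index sizes, so a crude check is to rebuild the file without
# keys/indexes and compare PRAGMA database_size between the two versions.
print(con.execute("SELECT * FROM duckdb_indexes()").fetchall())
```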