bendnorman opened this issue 4 months ago
DuckDB table names have a 63-character limit. We have four tables whose names exceed it:
- `core_eiaaeo__yearly_projected_fuel_cost_in_electric_sector_by_type`: 66
- `core_eiaaeo__yearly_projected_generation_in_electric_sector_by_technology`: 73
- `core_eiaaeo__yearly_projected_generation_in_end_use_sectors_by_fuel_type`: 72
- `out_eia923__yearly_generation_fuel_by_generator_energy_source_owner`: 67
We should rename these resources, enforce the resource name length constraint earlier in the code, and update our documentation.
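For what it's worth, enforcing the limit "earlier in the code" could be as simple as a validator on the metadata model. A minimal sketch, assuming a Pydantic-style `Resource` class (the class and field names here are hypothetical stand-ins, not PUDL's actual model):

```python
from pydantic import BaseModel, field_validator

MAX_IDENTIFIER_LENGTH = 63  # the identifier limit we're bumping into


class Resource(BaseModel):
    """Hypothetical stand-in for a PUDL metadata resource."""

    name: str

    @field_validator("name")
    @classmethod
    def name_fits_in_database(cls, name: str) -> str:
        # Fail at metadata-definition time instead of at database load time.
        if len(name) > MAX_IDENTIFIER_LENGTH:
            raise ValueError(
                f"Resource name {name!r} is {len(name)} characters long; "
                f"names must be at most {MAX_IDENTIFIER_LENGTH} characters."
            )
        return name
```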
Hhhhmmm, long AEO names. @jdangerx made an AEO schema PR before migrating a lot of the AEO tables. I think the trouble here is that there are sooooo many AEO tables, and so many of them contain the same pieces of information just broken down by different attributes.
On the topic of making a DuckDB schema with our metadata classes, I'd been thinking we either want to have `to_sqlite()` and `to_duckdb()` methods in place of the generic `to_sql()` we have now (which only creates SQLite schemas), or maybe add `dialect="duckdb"` and `dialect="sqlite"` arguments to `to_sql()` that do the right thing, and have it default to `dialect="sqlite"` since that's the legacy behavior.
Agreed! That's what I'm working on right now. I've added a `dialect="duckdb"` option to `to_sql()`. For now, I'm just going to add some if statements to handle the different dialects, but there is probably a cleaner way to organize the metadata-to-SQL logic for multiple dialects. I might make a `MetadataSQLConverter` class or type.
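As a rough illustration of the if-statement approach (not PUDL's actual implementation; `package.resources`, `field.to_sqlite_type()`, and `field.to_duckdb_type()` are hypothetical names), the branching might look something like:

```python
import sqlalchemy as sa

SUPPORTED_DIALECTS = ("sqlite", "duckdb")


def to_sql(package, dialect: str = "sqlite") -> sa.MetaData:
    """Build a SQLAlchemy schema for ``package`` targeting one SQL dialect.

    ``package`` is assumed to expose ``.resources``, each with ``.name`` and
    ``.fields``, hypothetical attributes used here only for illustration.
    """
    if dialect not in SUPPORTED_DIALECTS:
        raise ValueError(f"Unsupported dialect: {dialect!r}")
    metadata = sa.MetaData()
    for resource in package.resources:
        columns = []
        for field in resource.fields:
            if dialect == "duckdb":
                # DuckDB branch: map fields onto DuckDB-friendly types.
                col_type = field.to_duckdb_type()  # hypothetical helper
            else:
                # SQLite branch: keep the legacy behavior.
                col_type = field.to_sqlite_type()  # hypothetical helper
            columns.append(sa.Column(field.name, col_type))
        sa.Table(resource.name, metadata, *columns)
    return metadata
```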
It might also be possible to use SQLAlchemy for this -- if the checks, constraints, etc. can be stated using their generic API and then output to the appropriate dialect. IIRC there was at least one SQLite-specific thing that we had to code manually though.
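For example, here's a minimal sketch of that idea: the same generic `Table` definition compiled to DDL for two dialects. The table and column names are made up, and the DuckDB half assumes the third-party `duckdb_engine` package is installed, since SQLAlchemy doesn't ship a DuckDB dialect itself.

```python
import sqlalchemy as sa
from sqlalchemy.dialects import sqlite
from sqlalchemy.schema import CreateTable

# A toy table with a check constraint, defined once with the generic API.
metadata = sa.MetaData()
plants = sa.Table(
    "core_example__plants",  # made-up table name for illustration
    metadata,
    sa.Column("plant_id_eia", sa.Integer, primary_key=True),
    sa.Column("plant_name_eia", sa.Text, nullable=False),
    sa.Column("capacity_mw", sa.Float),
    sa.CheckConstraint("capacity_mw >= 0", name="capacity_nonnegative"),
)

# DDL for SQLite using the dialect that ships with SQLAlchemy.
print(CreateTable(plants).compile(dialect=sqlite.dialect()))

# DDL for DuckDB, assuming duckdb_engine is installed (it registers a
# "duckdb" dialect with SQLAlchemy).
engine = sa.create_engine("duckdb:///:memory:")
print(CreateTable(plants).compile(engine))
```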
In our inframundo meeting we decided that we can skip the "hard" ones for now and get back to them before we actually release to the public.
Something weird is going on with how big the DuckDB file is. Parquet with Snappy compression is expected to be about the same size as the compressed DuckDB file. In Parquet, PUDL only takes up like 1-2 GB (minus CEMS), and the DuckDB file is like 13 GB, which just seems totally whacked.
I think DuckDB uses a different compression algorithm, so DuckDB files aren't expected to be as small as Parquet files: https://github.com/duckdb/duckdb/discussions/8162#discussioncomment-6371454
A factor of 10 feels suspicious though. I searched around for comparisons of the DuckDB and Parquet compression ratios and even a couple of years ago it looked like DuckDB should be less than 2x as big as Parquet.
Hmm, I thought it could be that we're not specifying VARCHAR lengths, but the docs say that shouldn't matter.
It looks like many blocks in our `out_eia__monthly_generators` table are uncompressed:
```
D select compression, count(*) as count from pragma_storage_info('out_eia__monthly_generators') group by compression order by count desc;
┌──────────────┬───────┐
│ compression  │ count │
│   varchar    │ int64 │
├──────────────┼───────┤
│ Uncompressed │  4205 │
│ RLE          │  2714 │
│ Constant     │  1719 │
│ Dictionary   │  1281 │
│ FSST         │   722 │
│ ALPRD        │   182 │
│ BitPacking   │    51 │
│ ALP          │    35 │
└──────────────┴───────┘
```
Not sure why this is or if it's expected.
Another idea: Maybe our indexes are taking up a lot of space?
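A couple of quick checks that might help narrow that down (the file path below is illustrative): DuckDB reports its own block usage via `PRAGMA database_size`, and `duckdb_indexes()` lists the indexes that exist.

```python
import duckdb

con = duckdb.connect("pudl.duckdb")  # illustrative path

# Overall block/file usage as reported by DuckDB itself.
con.sql("PRAGMA database_size").show()

# The indexes that exist; lots of multi-column indexes over wide string
# keys could plausibly account for a chunk of the extra space.
con.sql("SELECT * FROM duckdb_indexes()").show()
```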
Superset does not support loading data from SQLite, so we want to use DuckDB instead! DuckDB is well suited for our data because it's designed to handle local data warehouses. It's also a cheaper option for Superset, because with something like BigQuery we'd have to pay for query compute costs.
Success Criteria

- A `.duckdb` file is generated & distributed to S3/GCS in nightly builds
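For reference, a minimal sketch of how Superset (or anything else speaking SQLAlchemy) could point at the distributed file, assuming the third-party `duckdb_engine` driver is installed; the path is illustrative:

```python
import sqlalchemy as sa

# Superset connects to databases through SQLAlchemy URIs; with
# duckdb_engine installed, the nightly-build file can be registered
# with a URI like this one.
engine = sa.create_engine("duckdb:///pudl.duckdb")

with engine.connect() as con:
    print(con.execute(sa.text("SELECT 1")).scalar())
```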