Output PUDL as Parquet as well as SQLite

zaneselvans commented 11 months ago

We've heard feedback from users that the monolithic 12GB PUDL SQLite DB is a bit chonky for download an integration into other workflows when they only need a couple of tables, and that some kind of remote, cloud-optimized outputs would be very handy. One way to address these needs is to publish Parquet files representing the tables in our DB.

Pros of Parquet:

can be read remotely very efficiently due to their chunked row-groups and columnar format.
compressed internally so take up very little space (around 3 GB for everything in the uncompressed 12 GB SQLite DB)
provide rich PyArrow data types that can be read into Pandas, Dask, and other dataframe libraries without requiring any additional schema information.
are platform independent and can be used by many tools
optimized for fast read/write and analytical workflows
can be queried like database tables using DuckDB

Cons of Parquet:

no foreign key relationships and much less in the way of constraints that ensure everything is internally consistent
will necessitate a large directory of ~200 files, which users could easily end up mixing and matching from different versions, resulting in internal incompatibilities.
We don't currently know how or if we can collect enough information about what data is being used, by who, and for what purpose through our completely open, file-based data distribution system. AWS S3 offers a bewildering array of logging tools, which we haven't really explored. See: logging with S3, CloudTrail Logging, CloudWatch Monitoring, ServerLogs

On balance, we've decided that this would be a valuable addition to the data we publish, and should be relatively straightforward given the infrastructure we already have in place.

Current Situation:

After merging #3222:

We can create a pyarrow schema for and output any PUDL Resource as a Parquet file.

The pyarrow schema uses this mapping of simplified data types:

FIELD_DTYPES_PYARROW: dict[str, pa.lib.DataType] = {
"boolean": pa.bool_(),
"date": pa.date32(),  # date with a resolution of days.
"datetime": pa.timestamp("ms", tz="UTC"),
"integer": pa.int32(),
"number": pa.float32(),
"string": pa.string(),
"year": pa.int32(),
}

For fields of type string that also have ENUM constraints we create a categorical with:
```
pa.dictionary(pa.int32(), pa.string(), ordered=False)
```
We set the nullable attribute of the pyarrow field with:
```
nullable=(not self.constraints.required)
```
ALL 190 Resource definitions combined only add up to ~1.2GB of snappy compressed Parquet output (excluding the ~5GB, billion-row epacems table).

Low Hanging Fruit

Only a little bit of work is required to start generating Parquet outputs in our nightly builds so we can experiment with them internally and provide private beta access to a few users like @arengel & @jrea-rmi at RMI and @grgmiller or @gschivley
After some internal discussion, we're not yet ready to commit to publishing Parquet files via the AWS Open Data Registry, because we want to explore using a thin registration façade that would allow us to better understand who our users are and how they are using the data, and potentially connect table-level data usage to individual users, while still providing free and open data access.

- [x] Test `to_pyarrow()` schema generator on all defined Resources, identify problems.
- [x] In a notebook, try to write all Resources with working PyArrow schemas to Parquet, identify problems.
- [x] Address problems identified in PyArrow schema generation and Parquet output above.
- [ ] #3224
- [x] Once #3224 is finished, revert `to_pyarrow()` workarounds.
- [ ] #3296
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3362

See #3246 for some more potential subsequent Parquet optimizations.

zaneselvans commented 10 months ago

Initial Exploration

Overall the initial exploration was positive:

Of our 200 Resources 123 have no trouble creating a PyArrow schema.
There are only two kinds of errors, which require investigation:
- expected bytes, NoneType found (this is due to fields or tables lacking a description field)
- can only join an iterable (this is due to some table lacking a primary_key)
Those 123 assets with schemas are also able to be written out to Parquet.
This 2/3 of the tables only accounts for ~1/3 of the data (we expect about 3GB total, and it makes 800MB)
The PyArrow schemas we're currently creating don't take into account ENUM and NULLABLE or other constraints.
There's no internal row-group blocking so large tables wouldn't be as efficient to access as they could be.

File ~/code/catalyst/pudl/src/pudl/metadata/classes.py:623, in Field.to_pyarrow(self)
    621 def to_pyarrow(self) -> pa.Field:
    622     """Return a PyArrow Field appropriate to the field."""
--> 623     return pa.field(
    624         name=self.name,
    625         type=self.to_pyarrow_dtype(),
    626         nullable=(not self.constraints.required),
    627         metadata={"description": self.description},
    628     )

File ~/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pyarrow/types.pxi:3394, in pyarrow.lib.field()
File ~/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pyarrow/types.pxi:2053, in pyarrow.lib.ensure_metadata()
File ~/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/pyarrow/types.pxi:1919, in pyarrow.lib.KeyValueMetadata.__init__()
File <stringsource>:15, in string.from_py.__pyx_convert_string_from_py_std__in_string()
TypeError: expected bytes, NoneType found

zaneselvans commented 10 months ago

See also #1379

zaneselvans commented 10 months ago

I was able to fix all of the Resource.to_pyarrow() issues pretty easily! It was all due to:

incomplete metadata (no descriptions)
an assumption by to_pyarrow() that all tables would have a primary_key which is not the case.

After minor tweaks to avoid these issues, we can create (basic) PyArrow schemas for all of our tables, and reading them out of the PUDL DB, we can write them out as Parquet files:

190 files
1.2GB on disk (more than 10x compression compared to the full SQLite DB, and more than 2x compression vs. gzipped DB)
Only 2 tables are larger than 100MB (and I'm suspicious one of them is bigger than it should be)
Only 20 tables are larger than 10MB.
Reading everything from SQLite and writing to Parquet in a single process took 17 minutes.
Note that this output doesn't use categorical columns, value constraints, or produce any kind of organized internal row groups.

jdangerx commented 10 months ago

This is great!

It sounds like you want to get the low-hanging fruit as soon as we can, and then ponder the medium-hanging fruit at some slightly later date - if that's your plan I think it makes sense!

WRT the low-hanging fruit - I like the idea of a Python script that munges output to Parquet and gets called from gcp_pudl_etl.sh. That would let us incrementally port the gcp_pudl_etl.sh logic into a Python script also, which might be nice.

To me, the goal of the MVP is "get PUDL-as-Parquet onto the Internet for users to access." So, of the bullets in your MVP list, I think we could probably punt on the Kaggle piece and updating the .gz to .zip. But, those also seem like pretty small changes so that distinction is a bit academic 🤷 .

zaneselvans commented 10 months ago

My thought with including the Kaggle/ZIP bit is that dealing with their (and Zenodo's) need for a flat file hierarchy is an issue we need to address that affects the outward "API" of working with the data we're publishing, and given the trouble our windows users have had working with gzip it seemed like the newly archived subdirectory should use ZIP rather than gzip... and having several different file archiving systems going on at once seems silly.

If we make all those changes, then the remaining potential integration is entirely internal, and won't change how the data presents to users (except maybe it'll have better internal metadata and row groups for more efficient access).

I'd like to see if some combination of RMI + ZeroLab are interested in supporting ~10 hours of work to get the basic version of this up and running, and if so make it a high priority task.

bendnorman commented 10 months ago

Is there a benefit to having a post ETL script that converts DB tables to parquet files as opposed to an IO Manager that writes dataframes to parquet?

zaneselvans commented 10 months ago

@bendnorman I just think it's much easier to implement / less likely to blow up because it doesn't meddle with the guts of the existing ETL process. I basically have it in a notebook already so it seems like a "quick win" to get the Parquet out there in the world in a minimalist way that we can iterate on.

zaneselvans commented 10 months ago

Only very partially completed by #3222

TrentonBush commented 10 months ago

I feel strongly that Parquet files in the cloud are not the right direction for general data distribution. To be clear, they seem like a superior solution from an engineering perspective, for all the efficiency reasons outlined here. But from a product perspective, the lack of telemetry is simply a dealbreaker IMO.

Hosting the data in a cloud database (BigQuery is one example) allows us to see what users actually use and care about -- who are they, what queries did they run, what tables did they touch, what columns they pulled, what filters they applied, what time/space level do they aggregate to, etc. PUDL currently suffers from near total blindness with respect to our users and their needs, and that info would be invaluable in

Supporting our claim to impact, helping to secure grant funding to keep this whole project thriving
identifying users (via login-wall) much more precisely than nearly-useless IP addresses, greatly helping both comdev and bizdev and feeding back into # 1
determining what parts of our ecosystem need more, or less, development attention

rousik commented 10 months ago

First, lets start with some technical discussion:

I've tinkered a little with the io manager changes for switching between sqlite/parquet outputs and it should be fairly straightforward. We might even use that directly instead of doing post-processing. What I would suggest is use of env variables as feature flags so that we can on-demand:

turn on writes to parquet files (in addition to sqlite)
control the source for data reads (sqlite by default, with ability to switch to parquet)

This way, we can easily enable emission of parquet files, w/o altering current ETL behaviors (sqlite is authoritative source for subsequent reads) and will also allow us to do easy side-by-side testing (i.e. test whether anything changes when we switch autoritative source for reads from sqlite to parquet).

Now, I agree with what @TrentonBush outlined here in that BigQuery would give us much better product visibility and might offer best of both worlds (i.e. efficient "remote" data access + visibility/instrumentation "for free"). I'm not sure if this is necessarily either/or choice. Publishing data as parquet files might be a good intermediate step for internal uses (e.g. testing/validation/...), and AFAIK we could then feed those files into BigQuery directly as a final product. However, I'm not sure I understand the client space well enough to make a judgement what is the best course of action here. From infrastructure perspective, I feel that parquet would give us incremental benefits for relatively little effort (and remove some burdens associated with the current sqlite format), while we can still aim for eventually publishing our data on BigQuery.

zaneselvans commented 10 months ago

It seems like it would be faster to emit the Parquet files during the ETL since the dataframes will already be in memory, rather than needing to read them out of SQLite again later. However, won't this also mean messing with $DAGSTER_HOME in CI, which doesn't currently work because we're doing the whole ETL with in-process execution so there is nothing in $DAGSTER_HOME? Or do write the Parquet out directly to $PUDL_OUTPUT/parquet/ during the ETL? Do we have to do any work to keep track of which assets are supposed to be written out to Parquet+SQLite, vs. which ones just use the default IOManager?

I would absolutely love to get more information about who is using our data, which data is getting used, and how much, but this sounds like a move away from publishing open data. Which is almost certainly the better business choice, but is it what we want to do?

Do we want to create a system where the open data is intentionally harder to work with than the proprietary data? How much usage information can we get from S3? We should figure out how to understand if this is true, but I think we lose a lot of potential users as soon as they have to set up any kind of cloud billing or authentication.

Being able to just do:

df = pd.read_parquet("s3://pudl.catalyst.coop/nightly/parquet/out_eia923__monthly_generation_fuel_by_generator.pq")

or

CREATE TABLE dhpa
AS SELECT *
FROM 's3://pudl.catalyst.coop/nightly/parquet/out_ferc714__hourly_predicted_state_demand.pq';

is pretty nice.

Can we easily set up a free registration wall that doesn't require a human in the loop for BigQuery? That was always a big source of friction with the JupyterHub. If we deploy the data both to BigQuery and a public bucket of Parquet files, how much would the Parquet cannibalize demand for BigQuery? Does it serve distinct user needs?

jdangerx commented 10 months ago

Parquet technical stuff:

I think we'd write directly to $PUDL_OUTPUT/parquet/* during the ETL. If we are planning on making the sqlite and parquet files contain the same data, then the pudl_sqlite_io_manager would write to both every time. If we aren't, then we'd have to have a pudl_sqlite_and_parquet_io_manager to differentiate between the two, and we'd then have to go through and change every resource according to our desired behavior.

BigQuery:

I could imagine someone preferring the BigQuery technical hoops over the Parquet technical hoops, but to me they're somewhat similar. I think we need to figure out:

are people actually interested in using BigQuery to access our data? the advantages to them are that they don't have to set up a local dev environment & they don't have to download data that is not in the output of their query... anything else?
can we actually look at the whole query history of the project? what analytics are available to us if we distribute data via BigQuery?

I think it's worth some exploration, we could potentially manually publish our data on BigQuery once just as a pilot to see who bites. And if people seem interested we could invest actual time into updating it automatically every time nightly builds pass, or every time a stable build passes.

rousik commented 10 months ago

One more question. Right now we distinguish the database and table hierarchies, e.g. table 'foo' inside pudl.sqlite vs ferc714.sqlite. If we switch to parquet for everything (not just pudl.sqlite), do we want to retain this hierarchy, e.g. use $PUDL_OUTPUT/parquet/pudl/${table_name}.pq instead of flat hierarchy that removes database part. Deeper hierarchy might be useful to separate out core tables from stuff like EPA CEMS.

zaneselvans commented 10 months ago

@rousik Right now I'm only thinking about the PUDL DB tables. The XBRL & DBF derived SQLite DBs have much messier data types, and honestly most people should probably stay away from them because the data is only semi-structured. I think there would be a couple thousand files/tables if we exploded them all into individual parquet files.

We also have to maintain a flat file hierarchy in some contexts, like Kaggle and Zenodo, which will probably mean we have to zip up subdirectories for distribution on those platforms, since we'll want some kind of separation between the small number of top-level files, and the large number of table-level files in the cloud buckets.

But maybe we can set ourselves up for a smooth transition to a future where we're using a "schema" (in the postgres sense) that's similar for both the object storage prefixes and inside a database that supports that kind of hierarchical organization.

@jdangerx I think that given the free, public S3 bucket from the AWS Open Data Registry, the technical barrier to working with the Parquet files is significantly lower than for any other service that will require people to create an account with a cloud provider and set up billing or authentication. Right now anybody can cut and paste some python that will tell pandas to query any Parquet file we publish in S3 (e.g. via Kaggle, Google CoLab, a local notebook). Similarly they can cut and paste the DuckDB Parquet examples but with our S3 URL and get the table locally if they like SQL.

If we get the POSE grant I think one thing we should do in the user outreach is try to understand which tools & technologies various classes of users are comfortable with.

TrentonBush commented 10 months ago

I agree that there are tradeoffs between usage visibility and barriers to entry and that we should strive to keep barriers low -- that is the whole point of PUDL.

But I see a login requirement as a move supportive of Open Data rather than opposed to it: Making an account is free. All the output data remains free. All the code remains open source. All the input data remains archived and free. The only new barrier is to require an email address to access the data. That one tiny investment by users enables a world of benefits to us that circle right back around and benefit them.

Collecting basic user information helps avoid all three of the worst case scenarios in Open Data: The third-worst Open Data is that which doesn't exist because nobody but the users knew it was needed. The second-worst Open Data is that which gives false hope before revealing itself as abandonware due to lack of funding. And the worst Open Data of all is that which is well funded and diligently and laboriously maintained but nobody actually uses. I think users would gladly donate an email address to prevent these problems.

zaneselvans commented 10 months ago

I guess I don't understand what this login system would look like or how people would be able to get at the data behind it in a way that doesn't either hide the data behind a technical barrier that prevents it from being easily integrated into other workflows or require the use of cloud authentication systems that a lot of people will give up on. I agree many folks would be happy to trade an email address for access, but then what happens?

zaneselvans commented 10 months ago

It looks like both Server Access Logs and CloudTrail logs can be used to identify individual S3 object access events, including the IP of the requester:

bendnorman commented 10 months ago

Like y'all have said I think there is a tension between accessibility and visibility. Maybe there is an option I haven't thought of though that is a good balance of the two.

Also, we aren't flying completely blind with the parquet files in s3. We can track which tables are being pulled and roughly where in the users are. However, I think given PUDL is a pretty niche project, these types of metrics won't be super useful. Building deeper relationships with a smaller number of users and understanding how they use the data is more productive than just tracking total downloads from the s3 bucket.

I like the parquet idea because it's a simple technical solution to access problems our users are facing right now. They will also give us information about which tables users are accessing.

I think an ideal tool would:

Require users to easily sign up for an account without needing to set up a billing profile on a cloud provider. The folks at openaddresses created a custom registration wall to gain insight into usage and prevent abuse:
Allow them to download data as sqlite or parquet files to use in CI and local development
Maybe provide a SQL interface so we can better understand how people are querying the data
Allow us to throttle compute and egress usage so we don't get slapped with unexpected bills
Requires minimal effort on our end to set up and maintain

Here are a couple of options I can think of:

Distribute parquet files in our AWS supported s3 bucket. Low barrier, free but we don't get a great sense for who is using our data.
Create a registration wall that automatically adds the person to an IAM group in GCP that can pull parquet files from a GCS bucket. Medium high barrier, we'd have to cover the costs but we'd get a better sense for who is using our data. I'm also not sure if it's possible to throttle bucket egress.
Distribute out data on the Snowflake marketplace. This would be convenient for us because 1) built to distribute data as a product 2) I don't think it costs us anything 3) it might allow us to track who is using the data 4) gives us a wider audience. I think the biggest downside of this option is it requires users to understand how to manage and pay for Snowflake which makes the data much harder to access.
@e-belfer had the idea to keep using the s3 bucket but have users fill out a form to receive a link to the free s3 bucket. This would allow us to learn more about who is accessing our data but might not catch everyone because the "registration" isn't actually required to access the data.

I think option 1 is the best we've got given our funding and desired level of accesibility. I'm bummed there doesn't seem to be an obvious technical solution to keep the data accessible and that allows us to better understand our users. I'm comfortable with a world where we distribute parquet files in a free s3 bucket that provides some basic metrics paired with regular user outreach and feedback. We don't really have a structured feedback process in place but it should be a part of any community development grant we apply for.

I'll ask the AWS open data program folks if they have any recommendations.

jdangerx commented 10 months ago

@bendnorman do you know what the total egress of our S3 bucket is, currently? I wonder how much it would cost to cover that out of our own pocket.

It wouldn't be too tricky to set up a service that:

makes people authenticate with email/pw, API key, or OAuth
gives authorized users a signed url to download the data they request
is rate-limited internally to limit egress fees

bendnorman commented 10 months ago

@jdangerx our s3 bucket usage hasn't been super consistent. We've averaged about $160 a month on the AWS account. 96% of the s3 cloud costs have been on egress.

The signed URL authentication workflow seems promising! I'm more comfortable with a simple registration workflow and a rate-limited bucket. If we did go down that path what would we do about our other access modes: datasette, kaggle, zenodo? Are you worried users would hit the registration wall and try to grab the data via one o the other access modes when their user case is probably best suited for a cloud bucket?

zaneselvans commented 10 months ago

For @gschivley's use case, I think he needs to include a small subset of our data as an input into PowerGenome, which is being used by his users, who ideally don't need to know anything about PUDL, and currently the need to download our now 14GB SQLite DB makes that downstream integration unworkable. In any of these authenticated / registration-wall scenarios, he'd either need to bundle his own authentication credentials with the software or all of the PowerGenome users would need to set it up for themselves, which doesn't seem like a very gracefully scalable UX. In that scenario given the open licensing, I would expect that someone like Greg would instead download our data, pull the small subset that they need out of it, and redistribute it as a static asset, which would then no longer benefit from our regular updates or versioning or clear provenance.

gschivley commented 10 months ago

@zaneselvans summarized my feelings pretty well. A 14GB SQLite db where my users don't touch most of the data is tough. At one time I did ask users to register for an EIA API key but then shifted to downloading the JSON file. I plan on writing code to download the necessary portions of relevant tables and cache it locally. In subsequent runs the code will (hopefully) check to see if there have been any updates.

My first preference is to have an openly accessible set of parquet files. But if you can figure out a clean authentication/registration system it wouldn't be the end of the world for users to register and obtain an API key. I might distribute a single key within Princeton but wouldn't bundle it with the code or otherwise share it on my repo.

Are any of you familiar with the wind/solar generation data available at renewables.ninja? They have an API that allows for limited downloads without authentication and a much higher limit with registration/authentication. Maybe a similar system could work here? It would allow all queries to be logged and provide tracking for regular users while also allowing limited use without the friction of registration.

grgmiller commented 9 months ago

For our purposes so far, we've been able to manage with the sqlite database file. We recently added functionality to read the sqlite database file directly from s3 when running our OGE pipeline (https://github.com/singularity-energy/open-grid-emissions/pull/338), and this seems to be working well for us so far.

I actually liked how in the 2023.12.01 data release (which we are still using) how all of the CEMS parquet files were consolidated to a single file instead of having a bunch of smaller sub files in different directories. IMO it made it easier to access and read.

zaneselvans commented 9 months ago

@grgmiller We're going to keep producing the EPA CEMS in a single Parquet file (with state-year row groups internally) and it'll continue to be distributed alongside the SQLite DBs for the time being since that's the only form we distribute the CEMS in.

And we'll also include that same consolidated EPA CEMS parquet file in the pure-parquet outputs we're experimenting with, so if someone is working with Parquet, they can access all of the data using the same pattern.

catalyst-cooperative / pudl