RDBMS vs. Data Packages / Parquet as Authoritative Output

zaneselvans commented 5 years ago

I want to pose a high level design question to folks here for discussion.

In the beginning of this project, we thought that the database itself, which our software currently populates, was going to be the primary output, and that folks would be able to boot it up and access it directly via a variety of tools. As we've worked on this longer and interacted with a variety of people who might benefit from the data we're compiling, this design decision has run into a few snags:

The technical overhead associated with installing all of the required software, getting PostgreSQL up and running with the right permissions, updating the datastore with original data from the various agencies, and then accessing the data in database form is... a bit much, it turns out, for most people. Many even semi-technical users just want to grab the data and go, using whatever tools they are already familiar with.
Some of the data that we now want to make available to users (I'm looking at you, EPA CEMS, ISO/RTO, and FERC EQR) are much larger than the EIA & FERC forms that we started out with. Like, 1000 times larger. This makes re-loading the database very time consuming, and means that more specialized tools are required to work with the data in bulk (e.g. dask and other out-of-core computational options).
Some users are only interested in a subset of the data -- in space, time, by data source, type of plant, etc.

As a result, we've started encapsulating the outputs as data packages,(#211) or in the case of larger data (EPA CEMS) maybe as compressed columnar data (aka Apache Parquet) with the intent of distributing those data products for others to access and work with more easily, and for potential archiving on sites like DataHub.io, Zenodo, and OpenEI. The plan right now is that those packages will be available both in the form of well normalized RDBMS tables, and pre-compiled dataframe like outputs -- allowing either immediate analysis, or easy instantiation of a database containing the data, without requiring the full data processing pipeline to be run. We're also looking at hosting these static outputs on Pangeo or another cloud based computational platform for easy access and in-situ use.

However, at the same time, we have continued to develop and maintain the underlying database, and build our own analyses and derived value outputs from data stored in the database.

It seems to me that this could create problems in the future, because we (and anyone else running the database) would be building upon one form of the processed data, and anyone who has downloaded the data packages or other static outputs would be building their applications and analyses upon a slightly different form of the data (hopefuly it would be identical but... famous last words!).

In order to ensure that we and everyone else are all working from the same fundamental data products, might it be preferable to treat the data packages and parquet files (which will ultimately be versioned and assigned DOIs and archived along with their source data and the software release used to generate them) as the primary data, and build all the analysis and further derived works off of them?

This wouldn't be such a huge departure from what we're currently doing -- after the transform step, rather than loading the data into the database, we would write it out to disk directly in datapackages or parquet files, along with all of the associated metadata. The database schema requirements that are in effect being "tested" right now by virtue of the load process could instead be done by validating the data packages (for data type, foreign key constraints, non-nullable fields, unique values, etc.) Subsequently, derived values or other analysis could be done either by reading the data packages into memory as dataframes, or by using them to populate a database (which are what are tacitly going to be expecting most of our users to do anyway) for access similar to what we're doing now.

Otherwise I'm a little big worried we may end up committing ourselves to maintaining and providing support to two different output streams -- the database we're currently using, and the data packages that many other people will end up using.

Do others have thoughts on this? It seems like a good thing to discuss before (fingers crossed!) we get deep into the work on the Sloan Foundation grant work. @cmgosnell @gschivley @karldw @alanawlsn

gschivley commented 5 years ago

Has anyone tested performance of parquet to Postgres? How many Postgres specific features are we using? I’m wondering if data files with code to load into SQL database of choice would be viable. Could I build a SQLite DB?

zaneselvans commented 5 years ago

I don't think any of us have played around with other databases for the data. Originally when we were thinking that the DB would be the authoritative source, we wanted to take advantage of the rich native data types that Postgres supports, and we started out with the idea of using SQLAlchemy's ORM to define more rich relationships between the utility, plant, and generator entities, but we never ended up going very far in that direction, so my guess is that there aren't a lot of Postgres specific features in use right now -- and in any case, if a lot of people are going to use the data package / Parquet file outputs (which does seem to be the case) then I don't think we'll be able to preserve all of those details anyway in the outputs. Certainly you could slurp up whatever is output in data packages into SQLite. It's a one-liner to do so, because the schema for the database is embeded in the datapackage.json file that contains the package's metadata.

karldw commented 5 years ago

@gschivley, I'm guessing parquet to postgres will only be slightly faster than the current approach of zipped csv to postgres. We're spending most of the time re-rendering the data as an in-memory csv and postgres COPYing it into the database.

Two postgres-specific features that come to mind are window functions and timezones. Timezones aren't a big problem, since we're planning to always store the data as UTC. The lack of window functions can be a headache, depending on your analysis flow.

A couple of other things to throw out there, in case you're trying to build a SQLite database to access the data from outside python:

Apache Drill can query parquet files with SQL syntax (there's an sqlalchemy dialect, and an R package)
R is working on support for Parquet and Arrow. Currently the way to read parquet files is Drill or through python.

@zaneselvans, we might think a little more about when/how to invalidate the current set of parquet files. For example, there's no code that deletes the existing parquet files if the ETL steps change.

zaneselvans commented 5 years ago

@karldw I feel like the issue with the validity / invalidity of the parquet files is probably limited to the development process... As we go to distributing the data packages and parquet files for use, we'll need to integrate that packaging (or at least a subset of it) into the testing scripts. For files out in the wild, I've been imagining bundling the input data, a versioned release of the code, and the resulting outputs in a single archive with a DOI, which along with the environment.yml and those particular versions of the software ought to be able to reproduce exactly the same outputs (haha, famous last words). So a "normal" user who just wants to grab the data and go would usually, I think, download the most recent versioned release of the data they're interested in, and know what vintage of the SW it was generated from. If folks want to clone the repository and run the whole pipeline themselves, would it be unreasonable to leave it up to them to update their local outputs?

karldw commented 5 years ago

Sure, that sounds fine. I'm a fan of more opinionated software that prevents users from making mistakes, but it's not unreasonable to have users update their own outputs.

karldw commented 5 years ago

For all the non-CEMS tables, is the thought to have one parquet file per table? That sounds reasonable to me, but you all know the EIA and FERC1 tables much better than I do.

zaneselvans commented 5 years ago

I was imagining that all the smaller tables would be distributed as tabular data packages (CSV files with JSON metadata wrappers) since that's a rich, transparent format, and the benefits of compressing and using a columnar format are small for those other tables. We can wrap parquet files in generic data packages, but the richness of metadata that can be included in those packages is diminished. And honestly I don't yet know enough about how adds metadata / schemas to collections of parquet files.

If all the tables were stored as parquet files, I imagine that there aren't any in the existing FERC/EIA tables that would need to be partitioned based on size. They should compress down to a few GB. But for the long skinny tables -- CEMS, EQR, and LMP -- it seems like the structure simple, and the size is large, so using something like parquet would offer a big improvement.

I think the way Pangeo works, too, recommends using files-on-disk read into dataframes, rather than a database as the data store. It allows you to dynamically spawn a cluster of identical containers/machines which all have identical access to the data via a shared bucket. Dask then allows you to slurp up all of that data in parallel -- with each machine running a separate read on a separate partition of the data, and storing that data in memory on that machine -- while the overall manipulation of the concatenation of these dataframes is coordinated via a pandas-like API in a Jupyter Notebook. You build up a collection of operations to apply to this segmented dataframe, and then tell Dask to execute it, and the task manager hands out the appropriate work units to all these different machines, and eventually assembles the result into a (hopefully) smaller-than-memory pandas dataframe that you can actually play with dynamically.

If one wanted to do all this with data from a RDBMS, then all of those spawned machines would need to either have their own copy of the DB running locally, or be able to connect to a cloud DB like Redshift, or do a SQL-like query against a partitioned collection of data in buckets using something like Amazon's Athena interface. But for more complicated collections of relationships, like we have with the 860/923 data, that feels like it'll be a pain, and so far as I've seen, the big/fast cloud DB options are expensive... and aren't supported by Pangeo anyway (which is made by a bunch of smart people trying to solve exactly this same problem and I trust them to be doing something right).

But at a more basic level here, I'm asking whether -- given that we are committing to packaging data up and distributing it, and that we expect that to be the data most people use most of the time, does it make sense for us to maintain and use another data storage system that we use, or is that just going to be a maintenance headache? We'll still provide a script to slurp the packaged data into your database of choice through SQLAlchemy (and any DB back end ought to work, since we can't really encode complicated DB specific features in the tabular data packages). It would just be slurp the files, rather than run the pipeline. Or, if you're a glutton for punishment and want the very newest data w/ the development code, you can run the pipeline and then slurp the files.

cmgosnell commented 5 years ago

I'm a little late to this party, but here I go... I think the general framework of having the data packages being the 'real' data set instead of the database tables makes sense to me. I'm mostly concerned with the implementation process and some of the unknowns with using data packaging. To Greg's point about sqlalchemy, I don't think that there is any functionality that we are currently using that will be lost with this transition, but I'm not 100% sure about that.

In terms of implementation, I feeling like it might be a good idea to temporarily amend the ETL process so that it would either generate the existing database and/or generate the data packages based on the post-transformed dataframes. This way we could generate the data packages and poke at them and amend the outputs to use the data packages instead of the db, while the db isn't thoroughly broken. If the full switch over to the data packages takes a while then we'd have dual systems to update and maintain which would be a pain. But if the full switch take a while w/o doing this the whole platform would just be broken for a while. The alternative is that one of us could do all of this all at once, which would put the process in a little bit of a black box for a bit.

karldw commented 5 years ago

This comment doesn't really touch on RDBMS vs data packages, but while we're here, it's worth considering who will be reading the data, and what tools they'll be using.

For example, the current parquet files work fine when written and read by Python and Dask, but can't be read by Drill because Drill doesn't support UINT16 types. People might also want to work with Amazon Athena, which has its own type restrictions. Thoughts?

zaneselvans commented 5 years ago

Annnnd... this is why everyone is still using CSV files. Gah. Is there no agreed upon set of base datatypes that everyone supports? And if so, does it cover what we need it to cover? Can we just adhere to that standard? Really I was just trying to see what working with Parquet files was going to be like when I wrote that conversion script.

karldw commented 5 years ago

How about this:

Integers should be signed 4-byte (32 bit) integers by default, or signed 8-byte if you need to store a really big number
Floating point numbers are fine as either floats or doubles (4 or 8 bytes)
Boolean is fine
Dates are fine
All timestamps are either naive or, preferably, UTC (support for explicit timezones seems weak)
Categories/factors should be converted to their string representation
- Edit: Drill seems to support categorical columns, despite not being mentioned in the documentation. I'm not sure about Athena.
Strings are fine (UTF8-encoded VARCHAR)

Arbitrary-precision numbers, binary strings, dictionaries, and lists are also available, but I'm not sure they're useful for PUDL's storage needs.

Am I missing anything?

zaneselvans commented 5 years ago

Okay, so we have decided to go ahead and move from Postgresql to tabular data packages as the canonical output, and then will use those packages to make the data available in a variety of ways -- including spinning up a local (probably) SQLite database that makes the data dynamically available, archiving them on Zenodo, maybe pushing them to BigQuery or other cloud providers, etc. @cmgosnell is the lead on this process, and it sounds like we'll be working with the folks from Open Knowledge Foundation / Frictionless Data on it. We'll be using the Data Packaging GitHub project to track the progress.

catalyst-cooperative / pudl

RDBMS vs. Data Packages / Parquet as Authoritative Output #258