Bundling parquet extension?

mhkeller commented 1 month ago

The pg_parquet extension was just released, which brings some great support for parquet files. I wanted to drop it here to put on the radar as something that might be of interest to bundle with Postgres.app in the same way that PostGIS is super easy to get up and running. I realize that each additional extension adds complexity, though, and the line has to be drawn somewhere.

jakob commented 1 month ago

At first glance it looks like something that is not easy to build. Apache Arrow looks like a non-trivial dependency.

I also have no idea how popular the parquet file format is. Is it something that would provide a lot of value to people? I'm all for it if we can make Postgres.app useful for additional audiences, but I really have no idea how big Parquet is and what communities use it.

mhkeller commented 1 month ago

Yeah, I think that's the key question, and it's a bit of a chicken-and-egg problem: usage is a function of availability, and availability is a function of usage. I can give a bit of context as a non-expert, so if others want to chime in, feel free.

Parquet is a new-ish data storage format that has the advantage of being typed and also storing data in a columnar format so it works well for a lot of data analysis pipelines. (There are probably many other things I'm omitting but for my purposes, those are the two main things that are neat about it.)
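To make "typed" and "columnar" a bit more concrete, here is a toy sketch in plain Python. It is not the Parquet file format and does not use any Parquet library; the sample data, the `schema` dict, and the `to_columns` helper are all made up for illustration. It only shows the general idea of storing one typed list per column instead of a list of rows.

```python
# Toy illustration only: this is NOT the Parquet format, just the idea of
# column-oriented storage with one declared type per column.

rows = [
    {"city": "Oslo", "year": 2023, "temp_c": 6.4},
    {"city": "Lima", "year": 2023, "temp_c": 19.1},
    {"city": "Pune", "year": 2023, "temp_c": 25.0},
]

# Hypothetical schema: each column has exactly one declared type.
schema = {"city": str, "year": int, "temp_c": float}


def to_columns(rows, schema):
    """Pivot a list of row dicts into one list per column, checking types."""
    columns = {name: [] for name in schema}
    for row in rows:
        for name, expected in schema.items():
            value = row[name]
            if not isinstance(value, expected):
                raise TypeError(
                    f"{name}: expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
            columns[name].append(value)
    return columns


columns = to_columns(rows, schema)

# An aggregate over one column only needs that column's values,
# not the full rows.
mean_temp = sum(columns["temp_c"]) / len(columns["temp_c"])
```

In a layout like this, an analysis that only needs `temp_c` never has to read `city` or `year`, which is roughly why columnar formats suit analytical queries.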

Parquet has gotten more popular since DuckDB started supporting direct queries against Parquet files. You skip the step of loading data into your database and, if I understand correctly, the query can efficiently locate the required data within the file without loading the entire file into memory. That enables a workflow where you put a Parquet file on S3 (or some other kind of cloud bucket), use DuckDB locally to query it, and essentially have a shared database without dealing with the infrastructure of running one. Other projects like Mosaic use DuckDB and Parquet to build performant data visualizations.

For my purposes, I'd like to use Parquet files more, because a typed format is a good alternative to passing CSVs around, and I'd like an easier way to create Parquet files from data already in a database. For example, if I want to export a view from a shared Postgres table and do some additional analysis on it in a scripting language, Parquet is a good option as the interchange format. I'm able to install pg_parquet locally for myself and play around with it, but the advantage of bundling it is that it makes it easier for less-technical co-workers to share a workflow, similar to how getting set up with PostGIS is super easy.
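As a small illustration of why an untyped interchange format is a pain, here is a sketch using only the Python standard library. The sample row is invented; the point is just that a CSV round trip turns every value into a string, so the reader has to re-guess the types.

```python
import csv
import io

# Invented sample data with three different Python types.
rows = [{"id": 1, "price": 9.5, "in_stock": True}]

# Write the rows to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "price", "in_stock"])
writer.writeheader()
writer.writerows(rows)

# Read the same data back.
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))

# The int, float, and bool all come back as plain strings.
types_after = {key: type(value).__name__ for key, value in round_tripped[0].items()}
```

A typed format carries the column types with the data, so the consumer does not have to reconstruct them by convention or guesswork.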

This space is still pretty new, though, so a sensible approach would likely be to wait six months to a year to see how the ecosystem and workflows develop and whether anyone besides me finds this useful. Adding a gnarly dependency and supporting it going forward is not something I would wish on anyone.

PostgresApp / PostgresApp

Bundling parquet extension? #773