ga4gh / gks-portal

MIT License

Database modeling - experiment with loading GKS variation data into parquet and DuckDB #8

Open theferrit32 opened 1 month ago

theferrit32 commented 1 month ago

Submitter Name

Kyle Ferriter

Submitter Affiliation

Broad Institute

Project Details

We would like to load GKS data into a relational model for rich querying. We currently rely heavily on BigQuery in backend processes, but we want something free and portable that can be distributed to users. AnyVar uses PostgreSQL with most data stored in JSON columns; we would like something easier to configure and run than Postgres, such as an embedded, single-file/directory database like SQLite, RocksDB, or DuckDB. We have looked briefly at DuckDB, which can import/export Parquet and NDJSON files and provide a SQL query interface over them with essentially no configuration. One difficulty with NDJSON import is that automatic schema inference fails when the data contains heterogeneous rows, which is the case for GKS data.

We are aiming for these deliverables:

Stretch:

Required Skills

theferrit32 commented 1 month ago

Semi-related, this was just posted today:

Pg_parquet: An extension to connect Postgres and parquet

https://news.ycombinator.com/item?id=41871068

pg_parquet is a PostgreSQL extension that allows you to read and write Parquet files, located in S3 or on the local file system, from PostgreSQL via COPY TO/FROM commands.

https://www.crunchydata.com/blog/pg_parquet-an-extension-to-connect-postgres-and-parquet