
# pg2bq

pg2bq is made for one thing: exporting tables from PostgreSQL to Google BigQuery.

# Why?

It is useful to keep the data in both places in sync (using cron, Airflow, or whatever else you like to schedule the export on a regular basis). If your metadata live in PostgreSQL but your real-time data are in BigQuery, chances are you will want to join them.

Note: internally it uses the Spark framework for the sake of simplicity, but no Hadoop cluster is needed: by default it is configured as a "local" cluster, meaning the application runs standalone.
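For illustration, here is a minimal sketch (not pg2bq's actual code) of what running Spark as a "local" cluster looks like: a `SparkSession` with a `local[*]` master reading a PostgreSQL table over JDBC into a DataFrame. The JDBC URL, credentials and table name are hypothetical placeholders; pg2bq builds all of this from its configuration file.

```scala
import org.apache.spark.sql.SparkSession

object LocalSparkSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside the current JVM: no Hadoop/Spark cluster is needed.
    val spark = SparkSession.builder()
      .appName("pg2bq-local-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical PostgreSQL connection (requires the org.postgresql JDBC driver on the classpath).
    val users = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("dbtable", "public.users")
      .option("user", "postgres")
      .option("password", "secret")
      .load()

    users.printSchema()
    spark.stop()
  }
}
```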

# How to run it

- Create a configuration file (e.g. `configuration.conf`) with your Google Cloud settings:

```
gcloud {
  project = "gcloud-project-id"
  service-account-key-path = "/path/to/service-account-key.json"
  bq.dataset = "mypg"
  gcs.tmp-bucket = "pg-export-tmp"
}
```

- Run the application specifying the config file:

```
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json ./bin/pg2bq -Dconfig.file=configuration.conf
```

- Done!

Add this to a scheduler (every 10 minutes, for example) and enjoy your JOINs in BigQuery.
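For instance, a crontab entry along these lines would run the export every 10 minutes (the install path below is a hypothetical placeholder):

```
*/10 * * * * GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json /opt/pg2bq/bin/pg2bq -Dconfig.file=/opt/pg2bq/configuration.conf
```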

# What does it do exactly?

- It exports the data from the tables into Spark DataFrames.
- It saves them to GCS as `.avro` files to keep the schema alongside the data: this avoids having to specify or create the BigQuery table schemas beforehand.
- It starts BigQuery load jobs to import those `.avro` files into the respective BigQuery tables (a rough sketch of the whole flow follows below).
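The sketch below is a non-authoritative condensation of those three steps, assuming Spark with the spark-avro package and the `google-cloud-bigquery` Java client; the connection details, bucket, dataset and table names are placeholders (the GCS paths reuse the sample configuration above), and pg2bq's actual implementation differs.

```scala
import org.apache.spark.sql.SparkSession
import com.google.cloud.bigquery.{BigQueryOptions, FormatOptions, JobInfo, LoadJobConfiguration, TableId}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pg2bq-sketch").getOrCreate()

    // 1. Export the PostgreSQL table into a DataFrame (placeholder connection details).
    val users = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("dbtable", "public.users")
      .load()

    // 2. Save it to GCS as Avro so the schema travels with the data
    //    (assumes the GCS connector is on the classpath; with the older spark-avro
    //    package, use "com.databricks.spark.avro" instead of "avro").
    users.write.format("avro").save("gs://pg-export-tmp/users")

    // 3. Start a BigQuery load job that imports the Avro files into mypg.users.
    val bigquery = BigQueryOptions.getDefaultInstance.getService
    val load = LoadJobConfiguration
      .newBuilder(TableId.of("mypg", "users"), "gs://pg-export-tmp/users/*.avro")
      .setFormatOptions(FormatOptions.avro())
      .build()
    bigquery.create(JobInfo.of(load)).waitFor()

    spark.stop()
  }
}
```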

# Development

To run the application in dev mode, create a proper `application.conf` and run the app:

- via the IDE, just run `Main`
- via sbt, by pre-packaging the whole thing:

```
$ sbt stage && ./target/universal/stage/bin/pg2bq
```