pg2bq is made for one thing: export tables from PostgreSQL to Google BigQuery.
It's useful for keeping the data in both places in sync (use cron, Airflow, or any other scheduler to run the export on a regular basis). If your metadata live in PostgreSQL but your realtime data live in BigQuery, chances are you want to join them.
Note: internally it uses the Spark framework for the sake of simplicity, but no Hadoop cluster is needed: Spark is configured as a "local" cluster by default, meaning the application runs standalone.
- Create a `configuration.conf` file so pg2bq knows where to grab and put the data (HOCON format); the referenced GCS bucket and BigQuery dataset can be created as shown right after this list:
jdbc {
  url = "jdbc:postgresql://mypg:5432/mydb"
  user = "myuser"
  password = "mypwd"
  tables = [ "user", "campaign", "website" ]
}

gcloud {
  project = "gcloud-project-id"
  service-account-key-path = "/path/to/service-account-key.json"
  bq.dataset = "mypg"
  gcs.tmp-bucket = "pg-export-tmp"
}
- Run the application, specifying the config file:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json ./bin/pg2bq -Dconfig.file=configuration.conf
- Done!
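If the BigQuery dataset and the temporary GCS bucket referenced in the configuration don't exist yet, they can be created with the Google Cloud SDK (the names below match the sample config above):
gsutil mb -p gcloud-project-id gs://pg-export-tmp
bq --project_id=gcloud-project-id mk mypg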
Add this to a scheduler every 10min and enjoy your JOINs in BigQuery.
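For example, a crontab entry running the export every 10 minutes could look like this (the `/opt/pg2bq` installation and config paths are only examples):
*/10 * * * * GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json /opt/pg2bq/bin/pg2bq -Dconfig.file=/opt/pg2bq/configuration.conf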
# What does it do exactly?
- It exports the data from the tables into Spark DataFrames.
- It saves them to GCS as `.avro` files, keeping the schema along with the data: this avoids having to specify or create the BigQuery table schemas beforehand.
- It starts BigQuery load jobs to import those `.avro` files into the corresponding BigQuery tables (the manual equivalent is sketched below).
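For reference, loading one of the exported Avro files into BigQuery by hand would look roughly like this (dataset, table, and bucket names come from the sample config; the exact GCS path layout pg2bq uses is an assumption):
bq load --source_format=AVRO mypg.user gs://pg-export-tmp/user/*.avro
Since Avro files carry their own schema, no schema argument needs to be passed to the load job.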
# Development
To run the application in dev mode, create a proper `application.conf` and run the app:
- via the IDE, just run `Main`
- via sbt, by pre-packaging the whole thing:
$ sbt stage && ./target/universal/stage/bin/pg2bq