MarkEdmondson1234 / ga-bq-stream

Stream JSON data into BigQuery
30 stars 18 forks source link

Stream Google Analytics data to BigQuery via Google App Engine

This is a Google App Engine application for streaming JSON into BigQuery, inspired by Luke Cempre's post on AnalyticsPros

When you request the URL /bq-streamer with the parameter ?bq={'json':'example'} then it will start a new task to put that JSON into a partitioned table in BigQuery.

The task is activated via a POST request to /bq-task with the same JSON as passed to /bq-streamer

See the Google App Engine Samples for how to deploy.

Setup

  1. Create a dataset and date partitioned BigQuery table to receive the hits. Probably want to delete data after some time in prod.
    • Create empty table > set table name > add schema > Options: Partitioning to "DAY"
  2. Add any other fields to the table that you wish to send in, the script by default also adds ts as a STRING that is a UNIX timestamp so add that too. Any unset fields won't be seen by default.
  3. Edit the app.yaml field env_variables to your BigQuery details, and your secret code word:

Example:

runtime: python27
api_version: 1
threadsafe: yes

handlers:
- url: .*
  script: main.app

#[START env]
env_variables:
  DATASET_ID: tests
  TABLE_ID: realtime
  SECRET_SALT: changethistosomethingunique
#[END env]
  1. Deploy the app (see below)
  2. Call the https://your-app-id.appost.com/bq-streamer?bq={"field_name":"field_value", "field_name2":"field_value2"} to add the fields to your BigQuery table.

For testing you can call in the browser the URL via GET but for production call via POST with the body JSON available to the bq field.

Other examples:

https://your-app-id.appspot.com/bq-streamer?bq={'bar':'blah5','foo':'hi'}

  1. The data won't appear in the BQ table preview quickly but you can query the table via something like SELECT * FROM dataset.tableID to see the realtime hits seconds after the hit it made. Turn off USE CACHED RESULTS. It also adds a ts field with a unix timestamp of when the hit was sent to BigQuery.

  2. View the logs for any errors https://console.cloud.google.com/logs/viewer

Deploying

  1. Download the Google App Engine Python SDK for your platform.
  2. Open terminal then browse to the folder containing app.yaml
  3. The app requires extra libraries to be installed. You need to install the dependencies with pip.

This installs the libraries to a new folder lib in the app directory. It most likely won't need to add anything.

    pip install -t lib -r requirements.txt
  1. Deploy via:

    gcloud app deploy --project [YOUR_PROJECT_ID]

Optional flags:

  1. Visit https://your-app-id.appost.com to view your application.

Additional resources

For more information on App Engine:

https://cloud.google.com/appengine

For more information on Python on App Engine:

https://cloud.google.com/appengine/docs/python

Quotas and limits

Using the BigQuery realtime

Included is also a class to query the entire BigQuery table, in production you would want to limit query to greater than a timestamp in ts to avoid it being too large.

Visiting http://your-app-id.appspot.com/bq-get will get you the BQ table in JSON format - it has no caching enabled so it will also be the freshest results.

Hash

To protect privacy, a hash also has to be supplied. This is generated via the secret salt name you should change in setup to something unique for you, and then generate it in python via:

import hashlib
# your query
q = "SELECT * FROM %s.%s"
# the unique secret word in the app.yaml environment vars
salt = "SECRETWORD"

## use this in the API call for parameter `hash`
hashlib.sha224(q+salt).hexdigest()

Or view the expected hash in the error logs when you attempt connection

Limit

By default it will return the most recent record of the table - pass limit=X to get more.

Example:

http://your-app-id.appspot.com/bq-get?limit=1000&hash=63780cbd6c3f6e632b57d9f8f70ea7edcd3c6eb5cbdd1b3183ba28b6

Output

The output is a list of lists like this:

[
  ["ts", "1483705837.05", null],
  ["blah4", "1483710114.48", null],
  ["1", "4", null],
  ["1", "4", null],
  ["this_is_json", "ts", null],
  ["h2i", "1483714626.43", "blah6"],
  ["y", "e", null],
  ["blah3", "1483707843.84", null],
  ["1", "4", null],
  ["y", "e", null],
  ["y", "e", null],
  ["blah3", "1483709017.45", null],
  ["y", "e", null],
  ["h2i", "1483714325.09", "blah6"],
  ["blah3", "1483710007.28", null],
  ["1", "4", null],
  ["h23i", "1483716480.93", "bl8h7"],
  ["hi", "1483710547.94", "blah5"],
  ["h2i", "1483716200.83", "blah7"]
]

By default the query is:

    query = 'SELECT * FROM %s.%s LIMIT %s' % (datasetId, tableId, limit)

You can use your own query by supplying a q parameter to the URL. It uses Standard SQL, not legacy. Use the %s.%s in your query that will be filled in with the correct dataset and tableId. The limit URL parameter is ignored when supplying your own SQL.

http://your-app-id.appspot.com/bq-get?hash=XXXXXq=SELECT * FROM %s.%s LIMIT 10

Applications can then use this data for display.

Sending hits from Google Tag Manager

<script>

    var bqArray = {};

  bqArray["fieldname"] = "{{dataLayer}}";
  bqArray["fieldname2"] = "{{dataLayer2}}";

  jQuery.post("https://YOUR-PROJECT-ID.appspot.com/bq-streamer", {"bq":JSON.stringify(bqArray)});

</script>