18F / analytics-reporter

Lightweight analytics reporting and publishing tool for Digital Analytics Program's Google Analytics 360 data.
https://analytics.usa.gov/

Analytics Reporter

A lightweight system for publishing analytics data from the Digital Analytics Program (DAP) Google Analytics 4 government-wide property. This project uses the Google Analytics Data API v1 to acquire analytics data and then processes it into a flat data structure.

The project previously used the Google Analytics Core Reporting API v3 and the Google Analytics Real Time API v3 (together known as Universal Analytics), which provide slightly different data points. See Upgrading from Universal Analytics for more details. The Google Analytics v3 API will be deprecated on July 1, 2024.

This is used in combination with analytics-reporter-api to power the government analytics website, analytics.usa.gov.

Available reports are named and described in api.json and usa.json. For now, they're hardcoded into the repository.

The process for adding features to this project is described in Development and deployment process.

Local development setup

Prerequisites

Install dependencies

npm install

Linting

This repo uses ESLint and Prettier for static code analysis and formatting. Run the linter with:

npm run lint

Automatically fix lint issues with:

npm run lint:fix

Install git hooks

The ./hooks directory provides git hooks to help with common development tasks. These hooks install the current npm packages on branch change events and run the linter on pre-commit.

Install the provided hooks with the following command:

npm run install-git-hooks

Running the unit tests

The unit tests for this repo require a local PostgreSQL database. You can run a local DB server or create a docker container using the provided test compose file. The container option requires docker and docker-compose to be installed.

Starting a docker test DB:

docker-compose -f docker-compose.test.yml up

Once you have a PostgreSQL DB running locally, you can run the tests. The test DB connection in knexfile.js has some default connection config which can be overridden with environment variables. If using the provided docker-compose DB then you can avoid setting the connection details.

Run the tests (pre-test hook runs DB migrations):

npm test

Running the unit tests with code coverage reporting

If you wish to see a code coverage report after running the tests, use the following command. This runs the DB migrations, tests, and the NYC code coverage tool:

npm run coverage

Running the integration tests

The integration tests for this repo require the Google Analytics credentials to be set in the environment. This can be set up with the dotenv-cli package as described in the "Setup environment" section below.

Note that these tests make real requests to Google Analytics APIs and should be run sparingly to avoid rate limiting in our live apps, which use the same account credentials.

# Run cucumber integration tests
dotenv -e .env npm run cucumber

# Run cucumber integration tests with node debugging enabled
dotenv -e .env npm run cucumber:debug

The cucumber features and support files can be found in the features directory.

Running the application as an npm package

npm install -g analytics-reporter

Running the application locally

To run the application locally with database reporting, you'll need a postgres database running on port 5432. There is a docker-compose file provided in the repo so that you can start an empty database with the command:

docker-compose up

Setup environment

See "Configuration and Google Analytics Setup" below for the required environment variables and other setup for Google Analytics auth.

It may be easiest to use the dotenv-cli package to configure the environment for the application.

Create a .env file using env.example as a template, with the correct credentials and other config values. This file is ignored in the .gitignore file and should not be checked in to the repository.

Run the application

# running the app with no config
npm start

# running the app with dotenv-cli
dotenv -e .env npm start

Configuration

Google Analytics

export ANALYTICS_REPORT_EMAIL="YYYYYYY@developer.gserviceaccount.com"
export ANALYTICS_REPORT_IDS="XXXXXX"

You may wish to manage these using autoenv. If you do, there is an example.env file you can copy to .env to get started.

To find your Google Analytics view ID:

  1. Sign in to your Analytics account.
  2. Select the Admin tab.
  3. Select an account from the dropdown in the ACCOUNT column.
  4. Select a property from the dropdown in the PROPERTY column.
  5. Select a view from the dropdown in the VIEW column.
  6. Click "View Settings"
  7. Copy the view ID. You'll need to enter it with ga: as a prefix.

To specify a file path (useful in development or Linux server environments):

export ANALYTICS_KEY_PATH="/path/to/secret_key.json"

Alternatively, to specify the key directly (useful in a PaaS environment), paste in the contents of the JSON file's private_key field directly and exactly, in quotes, and rendering actual line breaks (not \n's) (below example has been sanitized):

export ANALYTICS_KEY="-----BEGIN PRIVATE KEY-----
[contents of key]
-----END PRIVATE KEY-----
"

If you have multiple accounts for a profile, you can set the ANALYTICS_CREDENTIALS variable with a JSON encoded array of those credentials and they'll be used to authorize API requests in a round-robin style.

export ANALYTICS_CREDENTIALS='[
  {
    "key": "-----BEGIN PRIVATE KEY-----\n[contents of key]\n-----END PRIVATE KEY-----",
    "email": "email_1@example.com"
  },
  {
    "key": "-----BEGIN PRIVATE KEY-----\n[contents of key]\n-----END PRIVATE KEY-----",
    "email": "email_2@example.com"
  }
]'

Test your configuration by running a single report:

./bin/analytics --only users

If you see a nicely formatted JSON file, you are all set.
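
As an illustration of the round-robin behavior mentioned above for ANALYTICS_CREDENTIALS, the following is a minimal sketch, not the project's actual implementation. The function name createCredentialRotation is hypothetical; the sketch only assumes the JSON shape shown in the export above (an array of objects with key and email fields).

```javascript
// Hypothetical helper: parse the ANALYTICS_CREDENTIALS JSON and return a
// function that hands out one credential per call, cycling back to the
// first credential after the last one has been used.
function createCredentialRotation(json) {
  const credentials = JSON.parse(json);
  if (!Array.isArray(credentials) || credentials.length === 0) {
    throw new Error("ANALYTICS_CREDENTIALS must be a non-empty JSON array");
  }
  let next = 0;
  return function nextCredential() {
    const credential = credentials[next];
    next = (next + 1) % credentials.length;
    return credential;
  };
}

// Example value in the same shape as the export above (keys are placeholders).
const exampleCredentials = JSON.stringify([
  {
    key: "-----BEGIN PRIVATE KEY-----\n[contents of key]\n-----END PRIVATE KEY-----",
    email: "email_1@example.com",
  },
  {
    key: "-----BEGIN PRIVATE KEY-----\n[contents of key]\n-----END PRIVATE KEY-----",
    email: "email_2@example.com",
  },
]);

const nextCredential = createCredentialRotation(exampleCredentials);
console.log(nextCredential().email); // email_1@example.com
console.log(nextCredential().email); // email_2@example.com
console.log(nextCredential().email); // email_1@example.com
```

Note that JSON.parse turns the \n escapes inside each key into real line breaks, which is why the JSON form can use \n while the plain ANALYTICS_KEY export needs actual line breaks.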

AWS

To configure the app for publishing data to S3 set the following environment variables:

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=[your-key]
export AWS_SECRET_ACCESS_KEY=[your-secret-key]
export AWS_BUCKET=[your-bucket]
export AWS_BUCKET_PATH=[your-path]
export AWS_CACHE_TIME=0

In some cases you may want to use a custom object storage server that is compatible with the Amazon S3 APIs, such as minio. In that case, set an extra environment variable:

export AWS_S3_ENDPOINT=http://your-storage-server:port
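
To show how these variables might fit together, here is a minimal sketch that collects them into an options object. The function name s3OptionsFromEnv is hypothetical, and the property names (region, credentials, endpoint, forcePathStyle) follow the client configuration of the AWS SDK for JavaScript v3; whether this project builds its client this way is an assumption. Enabling path-style addressing for a custom endpoint is also an assumption, chosen because minio-style servers commonly need it.

```javascript
// Hypothetical helper: build S3 client options from the environment
// variables documented above. Nothing here talks to AWS; it only
// assembles a plain object.
function s3OptionsFromEnv(env = process.env) {
  const options = {
    region: env.AWS_REGION,
    credentials: {
      accessKeyId: env.AWS_ACCESS_KEY_ID,
      secretAccessKey: env.AWS_SECRET_ACCESS_KEY,
    },
  };
  if (env.AWS_S3_ENDPOINT) {
    // Custom S3-compatible server (for example minio).
    options.endpoint = env.AWS_S3_ENDPOINT;
    // Assumption: path-style URLs (http://host:port/bucket/key).
    options.forcePathStyle = true;
  }
  return options;
}

// Usage with placeholder values:
const exampleS3Options = s3OptionsFromEnv({
  AWS_REGION: "us-east-1",
  AWS_ACCESS_KEY_ID: "your-key",
  AWS_SECRET_ACCESS_KEY: "your-secret-key",
  AWS_S3_ENDPOINT: "http://your-storage-server:9000",
});
console.log(exampleS3Options.endpoint);
```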

Egress proxy config

The application can be configured to use an egress proxy for HTTP calls which are external to the application's running environment. To configure the app to use an egress proxy, set the following environment variables:

export PROXY_FQDN=[The fully qualified domain of your proxy server]
export PROXY_PORT=[The port for the proxy server]
export PROXY_USERNAME=[The username to use for proxy requests]
export PROXY_PASSWORD=[The password to use for proxy requests]
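
As a rough illustration of how the four proxy variables could be combined, the sketch below assembles them into a single proxy URL. The function name proxyUrlFromEnv and the http scheme are assumptions made for this example; the project's own proxy handling may differ.

```javascript
// Hypothetical helper: combine the PROXY_* variables into one proxy URL.
// The URL class percent-encodes special characters in the username and
// password, so credentials containing "@" or ":" stay unambiguous.
function proxyUrlFromEnv(env = process.env) {
  const { PROXY_FQDN, PROXY_PORT, PROXY_USERNAME, PROXY_PASSWORD } = env;
  if (!PROXY_FQDN || !PROXY_PORT) {
    return null; // No proxy configured.
  }
  // Assumption: the proxy is reached over plain HTTP.
  const url = new URL(`http://${PROXY_FQDN}:${PROXY_PORT}`);
  url.username = PROXY_USERNAME || "";
  url.password = PROXY_PASSWORD || "";
  return url.href;
}

// Usage with placeholder values:
const exampleProxyUrl = proxyUrlFromEnv({
  PROXY_FQDN: "proxy.example.internal",
  PROXY_PORT: "8080",
  PROXY_USERNAME: "user",
  PROXY_PASSWORD: "p@ss",
});
console.log(exampleProxyUrl);
```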

Other configuration

If you use a single domain for all of your analytics data, then your profile is likely set to return relative paths (e.g. /faq) and not absolute paths when accessing real-time reports.

You can set a default domain to be returned as data in all real-time data points:

export ANALYTICS_HOSTNAME=https://konklone.com

This will produce points similar to the following:

{
  "page": "/post/why-google-is-hurrying-the-web-to-kill-sha-1",
  "page_title": "Why Google is Hurrying the Web to Kill SHA-1",
  "active_visitors": "1",
  "domain": "https://konklone.com"
}
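
The effect of ANALYTICS_HOSTNAME on a data point can be sketched as follows. This is an illustrative sketch only: addDefaultDomain is a hypothetical name, and leaving an existing domain value untouched is an assumption rather than documented behavior.

```javascript
// Hypothetical helper: add a default domain to each real-time data point.
// `hostname` stands in for the value of ANALYTICS_HOSTNAME.
function addDefaultDomain(points, hostname) {
  if (!hostname) {
    return points; // Nothing configured, leave the points as they are.
  }
  // Assumption: a point that already carries a domain keeps it.
  return points.map((point) =>
    point.domain ? point : { ...point, domain: hostname }
  );
}

const examplePoints = [
  {
    page: "/post/why-google-is-hurrying-the-web-to-kill-sha-1",
    page_title: "Why Google is Hurrying the Web to Kill SHA-1",
    active_visitors: "1",
  },
];

console.log(addDefaultDomain(examplePoints, "https://konklone.com"));
```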

Use

Reports are created and published using npm start or ./bin/analytics:

# using npm scripts
npm start

# running the app directly
./bin/analytics

This will run every report, in sequence, and print out the resulting JSON to STDOUT.

A report might look something like this:

{
  "name": "devices",
  "frequency": "daily",
  "slim": true,
  "query": {
    "dimensions": [
      {
        "name": "date"
      },
      {
        "name": "deviceCategory"
      }
    ],
    "metrics": [
      {
        "name": "sessions"
      }
    ],
    "dateRanges": [
      {
        "startDate": "30daysAgo",
        "endDate": "yesterday"
      }
    ],
    "orderBys": [
      {
        "dimension": {
          "dimensionName": "date"
        },
        "desc": true
      }
    ]
  },
  "meta": {
    "name": "Devices",
    "description": "30 days of desktop/mobile/tablet visits for all sites."
  },
  "data": [
    {
      "date": "2023-12-25",
      "device": "mobile",
      "visits": "13681896"
    },
    {
      "date": "2023-12-25",
      "device": "desktop",
      "visits": "5775002"
    },
    {
      "date": "2023-12-25",
      "device": "tablet",
      "visits": "367039"
    },
   ...
  ],
  "totals": {
    "visits": 3584551745,
    "devices": {
      "mobile": 2012722956,
      "desktop": 1513968883,
      "tablet": 52313579,
      "smart tv": 5546327
    }
  },
  "taken_at": "2023-12-26T20:52:50.062Z"
}
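
To make the relationship between the data rows and the totals object more concrete, here is a minimal sketch of one way such totals could be derived. The function name summarizeDevices is hypothetical and this is not necessarily how the reporter computes them. It uses only the three sample rows shown above, so its result does not match the 30-day totals in the example report.

```javascript
// Hypothetical helper: sum visits overall and per device from flat rows
// shaped like the "data" array above. Visit counts arrive as strings,
// so they are parsed to integers before adding.
function summarizeDevices(rows) {
  const totals = { visits: 0, devices: {} };
  for (const row of rows) {
    const visits = Number.parseInt(row.visits, 10);
    totals.visits += visits;
    totals.devices[row.device] = (totals.devices[row.device] || 0) + visits;
  }
  return totals;
}

const sampleRows = [
  { date: "2023-12-25", device: "mobile", visits: "13681896" },
  { date: "2023-12-25", device: "desktop", visits: "5775002" },
  { date: "2023-12-25", device: "tablet", visits: "367039" },
];

console.log(summarizeDevices(sampleRows));
// { visits: 19823937,
//   devices: { mobile: 13681896, desktop: 5775002, tablet: 367039 } }
```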

Options

./bin/analytics --output /path/to/data
./bin/analytics --publish
./bin/analytics --only devices
./bin/analytics --only devices,today
./bin/analytics --only devices --slim
./bin/analytics --csv
./bin/analytics --frequency=realtime
./bin/analytics --publish --debug

Saving data to postgres

The analytics reporter can write the data it pulls from Google Analytics to a Postgres database. The Postgres configuration can be set using environment variables:

export POSTGRES_HOST="my.db.host.com"
export POSTGRES_USER="postgres"
export POSTGRES_PASSWORD="123abc"
export POSTGRES_DATABASE="analytics"

The database expects a particular schema which will be described in the API server that consumes and publishes this data.

To write reports to a database, use the --write-to-database option when starting the reporter.
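
For orientation, the sketch below shows how the POSTGRES_* variables could map onto a connection object of the kind accepted by knex and node-postgres (host, user, password, database, port). The function name postgresConnectionFromEnv is hypothetical, and the fixed port 5432 is an assumption based on the local setup described earlier; the project's real knexfile.js may be organized differently.

```javascript
// Hypothetical helper: build a Postgres connection object from the
// environment variables documented above, failing early if one is missing.
function postgresConnectionFromEnv(env = process.env) {
  const required = [
    "POSTGRES_HOST",
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_DATABASE",
  ];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    host: env.POSTGRES_HOST,
    user: env.POSTGRES_USER,
    password: env.POSTGRES_PASSWORD,
    database: env.POSTGRES_DATABASE,
    port: 5432, // Assumption: default Postgres port.
  };
}

// Usage with the placeholder values from the exports above:
const exampleConnection = postgresConnectionFromEnv({
  POSTGRES_HOST: "my.db.host.com",
  POSTGRES_USER: "postgres",
  POSTGRES_PASSWORD: "123abc",
  POSTGRES_DATABASE: "analytics",
});
console.log(exampleConnection.host);
```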

Cloud.gov setup

The application requires an S3 bucket and an RDS instance running a Postgres database, both set up in cloud.gov as services. The examples below use the Cloud Foundry CLI.

# Create and bind an S3 bucket service to the app
cf create-service s3 basic-public analytics-s3
cf bind-service analytics-reporter-consumer analytics-s3

# Create a RDS Postgres service for use by the app
cf create-service aws-rds small-psql analytics-reporter-database

# Connect to the database, enable pgcrypto extension, and create a new database
# for the PgBoss message queue library
cf connect-to-service -no-client analytics-develop analytics-reporter-database-develop
psql -h localhost -p <port> -U <username> -d <database>
# Run the following statements inside the psql session:
#   CREATE EXTENSION IF NOT EXISTS "pgcrypto";
#   \dx   (lists installed extensions; confirm that pgcrypto is now present)
#   CREATE DATABASE <message_queue_database_name>;

# Bind the database to both the publisher and consumer apps
cf bind-service analytics-reporter-publisher analytics-reporter-database
cf bind-service analytics-reporter-consumer analytics-reporter-database

# Database migrations for the reporter's analytics database are handled by the
# analytics-reporter-api application. Deploy the API server via CI to migrate
# the database.

# Remove public egress permissions from the space running the application if it has them
cf unbind-security-group public_networks_egress gsa-opp-analytics analytics-dev --lifecycle running

# Create a network policy in the application's space which allows communication to the egress proxy which runs in a space with public egress permissions
cf add-network-policy analytics-reporter-consumer analytics-egress-proxy -s analytics-public-egress -o gsa-opp-analytics --protocol tcp --port 8080

# Create a network policy in the public-egress space which allows communication from the egress proxy back to the application.
# The port for each API call the app makes is determined randomly, so allow the full range of port numbers.
cf target -s analytics-public-egress
cf add-network-policy analytics-egress-proxy analytics-reporter-consumer -s analytics-dev -o gsa-opp-analytics --protocol tcp --port 1-65535

Upgrading from Universal Analytics

Background

This project previously acquired data from Google Analytics V3, also known as Universal Analytics (UA).

Google is retiring UA and is encouraging users to move to its new version, Google Analytics 4 (GA4). UA will be deprecated on July 1, 2024.

Migration details

Some data points have been removed or added by Google as part of the move to GA4.

Deprecated fields

New fields

bounce_rate

The percentage of sessions that were not engaged. GA4 defines an engaged session as one that lasts longer than 10 seconds, has a conversion event, or has at least two page views.
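
The definition above amounts to simple arithmetic: the share of sessions that were not engaged. The following sketch illustrates it; the function name bounceRate and its parameters are hypothetical, and the reporter itself may simply pass through the value that Google Analytics returns.

```javascript
// Hypothetical helper: bounce rate as the fraction of sessions that were
// not engaged. Multiply by 100 for a percentage.
function bounceRate(sessions, engagedSessions) {
  if (sessions === 0) {
    return 0; // Avoid dividing by zero when there is no traffic.
  }
  return (sessions - engagedSessions) / sessions;
}

// 200 sessions, of which 150 were engaged, leaves 50 bounced sessions.
console.log(bounceRate(200, 150)); // 0.25
```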

file_name

The page path of a downloaded file.

language_code

The ISO 639 language setting of the user's device, for example 'en-us'.

session_default_channel_group

An enum which describes the session. Possible values:

'Direct', 'Organic Search', 'Paid Social', 'Organic Social', 'Email', 'Affiliates', 'Referral', 'Paid Search', 'Video', and 'Display'

Public domain

This project is in the worldwide public domain. As stated in CONTRIBUTING:

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.