HTTP Archive datasets pipeline

This repository handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves this to the httparchive dataset in BigQuery.

Pipelines

The pipelines are run in Dataform service in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. The code in the main branch is used on each triggered pipeline run.

Crawl results

Tag: crawl_complete

httparchive.crawl.pages
httparchive.crawl.parsed_css
httparchive.crawl.requests

Core Web Vitals Technology Report

Tag: cwv_tech_report

httparchive.core_web_vitals.technologies

Consumers:

HTTP Archive Tech Report

Blink Features Report

Tag: blink_features_report

httparchive.blink_features.features
httparchive.blink_features.usage

Consumers:

chromestatus.com - example

Legacy crawl results (to be deprecated)

Tag: crawl_results_legacy

httparchive.all.pages
httparchive.all.parsed_css
httparchive.all.requests
httparchive.lighthouse.YYYY_MM_DD_client
httparchive.pages.YYYY_MM_DD_client
httparchive.requests.YYYY_MM_DD_client
httparchive.response_bodies.YYYY_MM_DD_client
httparchive.summary_pages.YYYY_MM_DD_client
httparchive.summary_requests.YYYY_MM_DD_client
httparchive.technologies.YYYY_MM_DD_client

Schedules

crawl-complete PubSub subscription

Tags: ["crawl_complete", "blink_features_report", "crawl_results_legacy"]
bq-poller-cwv-tech-report Scheduler

Tags: ["cwv_tech_report"]

Triggering workflows

In order to unify the workflow triggering mechanism, we use a Cloud Run function that can be invoked in a number of ways (e.g. listen to PubSub messages), do intermediate checks and trigger the particular Dataform workflow execution configuration.

Contributing

Dataform development

Create new dev workspace in Dataform.
Make adjustments to the dataform configuration files and manually run a workflow to verify.
Push all your changes to a dev branch & open a PR with the link to the BigQuery artifacts generated in the test workflow.

Dataform development workspace hints

In workflow settings vars:
- set env_name: dev to process sampled data in dev workspace.
- change today variable to a month in the past. May be helpful for testing pipelines based on chrome-ux-report data.
definitions/extra/test_env.sqlx script helps to setup the tables required to run pipelines when in dev workspace. It's disabled by default.

Error Monitoring

The issues within the pipeline are being tracked using the following alerts:

the event trigger processing fails - Dataform Trigger Function Error
a job in the workflow fails - "Dataform Workflow Invocation Failed

Error notifications are sent to #10x-infra Slack channel.

HTTPArchive / dataform

readme