This repository handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves them to the `httparchive` dataset in BigQuery.
The pipelines run in the Dataform service in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. The code in the `main` branch is used on each triggered pipeline run.
Each pipeline is identified by a Dataform tag:

- Tag: `crawl_complete`
- Tag: `cwv_tech_report`
  - Consumers:
- Tag: `blink_features_report`
  - Consumers:
- Tag: `crawl_results_legacy`
The pipelines are triggered by the following schedules, which select Dataform actions by tag (see the sketch after this list):

- `crawl-complete` PubSub subscription
  - Tags: `["crawl_complete", "blink_features_report", "crawl_results_legacy"]`
- `bq-poller-cwv-tech-report` Scheduler
  - Tags: `["cwv_tech_report"]`
In order to unify the workflow triggering mechanism, we use a Cloud Run function that can be invoked in a number of ways (e.g. by listening to PubSub messages), performs intermediate checks, and triggers the appropriate Dataform workflow execution configuration.
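For illustration, such a function might look like the following Node.js sketch. All resource names are placeholders, and the real function's checks and event-to-workflow routing are not reproduced here:

```js
// index.js -- illustrative trigger function, not this repository's actual code.
const functions = require('@google-cloud/functions-framework');
const {DataformClient} = require('@google-cloud/dataform').v1beta1;

const client = new DataformClient();

// Invoked with a CloudEvent wrapping a PubSub message (e.g. crawl-complete).
functions.cloudEvent('trigger', async (cloudEvent) => {
  const payload = Buffer.from(cloudEvent.data.message.data, 'base64').toString();
  console.log(`Received event: ${payload}`);

  // Intermediate checks would go here, e.g. validating the payload or
  // verifying that the source tables are ready before starting anything.

  // Trigger the workflow execution configuration matching this event.
  // Resource names below are placeholders.
  const [invocation] = await client.createWorkflowInvocation({
    parent: 'projects/PROJECT_ID/locations/us-central1/repositories/REPO_NAME',
    workflowInvocation: {
      workflowConfig:
        'projects/PROJECT_ID/locations/us-central1/repositories/REPO_NAME/workflowConfigs/crawl_complete',
    },
  });
  console.log(`Started workflow invocation: ${invocation.name}`);
});
```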
Hints for working in a Dataform development workspace:

1. In workflow settings vars:
   - set `env_name: dev` to process sampled data in a dev workspace;
   - set the `today` variable to a month in the past, which may be helpful for testing pipelines based on `chrome-ux-report` data.

   An example of these settings follows this list.
2. The `definitions/extra/test_env.sqlx` script helps to set up the tables required to run the pipelines in a dev workspace. It's disabled by default.
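For illustration, the relevant vars might look like this; the values and the date format for `today` are assumptions, not taken from this repository:

```yaml
# workflow_settings.yaml (illustrative fragment)
vars:
  env_name: dev        # process sampled data in a dev workspace
  today: "2024-06-01"  # a month in the past, e.g. for chrome-ux-report-based tests
```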
Issues within the pipeline are tracked using alerts. Error notifications are sent to the #10x-infra Slack channel.