HTTPArchive / data-pipeline

The new HTTP Archive data pipeline built entirely on GCP
Apache License 2.0

Reorganize the BigQuery datasets to be more efficient #15

Open rviscomi opened 2 years ago

rviscomi commented 2 years ago

Similar to the almanac dataset, we want the monthly results to be queryable in partitioned and clustered tables.

We'll need a deprecation plan to avoid suddenly breaking existing queries.

rviscomi commented 2 years ago

Brainstorming new tables/schemas.

httparchive.all.pages

All-in-one table containing all page-based results from all crawls for all clients.

Partition by:

- `date`

Cluster by:

- `client`
- `is_root_page`
- `rank`
- `page`

Combines data from multiple existing tables:

- `httparchive.pages`
- `httparchive.summary_pages`
- `httparchive.technologies`
- `httparchive.lighthouse`

Additional enhancements:

- Structured `features` and `technologies` records parsed out of the payload
- New `rank`, `is_root_page`, `root_page`, and `metadata` fields
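In DDL terms, the proposal corresponds to something like this sketch (treating the cluster fields above as the working assumption; most columns elided):

CREATE TABLE `httparchive.all.pages`
(
  date DATE NOT NULL,
  client STRING NOT NULL,
  page STRING NOT NULL,
  is_root_page BOOL NOT NULL,
  rank INT64
  # ...remaining fields per the schema below
)
PARTITION BY date
CLUSTER BY client, is_root_page, rank, page;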

Schema

[
    {
        "name": "date",
        "type": "DATE",
        "mode": "REQUIRED",
        "description": "YYYY-MM-DD format of the HTTP Archive monthly crawl"
    },
    {
        "name": "client",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "Test environment: desktop or mobile"
    },
    {
        "name": "page",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "The URL of the page being tested"
    },
    {
        "name": "is_root_page",
        "type": "BOOLEAN",
        "mode": "REQUIRED",
        "description": "Whether the page is the root of the origin"
    },
    {
        "name": "root_page",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "The URL of the root page being tested, the origin followed by /"
    },
    {
        "name": "rank",
        "type": "INTEGER",
        "mode": "NULLABLE",
        "description": "Site popularity rank, from CrUX"
    },
    {
        "name": "wptid",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "ID of the WebPageTest results"
    },
    {
        "name": "payload",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "JSON-encoded WebPageTest results for the page"
    },
    {
        "name": "summary",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "JSON-encoded summarization of the page-level data"
    },
    {
        "name": "custom_metrics",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "JSON-encoded test results of the custom metrics"
    },
    {
        "name": "lighthouse",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "JSON-encoded Lighthouse report"
    },
    {
        "name": "features",
        "type": "RECORD",
        "mode": "REPEATED",
        "description": "Blink features detected at runtime (see https://chromestatus.com/features)",
        "fields": [
            {
                "name": "feature",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Blink feature name"
            },
            {
                "name": "id",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Blink feature ID"
            },
            {
                "name": "type",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Blink feature type (css, default)"
            }
        ]
    },
    {
        "name": "technologies",
        "type": "RECORD",
        "mode": "REPEATED",
        "description": "Technologies detected at runtime (see https://www.wappalyzer.com/)",
        "fields": [
            {
                "name": "technology",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Name of the detected technology"
            },
            {
                "name": "categories",
                "type": "STRING",
                "mode": "REPEATED",
                "description": "List of categories to which this technology belongs"
            },
            {
                "name": "info",
                "type": "STRING",
                "mode": "REPEATED",
                "description": "Additional metadata about the detected technology, ie version number"
            }
        ]
    },
    {
        "name": "metadata",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "Additional metadata about the test"
    }
]

Example queries

Median page weight over time

SELECT
  # No need to parse the date or client out of _TABLE_SUFFIX
  date,
  client,
  # Existing summary stats can be extracted from the JSON-encoded `summary` field
  APPROX_QUANTILES(CAST(JSON_VALUE(summary, '$.bytesTotal') AS INT64), 1000)[OFFSET(500)] AS median_page_weight
FROM
  # New table address
  `httparchive.all.pages`
WHERE
  # The table is partitioned by date, so data before 2020 is neither scanned nor billed
  date >= '2020-01-01' AND
  # Only measure root/home pages for consistency
  is_root_page
GROUP BY
  date,
  client
ORDER BY
  date,
  client
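The repeated `features` and `technologies` records can be queried directly with UNNEST. For example, a sketch counting pages by detected technology in a single crawl:

SELECT
  t.technology,
  COUNT(DISTINCT page) AS pages
FROM
  `httparchive.all.pages`,
  # Flatten the repeated record into one row per detected technology
  UNNEST(technologies) AS t
WHERE
  date = '2022-05-01' AND
  client = 'mobile' AND
  is_root_page
GROUP BY
  t.technology
ORDER BY
  pages DESC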

httparchive.all.requests

All-in-one table containing all request-based results for all crawls and clients.

Partition by:

- `date`

Cluster by:

- `client`
- `is_root_page`
- `is_main_document`
- `type`

Combines data from multiple existing tables:

- `httparchive.requests`
- `httparchive.summary_requests`
- `httparchive.response_bodies`

Additional enhancements:

- Structured `request_headers` and `response_headers` records parsed out of the payload
- New `is_root_page`, `root_page`, and `is_main_document` fields for cheaper filtering and joins

Schema

[
    {
        "name": "date",
        "type": "DATE",
        "mode": "REQUIRED",
        "description": "YYYY-MM-DD format of the HTTP Archive monthly crawl"
    },
    {
        "name": "client",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "Test environment: desktop or mobile"
    },
    {
        "name": "page",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "The URL of the page being tested"
    },
    {
        "name": "is_root_page",
        "type": "BOOLEAN",
        "mode": "NULLABLE",
        "description": "Whether the page is the root of the origin."
    },
    {
        "name": "root_page",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "The URL of the root page being tested"
    },
    {
        "name": "url",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "The URL of the request"
    },
    {
        "name": "is_main_document",
        "type": "BOOLEAN",
        "mode": "REQUIRED",
        "description": "Whether this request corresponds with the main HTML document of the page, which is the first HTML request after redirects"
    },
    {
        "name": "type",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "Simplified description of the type of resource (script, html, css, text, other, etc)"
    },
    {
        "name": "index",
        "type": "INTEGER",
        "mode": "NULLABLE",
        "description": "The sequential 0-based index of the request"
    },
    {
        "name": "payload",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "JSON-encoded WebPageTest result data for this request"
    },
    {
        "name": "summary",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "JSON-encoded summarization of request data"
    },
    {
        "name": "request_headers",
        "type": "RECORD",
        "mode": "REPEATED",
        "description": "Request headers",
        "fields": [
            {
                "name": "name",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Request header name"
            },
            {
                "name": "value",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Request header value"
            }
        ]
    },
    {
        "name": "response_headers",
        "type": "RECORD",
        "mode": "REPEATED",
        "description": "Response headers",
        "fields": [
            {
                "name": "name",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Response header name"
            },
            {
                "name": "value",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Response header value"
            }
        ]
    },
    {
        "name": "response_body",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "Text-based response body"
    }
]

Example queries
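For example, a sketch (assuming the partitioning and clustering described above) tallying requests by resource type for a single crawl:

SELECT
  type,
  COUNT(0) AS requests
FROM
  `httparchive.all.requests`
WHERE
  # Partitioned and clustered fields keep the scan narrow
  date = '2022-05-01' AND
  client = 'mobile' AND
  is_root_page
GROUP BY
  type
ORDER BY
  requests DESC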

rviscomi commented 2 years ago

Generating May 2022 data for httparchive.all.pages:

CREATE TEMP FUNCTION GET_CUSTOM_METRICS(payload STRING) RETURNS STRING LANGUAGE js AS '''
const $ = JSON.parse(payload);
return JSON.stringify(Object.fromEntries($._custom.map(name => {
  let value = $[`_${name}`];
  if (typeof value == 'string') {
    try {
      value = JSON.parse(value);
    } catch (e) {
      // The value is not a JSON string.
    }
  }
  return [name, value];
})));
''';

CREATE TEMP FUNCTION GET_FEATURES(payload STRING)
RETURNS ARRAY<STRUCT<feature STRING, id STRING, type STRING>> LANGUAGE js AS
'''
  function getFeatureNames(featureMap, featureType) {
    try {
      return Object.entries(featureMap).map(([key, value]) => {
        // After Feb 2020 keys are feature IDs.
        if (value.name) {
          return {'feature': value.name, 'type': featureType, 'id': key};
        }
        // Prior to Feb 2020 keys fell back to IDs if the name was unknown.
        if (idPattern.test(key)) {
          return {'feature': '', 'type': featureType, 'id': key.match(idPattern)[1]};
        }
        // Prior to Feb 2020 keys were names by default.
        return {'feature': key, 'type': featureType, 'id': ''};
      });
    } catch (e) {
      return [];
    }
  }

  var $ = JSON.parse(payload);
  if (!$._blinkFeatureFirstUsed) return [];

  var idPattern = new RegExp('^Feature_(\\d+)$');
  return getFeatureNames($._blinkFeatureFirstUsed.Features, 'default')
    .concat(getFeatureNames($._blinkFeatureFirstUsed.CSSFeatures, 'css'))
    .concat(getFeatureNames($._blinkFeatureFirstUsed.AnimatedCSSFeatures, 'animated-css'));
''';

WITH pages AS (
  SELECT
    _TABLE_SUFFIX AS client,
    url AS page,
    JSON_VALUE(payload, '$._metadata.rank') AS rank,
    JSON_VALUE(payload, '$._metadata.crawl_depth') = '0' AS is_root_page,
    JSON_VALUE(payload, '$._testID') AS wptid,
    GET_CUSTOM_METRICS(payload) AS custom_metrics,
    JSON_QUERY(payload, '$._metadata') AS metadata,
    payload
  FROM
    `httparchive.pages.2022_05_01_*`
# TODO: Backfill when the summary_pages data is ready.
/* ), summary_pages AS (
  SELECT
    _TABLE_SUFFIX AS client,
    url AS page,
    TO_JSON_STRING(summary_pages) AS summary
  FROM
    `httparchive.summary_pages.2022_05_01_*` AS summary_pages */
), loose_technologies AS (
    SELECT
      _TABLE_SUFFIX AS client,
      url AS page,
      STRUCT(
        app AS technology,
        ARRAY_AGG(DISTINCT category ORDER BY category) AS categories,
        ARRAY_AGG(info) AS info
      ) AS technology
    FROM
      `httparchive.technologies.2022_05_01_*`
    GROUP BY
      client,
      page,
      app
), techs AS (
  SELECT
    client,
    page,
    ARRAY_AGG(technology) AS technologies
  FROM
    loose_technologies
  GROUP BY
    client,
    page
), lh AS (
  SELECT
    _TABLE_SUFFIX AS client,
    url AS page,
    report AS lighthouse
  FROM
    `httparchive.lighthouse.2022_05_01_*`
)

SELECT
  DATE('2022-05-01') AS date,
  client,
  page,
  is_root_page,
  rank,
  wptid,
  payload,
  # TODO: Update when the summary pipeline completes successfully.
  '' AS summary,
  custom_metrics,
  lighthouse,
  GET_FEATURES(payload) AS features,
  technologies,
  metadata
FROM
  pages
LEFT JOIN
  techs
USING
  (client, page)
LEFT JOIN
  lh
USING
  (client, page)
rviscomi commented 2 years ago

Pages data for April 2022. Changed rank to an INT64 field and added the required root_page field.

CREATE TEMP FUNCTION GET_CUSTOM_METRICS(payload STRING) RETURNS STRING LANGUAGE js AS '''
const $ = JSON.parse(payload);
return JSON.stringify(Object.fromEntries($._custom.map(name => {
  let value = $[`_${name}`];
  if (typeof value == 'string') {
    try {
      value = JSON.parse(value);
    } catch (e) {
      // The value is not a JSON string.
    }
  }
  return [name, value];
})));
''';

CREATE TEMP FUNCTION GET_FEATURES(payload STRING)
RETURNS ARRAY<STRUCT<feature STRING, id STRING, type STRING>> LANGUAGE js AS
'''
  function getFeatureNames(featureMap, featureType) {
    try {
      return Object.entries(featureMap).map(([key, value]) => {
        // After Feb 2020 keys are feature IDs.
        if (value.name) {
          return {'feature': value.name, 'type': featureType, 'id': key};
        }
        // Prior to Feb 2020 keys fell back to IDs if the name was unknown.
        if (idPattern.test(key)) {
          return {'feature': '', 'type': featureType, 'id': key.match(idPattern)[1]};
        }
        // Prior to Feb 2020 keys were names by default.
        return {'feature': key, 'type': featureType, 'id': ''};
      });
    } catch (e) {
      return [];
    }
  }

  var $ = JSON.parse(payload);
  if (!$._blinkFeatureFirstUsed) return [];

  var idPattern = new RegExp('^Feature_(\\d+)$');
  return getFeatureNames($._blinkFeatureFirstUsed.Features, 'default')
    .concat(getFeatureNames($._blinkFeatureFirstUsed.CSSFeatures, 'css'))
    .concat(getFeatureNames($._blinkFeatureFirstUsed.AnimatedCSSFeatures, 'animated-css'));
''';

WITH pages AS (
  SELECT
    _TABLE_SUFFIX AS client,
    url AS page,
    SAFE_CAST(JSON_VALUE(payload, '$._metadata.rank') AS INT64) AS rank,
    COALESCE(JSON_VALUE(payload, '$._metadata.crawl_depth') = '0', TRUE) AS is_root_page,
    COALESCE(JSON_VALUE(payload, '$._metadata.root_page_url'), url) AS root_page,
    JSON_VALUE(payload, '$._testID') AS wptid,
    GET_CUSTOM_METRICS(payload) AS custom_metrics,
    JSON_QUERY(payload, '$._metadata') AS metadata,
    payload
  FROM
    `httparchive.pages.2022_04_01_*`
), summary_pages AS (
  SELECT
    _TABLE_SUFFIX AS client,
    url AS page,
    rank,
    TO_JSON_STRING(summary_pages) AS summary
  FROM
    `httparchive.summary_pages.2022_04_01_*` AS summary_pages
), loose_technologies AS (
    SELECT
      _TABLE_SUFFIX AS client,
      url AS page,
      STRUCT(
        app AS technology,
        ARRAY_AGG(DISTINCT category ORDER BY category) AS categories,
        ARRAY_AGG(info) AS info
      ) AS technology
    FROM
      `httparchive.technologies.2022_04_01_*`
    GROUP BY
      client,
      page,
      app
), techs AS (
  SELECT
    client,
    page,
    ARRAY_AGG(technology) AS technologies
  FROM
    loose_technologies
  GROUP BY
    client,
    page
), lh AS (
  SELECT
    _TABLE_SUFFIX AS client,
    url AS page,
    report AS lighthouse
  FROM
    `httparchive.lighthouse.2022_04_01_*`
)

SELECT
  DATE('2022-04-01') AS date,
  client,
  page,
  is_root_page,
  root_page,
  COALESCE(pages.rank, summary_pages.rank) AS rank,
  wptid,
  payload,
  summary,
  custom_metrics,
  lighthouse,
  GET_FEATURES(payload) AS features,
  technologies,
  metadata
FROM
  pages
LEFT JOIN
  summary_pages
USING
  (client, page)
LEFT JOIN
  techs
USING
  (client, page)
LEFT JOIN
  lh
USING
  (client, page)
rviscomi commented 2 years ago

Ran into OOM issues when generating the all.requests table directly in BigQuery.

Since we'll eventually need to generate tables in the new all dataset from Dataflow, I decided to prototype how that pipeline would work. See https://github.com/HTTPArchive/bigquery/pull/170. Summary data is omitted for now since we'll be merging pipelines soon anyway.

I've successfully tested it on a single HAR and am now attempting a full-scale test on the entire 2022_05_12 crawl, both desktop and mobile concurrently. I expect it to take 6-8 hours (it's been running for ~2 so far).

tunetheweb commented 2 years ago

The all.requests table is clustered by the following columns:

- `client`
- `is_root_page`
- `is_main_document`
- `type`

It's also implicitly organized by date, via the table's partitioning.

We're only allowed 4 cluster fields, meaning we cannot add any more. However, I'm wondering whether is_main_document is that useful. It might be much more beneficial to cluster on page, which would let us cheaply join to the all.pages table. We could still filter for main documents relatively cheaply by first narrowing to the html type and then checking is_main_document (granted, not ALL main documents are type html, but most should be). Clustering on page would also let us filter on rank, technologies, and other page-level fields via that join. A sketch of that filter follows.
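For example, narrowing on the clustered type column before the boolean check (a sketch; the crawl date is illustrative):

SELECT
  COUNT(0) AS main_documents
FROM
  `httparchive.all.requests`
WHERE
  date = '2022-06-01' AND
  client = 'mobile' AND
  # Clustered field: prunes most blocks before the boolean check below
  type = 'html' AND
  is_main_document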

WDYT?

rviscomi commented 2 years ago

is_main_document is equivalent to the old firstHtml field, so having that parity might make migrating to the new schema easier. But I'm happy to explore other faster/cheaper schemas. Would you be willing to create a temp table with the proposed schema and benchmark a few test queries against each?
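Setting that up could look like this sketch (the scratch dataset name is hypothetical): copy one crawl into a table clustered on page, then dry-run the same queries against both tables and compare bytes processed.

CREATE TABLE `httparchive.scratch.requests_by_page`
PARTITION BY date
CLUSTER BY client, is_root_page, page, type
AS
SELECT *
FROM `httparchive.all.requests`
WHERE date = '2022-06-01';  # one crawl is enough for a benchmark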

brendankenny commented 2 years ago

If I query for

SELECT
  JSON_VALUE(lighthouse, '$.finalUrl') AS final_url
FROM `httparchive.all.pages`
WHERE
  date = '2022-10-01'
  AND client = 'mobile'
  AND is_root_page

The BQ UI warns me I'm going to be querying 13.65 TB, vs 4.63 TB for the (currently) equivalent

SELECT
  JSON_VALUE(report, '$.finalUrl') AS final_url
FROM `httparchive.lighthouse.2022_10_01_mobile`

Is there some partitioning/clustering trick I should be doing to bring the all number more in line with the lighthouse number? Or more that could be done to the table structure? That would be an unfortunate cost increase for querying the raw data.

tunetheweb commented 2 years ago

Weirdly, I don't see that; they look similar in cost to me:

[screenshot: BigQuery UI cost estimates for both queries]

Sure you're "holding it right"?

brendankenny commented 2 years ago

Huh, so weird. Pasting into a fresh editor, I get the same 13.65 TB.

Could being a member of the project somehow affect that? I don't see how, but I'm not sure what other difference there could be.

rviscomi commented 2 years ago

The old BQ estimation behavior didn't take clustered fields into consideration. I'm also seeing the new behavior with the 4.67 TB estimate, so maybe they've only partially rolled it out.

brendankenny commented 2 years ago

via the CLI:

bq query --nouse_legacy_sql --dry_run 'SELECT
  JSON_VALUE(lighthouse, "$.finalUrl") AS final_url,
FROM `httparchive.all.pages`
WHERE
  date = "2022-10-01"
  AND client = "mobile"
  AND is_root_page'

also gives me "this query will process upper bound of 15012318480052 bytes of data" (aka 13.65 TiB). Can either of you try it with a secondary cloud project to see if you get the same numbers?

If everyone eventually gets the lower number, obviously this isn't an actual issue.

tunetheweb commented 2 years ago

I see 13.65 TB when using a random account. Hmm...

rviscomi commented 2 years ago

If you run the query, it should still only process 4.67 TiB.
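One way to verify the bytes actually processed and billed after running it (a sketch using the INFORMATION_SCHEMA jobs view; the job ID is a placeholder):

SELECT
  total_bytes_processed,
  total_bytes_billed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) AND
  job_id = 'my_job_id'  # placeholder for the actual job ID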

max-ostapenko commented 4 weeks ago

Well, here is the current plan for this reorganization: https://docs.google.com/document/d/1kNCpVgvC0W77HJzXiNvlchuGjCCY6uymIfv0nH10tF0/edit