rviscomi opened this issue 2 years ago
Brainstorming new tables/schemas.
- `httparchive`
  - `all`
    - `pages`
    - `requests`
`httparchive.all.pages`

All-in-one table containing all page-based results from all crawls for all clients.

- Partition by: `date` (required)
- Cluster by: `client`, `is_root_page`, `rank`
Combines data from multiple existing tables:

- `pages`
- `summary_pages`
- `blink_features`
- `technologies`
- `lighthouse`
Additional enhancements:

- `custom_metrics` field
- `features` and `technologies` fields are well-defined repeated structs
- `categories` collapsed into a single repeated STRING field, rather than creating new technology+category pairs, which tripped up many users (including myself)
- `summary` field is JSON-encoded so we can support all historical `summary_pages` stats if we'd like, and add/remove anything over time without breaking the schema

[
{
"name": "date",
"type": "DATE",
"mode": "REQUIRED",
"description": "YYYY-MM-DD format of the HTTP Archive monthly crawl"
},
{
"name": "client",
"type": "STRING",
"mode": "REQUIRED",
"description": "Test environment: desktop or mobile"
},
{
"name": "page",
"type": "STRING",
"mode": "REQUIRED",
"description": "The URL of the page being tested"
},
{
"name": "is_root_page",
"type": "BOOLEAN",
"mode": "REQUIRED",
"description": "Whether the page is the root of the origin"
},
{
"name": "root_page",
"type": "STRING",
"mode": "REQUIRED",
"description": "The URL of the root page being tested, the origin followed by /"
},
{
"name": "rank",
"type": "INTEGER",
"mode": "NULLABLE",
"description": "Site popularity rank, from CrUX"
},
{
"name": "wptid",
"type": "STRING",
"mode": "NULLABLE",
"description": "ID of the WebPageTest results"
},
{
"name": "payload",
"type": "STRING",
"mode": "NULLABLE",
"description": "JSON-encoded WebPageTest results for the page"
},
{
"name": "summary",
"type": "STRING",
"mode": "NULLABLE",
"description": "JSON-encoded summarization of the page-level data"
},
{
"name": "custom_metrics",
"type": "STRING",
"mode": "NULLABLE",
"description": "JSON-encoded test results of the custom metrics"
},
{
"name": "lighthouse",
"type": "STRING",
"mode": "NULLABLE",
"description": "JSON-encoded Lighthouse report"
},
{
"name": "features",
"type": "RECORD",
"mode": "REPEATED",
"description": "Blink features detected at runtime (see https://chromestatus.com/features)",
"fields": [
{
"name": "feature",
"type": "STRING",
"mode": "NULLABLE",
"description": "Blink feature name"
},
{
"name": "id",
"type": "STRING",
"mode": "NULLABLE",
"description": "Blink feature ID"
},
{
"name": "type",
"type": "STRING",
"mode": "NULLABLE",
"description": "Blink feature type (css, default)"
}
]
},
{
"name": "technologies",
"type": "RECORD",
"mode": "REPEATED",
"description": "Technologies detected at runtime (see https://www.wappalyzer.com/)",
"fields": [
{
"name": "technology",
"type": "STRING",
"mode": "NULLABLE",
"description": "Name of the detected technology"
},
{
"name": "categories",
"type": "STRING",
"mode": "REPEATED",
"description": "List of categories to which this technology belongs"
},
{
"name": "info",
"type": "STRING",
"mode": "REPEATED",
"description": "Additional metadata about the detected technology, e.g. version number"
}
]
},
{
"name": "metadata",
"type": "STRING",
"mode": "NULLABLE",
"description": "Additional metadata about the test"
}
]
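As a quick sanity check on the schema above, here is a small JavaScript sketch (not part of the pipeline; the field list is copied from the schema) that verifies a candidate row carries every REQUIRED field:

```javascript
// REQUIRED fields copied from the httparchive.all.pages schema above.
const REQUIRED_FIELDS = ['date', 'client', 'page', 'is_root_page', 'root_page'];

// Returns the names of any required fields missing from a row object.
function missingRequiredFields(row) {
  return REQUIRED_FIELDS.filter((field) => row[field] === undefined || row[field] === null);
}

const row = {
  date: '2022-05-01',
  client: 'mobile',
  page: 'https://example.com/',
  is_root_page: true,
  root_page: 'https://example.com/',
};
console.log(missingRequiredFields(row)); // []
console.log(missingRequiredFields({ date: '2022-05-01' })); // lists each absent field
```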
Median page weight over time
SELECT
# No need to parse the date or client from _TABLE_SUFFIX
date,
client,
# Existing summary stats can be extracted from the JSON-encoded `summary` field
  APPROX_QUANTILES(CAST(JSON_VALUE(summary, '$.bytesTotal') AS INT64), 1000)[OFFSET(500)] AS median_page_weight
FROM
# New table address
`httparchive.all.pages`
WHERE
  # The table is partitioned by date, so we only pay to scan the partitions we select
date >= '2020-01-01' AND
# Only measure root/home pages for consistency
is_root_page
GROUP BY
date,
client
ORDER BY
date,
client
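One subtlety in the query above: `JSON_VALUE` returns a STRING, so numeric stats like `bytesTotal` need a cast before aggregating; otherwise values compare lexicographically rather than numerically. A quick JavaScript illustration of the pitfall:

```javascript
// String comparison sorts "900" above "10000" because it is lexicographic.
const asStrings = ['900', '10000', '2500'].sort();
console.log(asStrings); // ['10000', '2500', '900']

// Casting to numbers first gives the intended ordering.
const asNumbers = ['900', '10000', '2500'].map(Number).sort((a, b) => a - b);
console.log(asNumbers); // [900, 2500, 10000]
```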
`httparchive.all.requests`

All-in-one table containing all request-based results for all crawls and clients.

- Partition by: `date` (required)
- Cluster by: `client`, `is_root_page`, `is_main_document`, `type`
Combines data from multiple existing tables:

- `requests`
- `summary_requests`

Additional enhancements:

- `request_headers` and `response_headers` fields that are well-defined repeated structs of key/value pairs, borrowing from the `almanac.summary_response_bodies` table
- `summary` field is JSON-encoded so we can support all historical `summary_requests` stats if we'd like, and add/remove anything over time without breaking the schema

[
{
"name": "date",
"type": "DATE",
"mode": "REQUIRED",
"description": "YYYY-MM-DD format of the HTTP Archive monthly crawl"
},
{
"name": "client",
"type": "STRING",
"mode": "REQUIRED",
"description": "Test environment: desktop or mobile"
},
{
"name": "page",
"type": "STRING",
"mode": "REQUIRED",
"description": "The URL of the page being tested"
},
{
"name": "is_root_page",
"type": "BOOLEAN",
"mode": "NULLABLE",
"description": "Whether the page is the root of the origin."
},
{
"name": "root_page",
"type": "STRING",
"mode": "REQUIRED",
"description": "The URL of the root page being tested"
},
{
"name": "url",
"type": "STRING",
"mode": "REQUIRED",
"description": "The URL of the request"
},
{
"name": "is_main_document",
"type": "BOOLEAN",
"mode": "REQUIRED",
"description": "Whether this request corresponds with the main HTML document of the page, which is the first HTML request after redirects"
},
{
"name": "type",
"type": "STRING",
"mode": "NULLABLE",
"description": "Simplified description of the type of resource (script, html, css, text, other, etc)"
},
{
"name": "index",
"type": "INTEGER",
"mode": "NULLABLE",
"description": "The sequential 0-based index of the request"
},
{
"name": "payload",
"type": "STRING",
"mode": "NULLABLE",
"description": "JSON-encoded WebPageTest result data for this request"
},
{
"name": "summary",
"type": "STRING",
"mode": "NULLABLE",
"description": "JSON-encoded summarization of request data"
},
{
"name": "request_headers",
"type": "RECORD",
"mode": "REPEATED",
"description": "Request headers",
"fields": [
{
"name": "name",
"type": "STRING",
"mode": "NULLABLE",
"description": "Request header name"
},
{
"name": "value",
"type": "STRING",
"mode": "NULLABLE",
"description": "Request header value"
}
]
},
{
"name": "response_headers",
"type": "RECORD",
"mode": "REPEATED",
"description": "Response headers",
"fields": [
{
"name": "name",
"type": "STRING",
"mode": "NULLABLE",
"description": "Response header name"
},
{
"name": "value",
"type": "STRING",
"mode": "NULLABLE",
"description": "Response header value"
}
]
},
{
"name": "response_body",
"type": "STRING",
"mode": "NULLABLE",
"description": "Text-based response body"
}
]
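Since `request_headers` and `response_headers` are repeated name/value structs, header lookups should be case-insensitive. A hypothetical JavaScript sketch of how a consumer would typically read this field (the helper name and sample values are illustrative):

```javascript
// Look up a header value case-insensitively in a repeated name/value struct,
// mirroring how the response_headers field would typically be consumed.
function getHeader(headers, name) {
  const target = name.toLowerCase();
  const match = headers.find((h) => h.name.toLowerCase() === target);
  return match ? match.value : null;
}

// Fabricated example rows in the response_headers shape.
const responseHeaders = [
  { name: 'Content-Type', value: 'text/html; charset=utf-8' },
  { name: 'Cache-Control', value: 'max-age=600' },
];
console.log(getHeader(responseHeaders, 'content-type')); // 'text/html; charset=utf-8'
```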
Generating May 2022 data for `httparchive.all.pages`:
CREATE TEMP FUNCTION GET_CUSTOM_METRICS(payload STRING) RETURNS STRING LANGUAGE js AS '''
const $ = JSON.parse(payload);
return JSON.stringify(Object.fromEntries($._custom.map(name => {
let value = $[`_${name}`];
if (typeof value == 'string') {
try {
value = JSON.parse(value);
} catch (e) {
// The value is not a JSON string.
}
}
return [name, value];
})));
''';
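To see what `GET_CUSTOM_METRICS` does, here is the same function run as plain JavaScript against a minimal, fabricated payload (the `_custom` array names each custom metric; string values that parse as JSON are inlined, others are kept verbatim):

```javascript
// Same logic as the GET_CUSTOM_METRICS UDF above, as a plain function.
function getCustomMetrics(payload) {
  const $ = JSON.parse(payload);
  return JSON.stringify(Object.fromEntries($._custom.map((name) => {
    let value = $[`_${name}`];
    if (typeof value == 'string') {
      try {
        value = JSON.parse(value);
      } catch (e) {
        // The value is not a JSON string.
      }
    }
    return [name, value];
  })));
}

// Minimal fabricated payload: one JSON-valued metric, one plain string.
const payload = JSON.stringify({
  _custom: ['javascript', 'doctype'],
  _javascript: '{"ajax_requests": 2}',
  _doctype: 'html',
});
console.log(getCustomMetrics(payload));
// {"javascript":{"ajax_requests":2},"doctype":"html"}
```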
CREATE TEMP FUNCTION GET_FEATURES(payload STRING)
RETURNS ARRAY<STRUCT<feature STRING, id STRING, type STRING>> LANGUAGE js AS
'''
function getFeatureNames(featureMap, featureType) {
try {
return Object.entries(featureMap).map(([key, value]) => {
// After Feb 2020 keys are feature IDs.
if (value.name) {
return {'feature': value.name, 'type': featureType, 'id': key};
}
// Prior to Feb 2020 keys fell back to IDs if the name was unknown.
if (idPattern.test(key)) {
return {'feature': '', 'type': featureType, 'id': key.match(idPattern)[1]};
}
// Prior to Feb 2020 keys were names by default.
return {'feature': key, 'type': featureType, 'id': ''};
});
} catch (e) {
return [];
}
}
var $ = JSON.parse(payload);
if (!$._blinkFeatureFirstUsed) return [];
var idPattern = new RegExp('^Feature_(\\d+)$');
return getFeatureNames($._blinkFeatureFirstUsed.Features, 'default')
.concat(getFeatureNames($._blinkFeatureFirstUsed.CSSFeatures, 'css'))
.concat(getFeatureNames($._blinkFeatureFirstUsed.AnimatedCSSFeatures, 'animated-css'));
''';
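The three branches in `getFeatureNames` correspond to the three key formats the crawl has used over time. A plain-JavaScript run over one fabricated entry of each shape:

```javascript
const idPattern = new RegExp('^Feature_(\\d+)$');

// Same logic as in the GET_FEATURES UDF above.
function getFeatureNames(featureMap, featureType) {
  try {
    return Object.entries(featureMap).map(([key, value]) => {
      // After Feb 2020 keys are feature IDs.
      if (value.name) {
        return { feature: value.name, type: featureType, id: key };
      }
      // Prior to Feb 2020 keys fell back to IDs if the name was unknown.
      if (idPattern.test(key)) {
        return { feature: '', type: featureType, id: key.match(idPattern)[1] };
      }
      // Prior to Feb 2020 keys were names by default.
      return { feature: key, type: featureType, id: '' };
    });
  } catch (e) {
    return [];
  }
}

// One entry per historical key format (fabricated feature names/IDs).
const features = getFeatureNames({
  '2089': { name: 'CSSFlexibleBox' }, // modern: ID key carrying a name
  'Feature_123': {},                  // legacy: unknown name, ID fallback
  'PictureElement': {},               // legacy: name key
}, 'default');
console.log(features);
```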
WITH pages AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
JSON_VALUE(payload, '$._metadata.rank') AS rank,
JSON_VALUE(payload, '$._metadata.crawl_depth') = '0' AS is_root_page,
JSON_VALUE(payload, '$._testID') AS wptid,
GET_CUSTOM_METRICS(payload) AS custom_metrics,
JSON_QUERY(payload, '$._metadata') AS metadata,
payload
FROM
`httparchive.pages.2022_05_01_*`
  # TODO: Backfill when the summary_pages data is ready.
/* ), summary_pages AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
TO_JSON_STRING(summary_pages) AS summary
FROM
`httparchive.summary_pages.2022_05_01_*` AS summary_pages */
), loose_technologies AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
STRUCT(
app AS technology,
ARRAY_AGG(DISTINCT category ORDER BY category) AS categories,
ARRAY_AGG(info) AS info
) AS technology
FROM
`httparchive.technologies.2022_05_01_*`
GROUP BY
client,
page,
app
), techs AS (
SELECT
client,
page,
ARRAY_AGG(technology) AS technologies
FROM
loose_technologies
GROUP BY
client,
page
), lh AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
report AS lighthouse
FROM
`httparchive.lighthouse.2022_05_01_*`
)
SELECT
DATE('2022-05-01') AS date,
client,
page,
is_root_page,
rank,
wptid,
payload,
# TODO: Update when the summary pipeline completes successfully.
'' AS summary,
custom_metrics,
lighthouse,
GET_FEATURES(payload) AS features,
technologies,
metadata
FROM
pages
LEFT JOIN
techs
USING
(client, page)
LEFT JOIN
lh
USING
(client, page)
Pages data for April 2022. Changed `rank` to an INT64 field and added the required `root_page` field.
CREATE TEMP FUNCTION GET_CUSTOM_METRICS(payload STRING) RETURNS STRING LANGUAGE js AS '''
const $ = JSON.parse(payload);
return JSON.stringify(Object.fromEntries($._custom.map(name => {
let value = $[`_${name}`];
if (typeof value == 'string') {
try {
value = JSON.parse(value);
} catch (e) {
// The value is not a JSON string.
}
}
return [name, value];
})));
''';
CREATE TEMP FUNCTION GET_FEATURES(payload STRING)
RETURNS ARRAY<STRUCT<feature STRING, id STRING, type STRING>> LANGUAGE js AS
'''
function getFeatureNames(featureMap, featureType) {
try {
return Object.entries(featureMap).map(([key, value]) => {
// After Feb 2020 keys are feature IDs.
if (value.name) {
return {'feature': value.name, 'type': featureType, 'id': key};
}
// Prior to Feb 2020 keys fell back to IDs if the name was unknown.
if (idPattern.test(key)) {
return {'feature': '', 'type': featureType, 'id': key.match(idPattern)[1]};
}
// Prior to Feb 2020 keys were names by default.
return {'feature': key, 'type': featureType, 'id': ''};
});
} catch (e) {
return [];
}
}
var $ = JSON.parse(payload);
if (!$._blinkFeatureFirstUsed) return [];
var idPattern = new RegExp('^Feature_(\\d+)$');
return getFeatureNames($._blinkFeatureFirstUsed.Features, 'default')
.concat(getFeatureNames($._blinkFeatureFirstUsed.CSSFeatures, 'css'))
.concat(getFeatureNames($._blinkFeatureFirstUsed.AnimatedCSSFeatures, 'animated-css'));
''';
WITH pages AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
SAFE_CAST(JSON_VALUE(payload, '$._metadata.rank') AS INT64) AS rank,
COALESCE(JSON_VALUE(payload, '$._metadata.crawl_depth') = '0', TRUE) AS is_root_page,
COALESCE(JSON_VALUE(payload, '$._metadata.root_page_url'), url) AS root_page,
JSON_VALUE(payload, '$._testID') AS wptid,
GET_CUSTOM_METRICS(payload) AS custom_metrics,
JSON_QUERY(payload, '$._metadata') AS metadata,
payload
FROM
`httparchive.pages.2022_04_01_*`
), summary_pages AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
rank,
TO_JSON_STRING(summary_pages) AS summary
FROM
`httparchive.summary_pages.2022_04_01_*` AS summary_pages
), loose_technologies AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
STRUCT(
app AS technology,
ARRAY_AGG(DISTINCT category ORDER BY category) AS categories,
ARRAY_AGG(info) AS info
) AS technology
FROM
`httparchive.technologies.2022_04_01_*`
GROUP BY
client,
page,
app
), techs AS (
SELECT
client,
page,
ARRAY_AGG(technology) AS technologies
FROM
loose_technologies
GROUP BY
client,
page
), lh AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
report AS lighthouse
FROM
`httparchive.lighthouse.2022_04_01_*`
)
SELECT
DATE('2022-04-01') AS date,
client,
page,
is_root_page,
root_page,
COALESCE(pages.rank, summary_pages.rank) AS rank,
wptid,
payload,
summary,
custom_metrics,
lighthouse,
GET_FEATURES(payload) AS features,
technologies,
metadata
FROM
pages
LEFT JOIN
summary_pages
USING
(client, page)
LEFT JOIN
techs
USING
(client, page)
LEFT JOIN
lh
USING
(client, page)
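The COALESCE defaults for `is_root_page` and `root_page` in the April query amount to the following fallback logic (a JavaScript sketch, assuming the `_metadata` shape shown above; crawls that predate secondary pages carry no `crawl_depth`, so every page defaults to a root page):

```javascript
// Mirrors: COALESCE(JSON_VALUE(payload, '$._metadata.crawl_depth') = '0', TRUE)
// and:     COALESCE(JSON_VALUE(payload, '$._metadata.root_page_url'), url)
function deriveRootInfo(metadata, url) {
  const isRootPage = metadata.crawl_depth == null ? true : metadata.crawl_depth === '0';
  const rootPage = metadata.root_page_url ?? url;
  return { isRootPage, rootPage };
}

// No crawl_depth recorded: treated as a root page, rooted at itself.
console.log(deriveRootInfo({}, 'https://example.com/'));
// { isRootPage: true, rootPage: 'https://example.com/' }

// A secondary page points back at its root page.
console.log(deriveRootInfo(
  { crawl_depth: '1', root_page_url: 'https://example.com/' },
  'https://example.com/contact'
));
// { isRootPage: false, rootPage: 'https://example.com/' }
```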
Ran into OOM issues generating the `all.requests` table directly in BigQuery. Since we'll eventually need to generate tables in the new `all` dataset from Dataflow, I decided to prototype how that pipeline would work. See https://github.com/HTTPArchive/bigquery/pull/170. Summary data is omitted for now since we'll be merging pipelines soon anyway.
I've successfully tested it on a single HAR and now attempting a full-scale test on the entire 2022_05_12 crawl, both desktop and mobile concurrently. I expect it to take 6-8 hours (running for ~2 so far).
The `all.requests` table is clustered by the following columns:

- `client`
- `is_root_page`
- `is_main_document`
- `type`

as well as implicitly by the `date` partitioning on this table.

We're only allowed 4 cluster fields, meaning we cannot add any more. However, I'm wondering if `is_main_document` is that useful? It might be much more beneficial to cluster on `page`, to allow us to cheaply join to the `all.pages` table. We could still filter for `is_main_document` relatively cheaply by looking at the `html` type and then filtering on `is_main_document` (appreciating that not ALL main documents will be type HTML, but most should be). Clustering on `page` would also allow us to filter based on rank and technologies (again, both via a join) and similarly other items.
WDYT?
`is_main_document` is equivalent to the old `firstHtml` field, so having that parity might make migrating to the new schema easier. But I'm happy to explore other faster/cheaper schemas. Would you be willing to create a temp table with the proposed schema and benchmark a few test queries against each?
If I query for
SELECT
JSON_VALUE(lighthouse, '$.finalUrl') AS final_url,
FROM `httparchive.all.pages`
WHERE
date = "2022-10-01"
AND client = 'mobile'
AND is_root_page
The BQ UI warns me I'm going to be querying 13.65 TB, vs 4.63 TB for the (currently) equivalent
SELECT
JSON_VALUE(report, '$.finalUrl') AS final_url,
FROM `httparchive.lighthouse.2022_10_01_mobile`
Is there some partitioning/clustering trick I should be doing to bring the `all` number more in line with the `lighthouse` number? Or more that could be done to the table structure? That would be an unfortunate cost increase for querying the raw data.
Weirdly, I don't see that and they look similar in cost to me. Sure you're "holding it right"?

Huh, so weird. Pasting into a fresh editor, same 13.65 TB. Could being a member of the project somehow affect that? I don't see how, but I'm not sure what other difference there could be.
The old BQ estimation behavior was to not take clustered fields into consideration. I'm also seeing the new behavior showing the 4.67 TB estimate, so maybe they've only partially rolled it out.
via the CLI:
bq query --nouse_legacy_sql --dry_run 'SELECT
JSON_VALUE(lighthouse, "$.finalUrl") AS final_url,
FROM `httparchive.all.pages`
WHERE
date = "2022-10-01"
AND client = "mobile"
AND is_root_page'
also gives me "this query will process upper bound of 15012318480052 bytes of data" (aka 13.65 TiB). Can either of you try it with a secondary cloud project to see if you get the same numbers?
If everyone eventually gets the lower number, obviously this isn't an actual issue.
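For reference, the CLI's raw byte count does match the UI's TiB figure; a one-line sanity check:

```javascript
// Convert the dry-run byte estimate to TiB (1 TiB = 2^40 bytes).
const bytes = 15012318480052;
const tib = bytes / 2 ** 40;
console.log(tib.toFixed(2)); // '13.65'
```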
I see 13.65TB when using a random account. Hmm....
If you run the query it should still only process 4.67 TiB
Well, here is the current plan for this reorganization: https://docs.google.com/document/d/1kNCpVgvC0W77HJzXiNvlchuGjCCY6uymIfv0nH10tF0/edit

Similar to the `almanac` dataset, we want the monthly results to be queryable in partitioned and clustered tables. We'll need a deprecation plan to avoid suddenly breaking existing queries (`generate_reports`).