I'd consider this, https://github.com/HTTPArchive/data-pipeline/issues/189 (add rank to the requests table), and https://github.com/HTTPArchive/data-pipeline/issues/149 (optimizing summary fields) to be the last schema changes before considering the new all dataset relatively stable.
I see two ways we can optimize processing cost/performance of custom metrics:
I'm still looking into the most popular ones, but here are the biggest:
Note: we can add a record child column with ALTER, but can't remove it (at least currently). So we should split out only future-proof metrics as record fields; it's much easier with non-record columns.
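For reference, a sketch of what the add-but-not-remove behavior could look like in BigQuery DDL (a hedged sketch: it assumes struct widening via ALTER COLUMN ... SET DATA TYPE is the accepted path, and custom_metrics is assumed to already be a STRUCT; all field names are illustrative; if SET DATA TYPE isn't accepted for this, the same addition can be made with a schema update via the bq CLI or API):

```sql
-- Hypothetical: redefine the STRUCT with one extra child field, which
-- BigQuery treats as an allowed widening of the column type.
ALTER TABLE `httparchive.all.pages`
ALTER COLUMN custom_metrics
SET DATA TYPE STRUCT<
  performance STRING,
  other STRING,
  new_metric STRING  -- newly added child field (illustrative name)
>;
-- There is no narrowing equivalent: a STRUCT definition that drops
-- new_metric is rejected, so the child field can't currently be removed.
```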
Some insights into the metric sizes:
CREATE TEMP FUNCTION calculate_object_size(cm STRING)
RETURNS ARRAY<STRUCT<metric STRING, chars INT64>>
LANGUAGE js AS """
  let result = [];
  try {
    cm = JSON.parse(cm);
  } catch {
    return result;  // skip rows with invalid JSON
  }
  for (const property in cm) {
    let length = JSON.stringify(cm[property]).length;
    result.push({ metric: property, chars: length });
  }
  return result;
""";
SELECT
size.metric,
APPROX_QUANTILES(size.chars, 2) AS chars_quantiles,
SUM(size.chars) AS chars_sum,
COUNT(DISTINCT page) AS cnt_pages
FROM `all.pages` TABLESAMPLE SYSTEM (3 PERCENT),
UNNEST(calculate_object_size(custom_metrics)) AS size
WHERE date = '2024-07-01'
GROUP BY 1
ORDER BY chars_sum DESC
Questions:
- @pmeenan isn't parsed_css split into a separate table? The content seems very similar, though the amount of data is not.
- @tunetheweb is css-variables related to the parsed_css?
- @tunetheweb Found an issue about parsed_css cleanup, but was it not completed?
Hmmm, it appears some of this fails on SAFE.PARSE_JSON(custom_metrics) IS NOT NULL, so it was excluded from the cleanup :-(
Doing this in JavaScript seems to work better from a quick test (I guess it has a better JSON parser than BigQuery), but it is an awful lot slower :-(
CREATE TEMP FUNCTION remove_parsed_css(cm STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    cm = JSON.parse(cm);
    delete cm.parsed_css;
    return JSON.stringify(cm);
  } catch {
    return '';
  }
""";
UPDATE
`httparchive.all.pages`
SET
custom_metrics = remove_parsed_css(custom_metrics)
WHERE
date = '2024-08-01' AND
custom_metrics LIKE '%parsed_css%' AND
remove_parsed_css(custom_metrics) != '' AND
client = 'desktop' AND
is_root_page
@pmeenan isn't parsed_css split into a separate table? The content seems very similar, though the amount of data is not.
parsed_css was removed from the custom metrics at the end of last month and should not be there in September's crawl.
OK, parsed_css is all removed now (minus 10 pages or so that have invalid JSON).
Your stats are super interesting!
@tunetheweb is css-variables related to the parsed_css?
I think so. Maybe it makes sense to move that into the parsed_css table too and have a type column to differentiate them (see the sketch below)? The fact this is > 50% of the custom_metrics blob would really help slim custom_metrics down to a more reasonable size. Then maybe we could keep the rest in there? Or we could move a few of the chunkier ones (responsive images, almanac, etc.) to their own columns and keep the rest as is.
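For illustration, the suggested shape might look something like this (a hypothetical sketch; the real parsed_css table schema and all names here are assumed, not the actual definition):

```sql
-- Hypothetical combined table: both payloads in one place, with a type
-- column to tell them apart.
CREATE TABLE `httparchive.all.parsed_css_combined` (
  date DATE,
  client STRING,
  page STRING,
  is_root_page BOOL,
  type STRING,   -- e.g. 'parsed_css' or 'css_variables' (assumed values)
  payload JSON
)
PARTITION BY date
CLUSTER BY client, type;
```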
Also worth revisiting the metrics themselves to be certain they need all that data.
True dat! I don't know the CSS ones well enough, but I had a look at the responsive images one, and it doesn't look that unreasonable. But it does store quite a bit of metadata about every image on the page, and for pages with many images that adds up:
{
"hasSrc": true,
"hasAlt": true,
"isInPicture": false,
"hasCustomDataAttributes": false,
"hasWidth": false,
"hasHeight": false,
"url": "http://www.turistickelisty.sportovnilisty.cz/img/_antialias_a8a44905-cc85-435b-bdb8-8a333c2a7498_acee3968cd95feb2c683ac90ecad8b1d.png",
"totalCandidates": 1,
"altAttribute": "",
"clientWidth": 950,
"clientHeight": 200,
"naturalWidth": 950,
"naturalHeight": 200,
"hasSrcset": false,
"hasSizes": false,
"currentSrcDensity": 1,
"approximateResourceWidth": 950,
"approximateResourceHeight": 200,
"byteSize": 274637,
"bitsPerPixel": 11.563663157894737,
"mimeType": null,
"computedSizingStyles": {
"width": "auto",
"height": "auto",
"maxWidth": "100%",
"maxHeight": "none",
"minWidth": "auto",
"minHeight": "auto"
},
"intrinsicOrExtrinsicSizing": {
"width": "both",
"height": "intrinsic"
},
"reservedLayoutDimensions": false
},
Heck, even shortening the keys (and maybe keeping a comment in the custom metric code) could have a big impact, like "nW" for "naturalWidth", "aW" for "approximateResourceWidth", etc. Though to be backward compatible, it would make the queries a bit more complicated.
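To make that concrete, a minimal sketch of what key shortening might look like inside a custom metric (the key map and helper function are hypothetical, not the actual custom-metrics code):

```js
// Hypothetical short-key map; keeping it in the custom metric source
// doubles as the documentation of the long names.
const KEY_MAP = {
  naturalWidth: 'nW',
  naturalHeight: 'nH',
  approximateResourceWidth: 'aW',
  approximateResourceHeight: 'aH',
  clientWidth: 'cW',
  clientHeight: 'cH',
};

// Rename known keys on a per-image record before serializing the metric;
// unknown keys pass through unchanged.
function shortenKeys(record) {
  const out = {};
  for (const [key, value] of Object.entries(record)) {
    out[KEY_MAP[key] || key] = value;
  }
  return out;
}
```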
I'd probably also want to make sure the URLs for things don't include full data URIs - maybe truncate anything data: to just that (or include a length).
Heck, even shortening the keys (and maybe keeping a comment in the custom metric code) could have a big impact, like "nW" for "naturalWidth", "aW" for "approximateResourceWidth", etc. Though to be backward compatible, it would make the queries a bit more complicated.
Yup! JSON is way too chatty. Now that I know how to update the JSON, we could look at shortening these down. Let's wait until the Web Almanac calms down and then do that.
I'd probably also want to make sure the URLs for things don't include full data URIs - maybe truncate anything data: to just that (or include a length).
Great shout. I checked and they ARE included. Fix in https://github.com/HTTPArchive/custom-metrics/pull/143
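For illustration, the truncation could look something like this (a hypothetical sketch, not the actual change in that PR):

```js
// Replace an inline data: URL with a short placeholder that records its length.
function truncateDataUrl(url) {
  if (typeof url === 'string' && url.startsWith('data:')) {
    return `data:[truncated, ${url.length} chars]`;
  }
  return url;
}
```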
Also worth revisiting the metrics themselves to be certain they need all that data.
Yes, we should be able to identify and warn when storing such raw data in custom_metrics.
@tunetheweb Seems such data URLs are also a reason behind the huge size of the css-variables metric.
Example:
SELECT
JSON_QUERY(custom_metrics, "$.css-variables.summary.--playkit-icon-quality-HD-active-url.set.0.value")
FROM `httparchive.sample_data.pages_10k`
WHERE
client = 'mobile' AND
page = "https://www.morgenpost.de/politik/article407241629/game-changer-fuer-die-ukraine-bahnt-sich-an-putin-droht.html"
After looking into the frequency of custom_metrics usage, it doesn't seem like there is much usage outside of the Web Almanac.
These are the most popular as of now:
Log analytics query:
SELECT
metrics,
--ARRAY_AGG(DISTINCT project IGNORE NULLS) AS projects,
ARRAY_AGG(DISTINCT principal_email IGNORE NULLS) AS accounts,
COUNT(DISTINCT insert_id) AS cnt
FROM (
SELECT
JSON_VALUE(COALESCE(proto_payload.audit_log.service_data.jobCompletedEvent.job.jobName.projectId, proto_payload.audit_log.service_data.jobInsertResponse.resource.jobName.projectId, proto_payload.audit_log.service_data.jobGetQueryResultsResponse.job.jobName.projectId)) AS project,
proto_payload.audit_log.authentication_info.principal_email AS principal_email,
insert_id,
proto_payload
FROM `httparchive.global._Default._Default`
WHERE
proto_payload.audit_log.service_name="bigquery.googleapis.com"
),
UNNEST(REGEXP_EXTRACT_ALL(JSON_VALUE(proto_payload.audit_log.service_data.jobInsertRequest.resource.jobConfiguration.query.query), r"\(custom_metrics\,\ '\$\.([a-z0-9_]+)")) AS metrics
WHERE
REGEXP_CONTAINS(JSON_VALUE(proto_payload.audit_log.service_data.jobInsertRequest.resource.jobConfiguration.query.query), "custom_metrics")
GROUP BY 1
ORDER BY cnt DESC
And we can see external analysts who use custom metrics, but we don't know which metrics exactly:
SELECT
project,
principal_email,
COUNT(DISTINCT insert_id) AS cnt
FROM (
SELECT
COALESCE(JSON_VALUE(COALESCE(proto_payload.audit_log.service_data.jobCompletedEvent.job.jobName.projectId, proto_payload.audit_log.service_data.jobInsertResponse.resource.jobName.projectId, proto_payload.audit_log.service_data.jobGetQueryResultsResponse.job.jobName.projectId)), REGEXP_EXTRACT(JSON_VALUE(proto_payload.audit_log.metadata.tableDataRead.jobName), '^projects/([^/]+)/jobs/')) AS project,
proto_payload.audit_log.authentication_info.principal_email AS principal_email,
insert_id,
proto_payload
FROM `httparchive.global._Default._Default`
WHERE
proto_payload.audit_log.service_name="bigquery.googleapis.com"
)
WHERE
"custom_metrics" IN UNNEST(JSON_VALUE_ARRAY(proto_payload.audit_log.metadata.tableDataRead.fields)) AND
project != 'httparchive'
GROUP BY 1,2
ORDER BY cnt DESC
Updated insights spreadsheet
We can get the column down to 30% of its size if we review the metrics themselves (~900 GB for is_root_page AND client = 'mobile').
@rviscomi why did you consider javascript, media, and performance as 'core' and ready to be split?
As long as we're talking about the responsive images metrics specifically: IIRC there is some overlap between Images.js, media.js, responsive-images.js, and img-loading-attr.js. I have thought about trying to consolidate them but was worried about disrupting analysis over time.
@eeeps if you are working with images metrics data and have enhancement suggestions - it may be a good time to look at it now. We will be processing historical data to update the table schema, so metrics could be consolidated retrospectively.
The custom_metrics field of the pages table is a JSON blob containing all 50+ custom metrics. Querying ANY custom metric is as expensive as querying ALL custom metrics. As of March 2024, querying over all custom metrics (desktop and mobile, root and secondary pages) processes 7.91 TB and takes about 4 minutes.
The reasoning for having all custom metrics in one big blob, as opposed to a well-defined BigQuery struct, was to avoid having to change the schema whenever custom metrics were added/removed. This provides simplicity and consistency for queries that process data over many months.
An alternative approach that both reduces query costs and minimizes schema changes would be to extract a few core custom metrics and make them available in a struct of smaller blobs. The core custom metrics would include ones like javascript.js, media.js, and performance.js. As a rule of thumb, custom metrics corresponding to individual chapters in the Web Almanac could be eligible for this core subset. All remaining custom metrics would be made available in a JSON blob named other.
So instead of a single custom_metrics field of type STRING, there would be a custom_metrics field of type STRUCT containing named fields corresponding to the core custom metrics.
As a proof of concept, here's a query that creates a scratchspace table with the performance custom metric extracted into its own field and everything else in an other field (a sketch of such a query follows below). Running an example query over the existing 2024-03-01 dataset processes 4.35 TB in 4 min 33 sec.
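A hedged sketch of what that proof-of-concept query might look like (the scratchspace table name and UDF are assumptions, not the original query; it follows the thread's JS-UDF approach since BigQuery's native JSON parser chokes on some rows):

```sql
-- Hypothetical sketch: split one metric out of the blob, keep the rest in `other`.
CREATE TEMP FUNCTION split_metric(cm STRING, metric STRING)
RETURNS STRUCT<extracted STRING, other STRING>
LANGUAGE js AS """
  try {
    const parsed = JSON.parse(cm);
    const extracted = metric in parsed ? JSON.stringify(parsed[metric]) : null;
    delete parsed[metric];
    return { extracted: extracted, other: JSON.stringify(parsed) };
  } catch {
    return { extracted: null, other: cm };  // keep invalid JSON as-is
  }
""";

CREATE TABLE `httparchive.scratchspace.pages_split_metrics` AS
SELECT
  * EXCEPT (custom_metrics, split),
  STRUCT(split.extracted AS performance, split.other AS other) AS custom_metrics
FROM (
  SELECT *, split_metric(custom_metrics, 'performance') AS split
  FROM `httparchive.all.pages`
  WHERE date = '2024-03-01'
)
```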
Here's the relevant part of that example query showing how it would look using the new schema in the scratchspace table:
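Again as a sketch (the table name matches the hypothetical one above, and lcp_elem_stats is assumed to be a field the performance custom metric exposes):

```sql
-- With the struct schema, only the performance field is scanned,
-- not the entire custom_metrics blob.
SELECT
  JSON_VALUE(custom_metrics.performance, '$.lcp_elem_stats.nodeName') AS lcp_node,
  COUNT(0) AS pages
FROM `httparchive.scratchspace.pages_split_metrics`
GROUP BY lcp_node
ORDER BY pages DESC
```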
This query returns the same result, but only processes 123.78 GB in 53 sec. Or, 3% of the amount of data processed in 19% of the time with no loss of quality.