HTTPArchive / data-pipeline

The new HTTP Archive data pipeline built entirely on GCP
Apache License 2.0
5 stars 0 forks source link

Split the custom metrics JSON into structured fields #262

Closed rviscomi closed 4 weeks ago

rviscomi commented 7 months ago

The custom_metrics field of the pages table is a JSON blob containing all 50+ custom metrics. Querying ANY custom metric is as expensive as querying ALL custom metrics. As of March 2024, querying over all custom metrics (desktop and mobile, root and secondary pages) processes 7.91 TB and takes about 4 minutes.

The reasoning for having all custom metrics in a big blob as opposed to a well-defined BigQuery struct was to avoid having to change the schema whenever custom metrics were added/removed. This provides simplicity and consistency for queries that process data over many months.

An alternative approach that both reduces query costs and minimizes schema changes would be to extract a few core custom metrics and make them available in a struct of smaller blobs. The core custom metrics would include ones like javascript.js, media.js, and performance.js. As a rule of thumb, custom metrics corresponding to individual chapters in the Web Almanac could be eligible for this core subset. All remaining custom metrics would be made available in a JSON blob named other.

So instead of a single custom_metrics field of type STRING, there would be a custom_metrics field of type STRUCT containing named fields corresponding to the core custom metrics.


As a proof of concept, here's a query that creates a scratchspace table with the performance custom metric extracted into its own field, with everything else in an other field:

CREATE TEMP FUNCTION GET_CUSTOM_METRICS(custom_metrics STRING)
RETURNS STRUCT<performance STRING, other STRING> LANGUAGE js AS '''
  const topLevelMetrics = new Set([
    'performance'
  ]);
  try {
    custom_metrics = JSON.parse(custom_metrics);
  } catch {
    return {};
  }

  if (!custom_metrics) {
    return {};
  }

  const performance = JSON.stringify(custom_metrics.performance);
  delete custom_metrics.performance;

  const other = JSON.stringify(custom_metrics);

  return {
    performance,
    other
  }
''';

CREATE OR REPLACE TABLE `httparchive.scratchspace.custom_metrics_struct`
PARTITION BY date
CLUSTER BY client, is_root_page, rank
AS SELECT
  * EXCEPT (custom_metrics),
  GET_CUSTOM_METRICS(custom_metrics) AS custom_metrics
FROM
  `httparchive.all.pages`
WHERE
  date = '2024-03-01'

Running an example query over the existing 2024-03-01 dataset processes 4.35 TB in 4 min 33 sec.

Here's the relevant part of that example query showing how it would look using the new schema in the scratchspace table:

WITH lcp_stats AS (
  SELECT
    client,
    isLazyLoaded(JSON_EXTRACT(custom_metrics.performance, '$.lcp_elem_stats.attributes')) AS native_lazy,
    hasLazyHeuristics(JSON_EXTRACT(custom_metrics.performance, '$.lcp_elem_stats.attributes')) AS custom_lazy
  FROM
    `httparchive.scratchspace.custom_metrics_struct`
  WHERE
    date = '2024-03-01' AND
    is_root_page AND
    JSON_EXTRACT_SCALAR(custom_metrics.performance, '$.lcp_elem_stats.nodeName') = 'IMG'
)

This query returns the same result, but only processes 123.78 GB in 53 sec. Or, 3% of the amount of data processed in 19% of the time with no loss of quality.

rviscomi commented 7 months ago

I'd consider this and https://github.com/HTTPArchive/data-pipeline/issues/189 (add rank to the requests table) and https://github.com/HTTPArchive/data-pipeline/issues/149 (optimizing summary fields) to be the last schema changes before considering the new all dataset relatively stable.

max-ostapenko commented 1 month ago

I see two ways we can optimize processing cost/performance of custom metrics:

  1. split the popular custom metrics,
  2. split the biggest ones.

I'm still looking into the most popular ones, but here are the biggest:

  1. css-variables 38.06%
  2. parsed_css 18.54%
  3. responsive_images 12.95%

Note: we can add a record child column with ALTER, but can't remove it (at least currently). So we should split only the future-proof metrics as record fields. It's much easier with non-record columns.

Some insights into the metrics sizes

CREATE TEMP FUNCTION calculate_object_size(cm STRING)
RETURNS ARRAY<STRUCT<metric STRING, chars INT64>>
LANGUAGE js AS """
  let result = [];
  cm = JSON.parse(cm);
  for (const property in cm) {
    let length = JSON.stringify(cm[property]).length;
    result.push({ metric: property, chars: length });
  }
  return result;
""";

SELECT
  size.metric,
  APPROX_QUANTILES(size.chars, 2) AS chars_quantiles,
  SUM(size.chars) AS chars_sum,
  COUNT(DISTINCT page) AS cnt_pages
FROM `all.pages` TABLESAMPLE SYSTEM (3 PERCENT),
  UNNEST(calculate_object_size(custom_metrics)) AS size
WHERE date = '2024-07-01'
GROUP BY 1
ORDER BY chars_sum DESC

Questions:

max-ostapenko commented 1 month ago

@tunetheweb Found an issue about parsed_css cleanup but was it not completed?

tunetheweb commented 1 month ago

Hmmm it appears some of this fails on SAFE.PARSE_JSON(custom_metrics) IS NOT NULL so it was excluded from the cleanup :-(

Doing this in JavaScript seems to work better from a quick test (I guess it has a better JSON parser than BigQuery) but is an awful lot slower :-(

CREATE TEMP FUNCTION remove_parsed_css(cm STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    cm = JSON.parse(cm);
    delete cm.parsed_css;
    return JSON.stringify(cm);
  } catch {
    return ''
  }
""";

UPDATE
  `httparchive.all.pages`
SET
  custom_metrics = remove_parsed_css(custom_metrics)
WHERE
  date = '2024-08-01' AND
  custom_metrics LIKE '%parsed_css%' AND
  remove_parsed_css(custom_metrics) != '' AND
  client = 'desktop' AND
  is_root_page
pmeenan commented 1 month ago
  • @pmeenan isn't parsed_css split into a separate table? The content seems very similar, though the amount of data is not.

parsed_css was removed from the custom metrics at the end of last month and should not be there in September's crawl.

tunetheweb commented 1 month ago

OK parsed_css all removed now (minus 10 pages or so that have invalid JSON).

You stats are super interesting!

@tunetheweb is css-variables related to the parsed_css?

I think so. Maybe it makes sense to move that into the parsed_css table too and have a type column to differentiate them? The fact this is > 50% of the custom_metric would really help slim the custom_metrics down to a more reasonable size. Then maybe we could keep the rest in there? Or could move a few of the chunkier ones (responsive images, almanac...etc) to their own column and keep the rest as is.

pmeenan commented 1 month ago

Also worth revisiting the metrics themselves to be certain they need all that data.

On Thu, Sep 12, 2024 at 3:58 PM Barry Pollard @.***> wrote:

OK parsed_css all removed now (minus 10 pages or so that have invalid JSON).

You stats are super interesting!

@tunetheweb https://github.com/tunetheweb is css-variables related to the parsed_css?

I think so. Maybe it makes sense to move that into the parsed_css table too and have a type column to differentiate them? The fact this is > 50% of the custom_metric would really help slim the custom_metrics down to a more reasonable size. Then maybe we could keep the rest in there? Or could move a few of the chunkier ones (responsive images, almanac...etc) to their own column and keep the rest as it.

— Reply to this email directly, view it on GitHub https://github.com/HTTPArchive/data-pipeline/issues/262#issuecomment-2347126314, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMOBL2MVXA25JC4ITVKCLZWHW4VAVCNFSM6AAAAABF6WLWK6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNBXGEZDMMZRGQ . You are receiving this because you were mentioned.Message ID: @.***>

tunetheweb commented 1 month ago

True dat! I don't know the CSS ones well enough but had a look at the responsive images one, and it doesn't look hat unreasonable. But it does stores quite a bit of meta data about every image on the page, and for pages with many images that adds up!:

            {
                "hasSrc": true,
                "hasAlt": true,
                "isInPicture": false,
                "hasCustomDataAttributes": false,
                "hasWidth": false,
                "hasHeight": false,
                "url": "http://www.turistickelisty.sportovnilisty.cz/img/_antialias_a8a44905-cc85-435b-bdb8-8a333c2a7498_acee3968cd95feb2c683ac90ecad8b1d.png",
                "totalCandidates": 1,
                "altAttribute": "",
                "clientWidth": 950,
                "clientHeight": 200,
                "naturalWidth": 950,
                "naturalHeight": 200,
                "hasSrcset": false,
                "hasSizes": false,
                "currentSrcDensity": 1,
                "approximateResourceWidth": 950,
                "approximateResourceHeight": 200,
                "byteSize": 274637,
                "bitsPerPixel": 11.563663157894737,
                "mimeType": null,
                "computedSizingStyles": {
                    "width": "auto",
                    "height": "auto",
                    "maxWidth": "100%",
                    "maxHeight": "none",
                    "minWidth": "auto",
                    "minHeight": "auto"
                },
                "intrinsicOrExtrinsicSizing": {
                    "width": "both",
                    "height": "intrinsic"
                },
                "reservedLayoutDimensions": false
            },
pmeenan commented 1 month ago

Heck, even shortening the keys (and maybe keep a comment in the custom metric code) could have a big impact. like "nW" for "naturalWidth", "aW" for "approximateResourceWidth", etc. Though to be backward compatible it would make the queries a bit more complicated

On Thu, Sep 12, 2024 at 5:18 PM Barry Pollard @.***> wrote:

True dat! I don't know the CSS ones well enough but had a look at the responsive images one, and it doesn't look hat unreasonable. But it does stores quite a bit of meta data about every image on the page, and for pages with many images that adds up!:

        {
            "hasSrc": true,
            "hasAlt": true,
            "isInPicture": false,
            "hasCustomDataAttributes": false,
            "hasWidth": false,
            "hasHeight": false,
            "url": "http://www.turistickelisty.sportovnilisty.cz/img/_antialias_a8a44905-cc85-435b-bdb8-8a333c2a7498_acee3968cd95feb2c683ac90ecad8b1d.png",
            "totalCandidates": 1,
            "altAttribute": "",
            "clientWidth": 950,
            "clientHeight": 200,
            "naturalWidth": 950,
            "naturalHeight": 200,
            "hasSrcset": false,
            "hasSizes": false,
            "currentSrcDensity": 1,
            "approximateResourceWidth": 950,
            "approximateResourceHeight": 200,
            "byteSize": 274637,
            "bitsPerPixel": 11.563663157894737,
            "mimeType": null,
            "computedSizingStyles": {
                "width": "auto",
                "height": "auto",
                "maxWidth": "100%",
                "maxHeight": "none",
                "minWidth": "auto",
                "minHeight": "auto"
            },
            "intrinsicOrExtrinsicSizing": {
                "width": "both",
                "height": "intrinsic"
            },
            "reservedLayoutDimensions": false
        },

— Reply to this email directly, view it on GitHub https://github.com/HTTPArchive/data-pipeline/issues/262#issuecomment-2347262131, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMOBKFDKQ7UJ7VFFEP2DDZWIAL7AVCNFSM6AAAAABF6WLWK6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNBXGI3DEMJTGE . You are receiving this because you were mentioned.Message ID: @.***>

pmeenan commented 1 month ago

I'd probably also want to make sure the URLs for things don't include full data URI's - maybe truncate anything data: to just that (or include a length).

On Thu, Sep 12, 2024 at 5:44 PM Patrick Meenan @.***> wrote:

Heck, even shortening the keys (and maybe keep a comment in the custom metric code) could have a big impact. like "nW" for "naturalWidth", "aW" for "approximateResourceWidth", etc. Though to be backward compatible it would make the queries a bit more complicated

On Thu, Sep 12, 2024 at 5:18 PM Barry Pollard @.***> wrote:

True dat! I don't know the CSS ones well enough but had a look at the responsive images one, and it doesn't look hat unreasonable. But it does stores quite a bit of meta data about every image on the page, and for pages with many images that adds up!:

        {
            "hasSrc": true,
            "hasAlt": true,
            "isInPicture": false,
            "hasCustomDataAttributes": false,
            "hasWidth": false,
            "hasHeight": false,
            "url": "http://www.turistickelisty.sportovnilisty.cz/img/_antialias_a8a44905-cc85-435b-bdb8-8a333c2a7498_acee3968cd95feb2c683ac90ecad8b1d.png",
            "totalCandidates": 1,
            "altAttribute": "",
            "clientWidth": 950,
            "clientHeight": 200,
            "naturalWidth": 950,
            "naturalHeight": 200,
            "hasSrcset": false,
            "hasSizes": false,
            "currentSrcDensity": 1,
            "approximateResourceWidth": 950,
            "approximateResourceHeight": 200,
            "byteSize": 274637,
            "bitsPerPixel": 11.563663157894737,
            "mimeType": null,
            "computedSizingStyles": {
                "width": "auto",
                "height": "auto",
                "maxWidth": "100%",
                "maxHeight": "none",
                "minWidth": "auto",
                "minHeight": "auto"
            },
            "intrinsicOrExtrinsicSizing": {
                "width": "both",
                "height": "intrinsic"
            },
            "reservedLayoutDimensions": false
        },

— Reply to this email directly, view it on GitHub https://github.com/HTTPArchive/data-pipeline/issues/262#issuecomment-2347262131, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMOBKFDKQ7UJ7VFFEP2DDZWIAL7AVCNFSM6AAAAABF6WLWK6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNBXGI3DEMJTGE . You are receiving this because you were mentioned.Message ID: @.***>

tunetheweb commented 1 month ago

Heck, even shortening the keys (and maybe keep a comment in the custom metric code) could have a big impact. like "nW" for "naturalWidth", "aW" for "approximateResourceWidth", etc. Though to be backward compatible it would make the queries a bit more complicated

Yup! JSON is way too chatty. Now I know how to update JSON we could look at shorting these down. Let's wait until Web Almanac calms down and do that.

I'd probably also want to make sure the URLs for things don't include full data URI's - maybe truncate anything data: to just that (or include a length).

Great shout. I checked and they ARE included. Fix in https://github.com/HTTPArchive/custom-metrics/pull/143

max-ostapenko commented 1 month ago

Also worth revisiting the metrics themselves to be certain they need all that data.

Yes, we should be able to identify and warn when storing such raw data to custom_metrics.

@tunetheweb Seems such data URLs are also a reason behind the huge size of css_variables metrics.

Example:

SELECT
  JSON_QUERY(custom_metrics, "$.css-variables.summary.--playkit-icon-quality-HD-active-url.set.0.value")
FROM `httparchive.sample_data.pages_10k`
WHERE
  client = 'mobile' AND
  page = "https://www.morgenpost.de/politik/article407241629/game-changer-fuer-die-ukraine-bahnt-sich-an-putin-droht.html"
max-ostapenko commented 1 month ago

After looking into frequency of custom_metrics usage, it doesn't seem as there is much usage outside of Web Almanac.

These are the most popular as of now:

  1. ecommerce
  2. almanac (https://github.com/HTTPArchive/custom-metrics/issues/88)
  3. element_count
  4. markup
  5. wpt_bodies
  6. performance
  7. cookies

Log analytics query:

SELECT  
  metrics,
  --ARRAY_AGG(DISTINCT project IGNORE NULLS) AS projects,
  ARRAY_AGG(DISTINCT principal_email IGNORE NULLS) AS accounts,
  COUNT(DISTINCT insert_id) AS cnt
FROM (
  SELECT
    JSON_VALUE(COALESCE(proto_payload.audit_log.service_data.jobCompletedEvent.job.jobName.projectId, proto_payload.audit_log.service_data.jobInsertResponse.resource.jobName.projectId, proto_payload.audit_log.service_data.jobGetQueryResultsResponse.job.jobName.projectId)) AS project,
    proto_payload.audit_log.authentication_info.principal_email AS principal_email,
    insert_id,
    proto_payload
  FROM `httparchive.global._Default._Default`
  WHERE
    proto_payload.audit_log.service_name="bigquery.googleapis.com"
),
 UNNEST(REGEXP_EXTRACT_ALL(JSON_VALUE(proto_payload.audit_log.service_data.jobInsertRequest.resource.jobConfiguration.query.query), r"\(custom_metrics\,\ '\$\.([a-z0-9_]+)")) AS metrics
WHERE
   REGEXP_CONTAINS(JSON_VALUE(proto_payload.audit_log.service_data.jobInsertRequest.resource.jobConfiguration.query.query), "custom_metrics")
GROUP BY 1
ORDER BY cnt DESC

And we know external analysts who use custom metrics, but don't know which metrics exactly:

SELECT  
  project,
  principal_email,
  COUNT(DISTINCT insert_id) AS cnt
FROM (
  SELECT
    COALESCE(JSON_VALUE(COALESCE(proto_payload.audit_log.service_data.jobCompletedEvent.job.jobName.projectId, proto_payload.audit_log.service_data.jobInsertResponse.resource.jobName.projectId, proto_payload.audit_log.service_data.jobGetQueryResultsResponse.job.jobName.projectId)), REGEXP_EXTRACT(JSON_VALUE(proto_payload.audit_log.metadata.tableDataRead.jobName), '^projects/([^/]+)/jobs/')) AS project,
    proto_payload.audit_log.authentication_info.principal_email AS principal_email,
    insert_id,
    proto_payload
  FROM `httparchive.global._Default._Default`
  WHERE
    proto_payload.audit_log.service_name="bigquery.googleapis.com"
)
WHERE
  "custom_metrics" IN UNNEST(JSON_VALUE_ARRAY(proto_payload.audit_log.metadata.tableDataRead.fields)) AND
  project != 'httparchive'
GROUP BY 1,2
ORDER BY cnt DESC

Updated insights spreadsheet We can get down to 30% of column size if we review the metrics themselves (~900Gb for is_root_page AND client = 'mobile').

@rviscomi why did you consider javascript, media, and performance as 'core' and ready to be split?

eeeps commented 1 month ago

As long as we're talking about the responsive images metrics specifically - IIRC there is some overlap between Images.js, media.js, responsive-images.js, and img-loading-atrr.js. I have thought about trying to consolidate but was worried about disrupting analysis over time.

max-ostapenko commented 1 month ago

@eeeps if you are working with images metrics data and have enhancement suggestions - it may be a good time to look at it now. We will be processing historical data to update the table schema, so metrics could be consolidated retrospectively.