HTTPArchive / httparchive.org

The HTTP Archive website hosted on App Engine
https://httparchive.org
Apache License 2.0
332 stars 42 forks source link

The massive `parsed_css` custom metric is being included in the pages table #923

Closed tunetheweb closed 3 weeks ago

tunetheweb commented 3 weeks ago

Looks like this started happening from June:

This is 8.32TB:

SELECT page, custom_metrics FROM `httparchive.all.pages` WHERE date = '2024-05-01'

This is 63.29TB!!!!!!!:

SELECT page, custom_metrics FROM `httparchive.all.pages` WHERE date = '2024-06-01'

Will delete from June, July and August, but can you have a look at preventing this happening @pmeenan ?

pmeenan commented 3 weeks ago

Ahh. Looks like the code was stripping it out of "payload" (the raw full JSON) but not from the extracted custom_metrics. Cleaning it up now.

tunetheweb commented 3 weeks ago

Running this to clean it up:

UPDATE
  `httparchive.all.pages`
SET
  custom_metrics = TO_JSON_STRING(JSON_REMOVE(SAFE.PARSE_JSON(custom_metrics),'$.parsed_css'))
WHERE
  date = '2024-08-01' AND
  custom_metrics IS NOT NULL AND
  SAFE.PARSE_JSON(custom_metrics) IS NOT NULL;

UPDATE
  `httparchive.all.pages`
SET
  custom_metrics = TO_JSON_STRING(JSON_REMOVE(SAFE.PARSE_JSON(custom_metrics),'$.parsed_css'))
WHERE
  date = '2024-07-01' AND
  custom_metrics IS NOT NULL AND
  SAFE.PARSE_JSON(custom_metrics) IS NOT NULL;

UPDATE
  `httparchive.all.pages`
SET
  custom_metrics = TO_JSON_STRING(JSON_REMOVE(SAFE.PARSE_JSON(custom_metrics),'$.parsed_css'))
WHERE
  date = '2024-06-01' AND
  custom_metrics IS NOT NULL AND
  SAFE.PARSE_JSON(custom_metrics) IS NOT NULL;
tunetheweb commented 4 days ago

Turns out PARSE_JSON is a bit limited in BigQuery so had to do this via JavaScript instead (and repeat for July and June):

CREATE TEMP FUNCTION remove_parsed_css(cm STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    cm = JSON.parse(cm);
    delete cm.parsed_css;
    return JSON.stringify(cm);
  } catch {
    return ''
  }
""";

UPDATE
  `httparchive.all.pages`
SET
  custom_metrics = remove_parsed_css(custom_metrics)
WHERE
  date = '2024-08-01' AND
  custom_metrics LIKE '%parsed_css%' AND
  remove_parsed_css(custom_metrics) != '' AND
  client = 'desktop' AND
  is_root_page = false;

UPDATE
  `httparchive.all.pages`
SET
  custom_metrics = remove_parsed_css(custom_metrics)
WHERE
  date = '2024-08-01' AND
  custom_metrics LIKE '%parsed_css%' AND
  remove_parsed_css(custom_metrics) != '' AND
  client = 'desktop' AND
  is_root_page;

UPDATE
  `httparchive.all.pages`
SET
  custom_metrics = remove_parsed_css(custom_metrics)
WHERE
  date = '2024-08-01' AND
  custom_metrics LIKE '%parsed_css%' AND
  remove_parsed_css(custom_metrics) != '' AND
  client = 'mobile' AND
  is_root_page = false;

UPDATE
  `httparchive.all.pages`
SET
  custom_metrics = remove_parsed_css(custom_metrics)
WHERE
  date = '2024-08-01' AND
  custom_metrics LIKE '%parsed_css%' AND
  remove_parsed_css(custom_metrics) != '' AND
  client = 'mobile' AND
  is_root_page;