Closed tunetheweb closed 3 weeks ago
Ahh. Looks like the code was stripping it out of "payload" (the raw full JSON) but not from the extracted custom_metrics. Cleaning it up now.
Running this to clean it up:
UPDATE
`httparchive.all.pages`
SET
custom_metrics = TO_JSON_STRING(JSON_REMOVE(SAFE.PARSE_JSON(custom_metrics),'$.parsed_css'))
WHERE
date = '2024-08-01' AND
custom_metrics IS NOT NULL AND
SAFE.PARSE_JSON(custom_metrics) IS NOT NULL;
UPDATE
`httparchive.all.pages`
SET
custom_metrics = TO_JSON_STRING(JSON_REMOVE(SAFE.PARSE_JSON(custom_metrics),'$.parsed_css'))
WHERE
date = '2024-07-01' AND
custom_metrics IS NOT NULL AND
SAFE.PARSE_JSON(custom_metrics) IS NOT NULL;
UPDATE
`httparchive.all.pages`
SET
custom_metrics = TO_JSON_STRING(JSON_REMOVE(SAFE.PARSE_JSON(custom_metrics),'$.parsed_css'))
WHERE
date = '2024-06-01' AND
custom_metrics IS NOT NULL AND
SAFE.PARSE_JSON(custom_metrics) IS NOT NULL;
Turns out PARSE_JSON
is a bit limited in BigQuery so had to do this via JavaScript instead (and repeat for July and June):
CREATE TEMP FUNCTION remove_parsed_css(cm STRING)
RETURNS STRING
LANGUAGE js AS """
try {
cm = JSON.parse(cm);
delete cm.parsed_css;
return JSON.stringify(cm);
} catch {
return ''
}
""";
UPDATE
`httparchive.all.pages`
SET
custom_metrics = remove_parsed_css(custom_metrics)
WHERE
date = '2024-08-01' AND
custom_metrics LIKE '%parsed_css%' AND
remove_parsed_css(custom_metrics) != '' AND
client = 'desktop' AND
is_root_page = false;
UPDATE
`httparchive.all.pages`
SET
custom_metrics = remove_parsed_css(custom_metrics)
WHERE
date = '2024-08-01' AND
custom_metrics LIKE '%parsed_css%' AND
remove_parsed_css(custom_metrics) != '' AND
client = 'desktop' AND
is_root_page;
UPDATE
`httparchive.all.pages`
SET
custom_metrics = remove_parsed_css(custom_metrics)
WHERE
date = '2024-08-01' AND
custom_metrics LIKE '%parsed_css%' AND
remove_parsed_css(custom_metrics) != '' AND
client = 'mobile' AND
is_root_page = false;
UPDATE
`httparchive.all.pages`
SET
custom_metrics = remove_parsed_css(custom_metrics)
WHERE
date = '2024-08-01' AND
custom_metrics LIKE '%parsed_css%' AND
remove_parsed_css(custom_metrics) != '' AND
client = 'mobile' AND
is_root_page;
Looks like this started happening from June:
This is 8.32TB:
This is 63.29TB!!!!!!!:
Will delete from June, July and August, but can you have a look at preventing this happening @pmeenan ?