Closed rviscomi closed 4 years ago
The 2019_06_01_desktop
table was a known issue so I've manually fixed that one. To help identify other tables with incompatible schemas, I ran this query:
SELECT
table_name, column_name
FROM
`httparchive.summary_requests.INFORMATION_SCHEMA.COLUMNS`
WHERE
data_type = 'NUMERIC'
table_name | column_name |
---|---|
2019_04_01_desktop | startedDateTime |
2019_04_01_desktop | time |
2019_04_01_desktop | reqHeadersSize |
2019_04_01_desktop | reqBodySize |
2019_04_01_desktop | reqCookieLen |
2019_04_01_desktop | status |
2019_04_01_desktop | respHeadersSize |
2019_04_01_desktop | respBodySize |
2019_04_01_desktop | respSize |
2019_04_01_desktop | respCookieLen |
2019_04_01_desktop | expAge |
2019_04_01_desktop | crawlid |
2019_03_01_desktop | startedDateTime |
2019_03_01_desktop | time |
2019_03_01_desktop | reqHeadersSize |
2019_03_01_desktop | reqBodySize |
2019_03_01_desktop | reqCookieLen |
2019_03_01_desktop | status |
2019_03_01_desktop | respHeadersSize |
2019_03_01_desktop | respBodySize |
2019_03_01_desktop | respSize |
2019_03_01_desktop | respCookieLen |
2019_03_01_desktop | expAge |
2019_03_01_desktop | crawlid |
2019_02_01_mobile | startedDateTime |
2019_02_01_mobile | time |
2019_02_01_mobile | reqHeadersSize |
2019_02_01_mobile | reqBodySize |
2019_02_01_mobile | reqCookieLen |
2019_02_01_mobile | status |
2019_02_01_mobile | respHeadersSize |
2019_02_01_mobile | respBodySize |
2019_02_01_mobile | respSize |
2019_02_01_mobile | respCookieLen |
2019_02_01_mobile | expAge |
2019_02_01_mobile | crawlid |
The three remaining tables in need of repair are:
2019_04_01_desktop
2019_03_01_desktop
2019_02_01_mobile
To correct the schema, I saved the output of the following query to a temp table, deleted the original table, and renamed the temp table to the original table name.
Example for 2019_03_01_desktop
:
SELECT
CAST(null AS INT64) AS requestid,
CAST(pageid AS INT64) AS pageid,
CAST(startedDateTime AS INT64) AS startedDateTime,
CAST(time AS INT64) AS time,
method, url, urlShort, redirectUrl, firstReq, firstHtml, reqHttpVersion,
CAST(reqHeadersSize AS INT64) AS reqHeadersSize,
CAST(reqBodySize AS INT64) AS reqBodySize,
CAST(reqCookieLen AS INT64) AS reqCookieLen,
reqOtherHeaders,
CAST(status AS INT64) AS status,
respHttpVersion,
CAST(respHeadersSize AS INT64) AS respHeadersSize,
CAST(respBodySize AS INT64) AS respBodySize,
CAST(respSize AS INT64) AS respSize,
CAST(respCookieLen AS INT64) AS respCookieLen,
CAST(LEAST(expAge, 999999) AS INT64) AS expAge,
mimeType, respOtherHeaders, req_accept, req_accept_charset, req_accept_encoding, req_accept_language, req_connection, req_host, req_if_modified_since, req_if_none_match, req_referer, req_user_agent, resp_accept_ranges, resp_age, resp_cache_control, resp_connection, resp_content_encoding, resp_content_language, resp_content_length, resp_content_location, resp_content_type, resp_date, resp_etag, resp_expires, resp_keep_alive, resp_last_modified, resp_location, resp_pragma, resp_server, resp_transfer_encoding, resp_vary, resp_via, resp_x_powered_by, _cdn_provider,
CAST(_gzip_save AS INT64) AS _gzip_save,
CAST(crawlid AS INT64) AS crawlid,
type, ext, format
FROM
`httparchive.summary_requests.2019_03_01_desktop`
Doing it this way preserves column ordering, which is necessary for UNION ALL
queries.
Note the special case for expAge
. Some of the numeric values were exceeding the max int value.
Here's an example query that processes the respCookieLen
field across tables and completes successfully:
SELECT
SUBSTR(_TABLE_SUFFIX, 0, 10) AS date,
COUNT(DISTINCT NET.REG_DOMAIN(url)) AS domains
FROM
`httparchive.summary_requests.*`
WHERE
respCookieLen > 0
GROUP BY
date
ORDER BY
date DESC
(Warning this query processes 2.5TB as of now and will get even larger over time as more tables are added)
This fixes long-standing issues with the monthly generation of the latest summary_requests tables 🎉
Queries that span multiple tables in the
summary_requests
dataset are failing with an error message likeCannot read field 'respCookieLen' of type NUMERIC as INT64
. This is due to past backfill attempts that altered the schema of the data.In order to ensure schema compatibility across tables, manually backfilled tables must be reprocessed to cast all values into compatible types.