HTTPArchive / httparchive.org

The HTTP Archive website hosted on App Engine
https://httparchive.org
Apache License 2.0
337 stars 43 forks source link

Queries over summary_requests dataset failing #203

Closed rviscomi closed 4 years ago

rviscomi commented 4 years ago

Queries that span multiple tables in the summary_requests dataset are failing with an error message like Cannot read field 'respCookieLen' of type NUMERIC as INT64. This is due to past backfill attempts that altered the schema of the data.

In order to ensure schema compatibility across tables, manually backfilled tables must be reprocessed to cast all values into compatible types.

rviscomi commented 4 years ago

The 2019_06_01_desktop table was a known issue so I've manually fixed that one. To help identify other tables with incompatible schemas, I ran this query:

SELECT
  table_name, column_name
FROM
  `httparchive.summary_requests.INFORMATION_SCHEMA.COLUMNS`
WHERE
  data_type = 'NUMERIC'
table_name column_name
2019_04_01_desktop startedDateTime
2019_04_01_desktop time
2019_04_01_desktop reqHeadersSize
2019_04_01_desktop reqBodySize
2019_04_01_desktop reqCookieLen
2019_04_01_desktop status
2019_04_01_desktop respHeadersSize
2019_04_01_desktop respBodySize
2019_04_01_desktop respSize
2019_04_01_desktop respCookieLen
2019_04_01_desktop expAge
2019_04_01_desktop crawlid
2019_03_01_desktop startedDateTime
2019_03_01_desktop time
2019_03_01_desktop reqHeadersSize
2019_03_01_desktop reqBodySize
2019_03_01_desktop reqCookieLen
2019_03_01_desktop status
2019_03_01_desktop respHeadersSize
2019_03_01_desktop respBodySize
2019_03_01_desktop respSize
2019_03_01_desktop respCookieLen
2019_03_01_desktop expAge
2019_03_01_desktop crawlid
2019_02_01_mobile startedDateTime
2019_02_01_mobile time
2019_02_01_mobile reqHeadersSize
2019_02_01_mobile reqBodySize
2019_02_01_mobile reqCookieLen
2019_02_01_mobile status
2019_02_01_mobile respHeadersSize
2019_02_01_mobile respBodySize
2019_02_01_mobile respSize
2019_02_01_mobile respCookieLen
2019_02_01_mobile expAge
2019_02_01_mobile crawlid

The three remaining tables in need of repair are:

rviscomi commented 4 years ago

To correct the schema, I saved the output of the following query to a temp table, deleted the original table, and renamed the temp table to the original table name.

Example for 2019_03_01_desktop:

SELECT
  CAST(null AS INT64) AS requestid,
  CAST(pageid AS INT64) AS pageid,
  CAST(startedDateTime AS INT64) AS startedDateTime,
  CAST(time AS INT64) AS time,
  method, url, urlShort, redirectUrl, firstReq, firstHtml, reqHttpVersion, 
  CAST(reqHeadersSize AS INT64) AS reqHeadersSize,
  CAST(reqBodySize AS INT64) AS reqBodySize,
  CAST(reqCookieLen AS INT64) AS reqCookieLen,
  reqOtherHeaders, 
  CAST(status AS INT64) AS status,
  respHttpVersion, 
  CAST(respHeadersSize AS INT64) AS respHeadersSize,
  CAST(respBodySize AS INT64) AS respBodySize,
  CAST(respSize AS INT64) AS respSize,
  CAST(respCookieLen AS INT64) AS respCookieLen,
  CAST(LEAST(expAge, 999999) AS INT64) AS expAge,
  mimeType, respOtherHeaders, req_accept, req_accept_charset, req_accept_encoding, req_accept_language, req_connection, req_host, req_if_modified_since, req_if_none_match, req_referer, req_user_agent, resp_accept_ranges, resp_age, resp_cache_control, resp_connection, resp_content_encoding, resp_content_language, resp_content_length, resp_content_location, resp_content_type, resp_date, resp_etag, resp_expires, resp_keep_alive, resp_last_modified, resp_location, resp_pragma, resp_server, resp_transfer_encoding, resp_vary, resp_via, resp_x_powered_by, _cdn_provider,
  CAST(_gzip_save AS INT64) AS _gzip_save,
  CAST(crawlid AS INT64) AS crawlid,
  type, ext, format
FROM
  `httparchive.summary_requests.2019_03_01_desktop`

Doing it this way preserves column ordering, which is necessary for UNION ALL queries.

Note the special case for expAge. Some of the numeric values were exceeding the max int value.

rviscomi commented 4 years ago

Here's an example query that processes the respCookieLen field across tables and completes successfully:

SELECT
  SUBSTR(_TABLE_SUFFIX, 0, 10) AS date,
  COUNT(DISTINCT NET.REG_DOMAIN(url)) AS domains
FROM
  `httparchive.summary_requests.*`
WHERE
  respCookieLen > 0
GROUP BY
  date
ORDER BY
  date DESC

(Warning this query processes 2.5TB as of now and will get even larger over time as more tables are added)

image

rviscomi commented 4 years ago

This fixes long-standing issues with the monthly generation of the latest summary_requests tables 🎉