HTTPArchive / bigquery

BigQuery import and processing pipelines
67 stars 20 forks source link

Investigate 3x increase in response body rows as of 2021_07_01 #124

Closed rviscomi closed 3 years ago

rviscomi commented 3 years ago

Now that we've got the first response_bodies data in several months, it's strange to see a steep increase in the number of rows per table despite the table size (TB) not growing by as much: https://datastudio.google.com/u/0/reporting/1jh_ScPlCIbSYTf2r2Y6EftqmX9SQy4Gn/page/5ike

image

Investigate the cause of the increased rows and deduplicate if needed. This table will be used by the 2021 Web Almanac, so it's important to make sure it doesn't introduce any data errors.

A couple of theories to start on:

tunetheweb commented 3 years ago

Won't this process all rows but send null rows for the ones that don't match, since get_response_bodies_a returns null for half the rows (and similarly for get_response_bodies_b)?

https://github.com/HTTPArchive/bigquery/blob/9f5d97f1a435421fb930cb503989776a0b126998/dataflow/python/bigquery_import.py#L366-L380

Compare this for Lighthouse where it only runs for mobile:

https://github.com/HTTPArchive/bigquery/blob/9f5d97f1a435421fb930cb503989776a0b126998/dataflow/python/bigquery_import.py#L390-L398

tunetheweb commented 3 years ago

Actually I think something else is going on here. It looks like it used to only return text rows in this table, but now returns all rows.

For example this:

SELECT COUNT(1)
FROM `httparchive.response_bodies.2020_07_01_desktop`
WHERE url LIKE '%.jpg'

Returns 63,989 rows for 2020_07_01 (the 404s maybe?), and 78,429,806 for 2021_07_01.

Similarly for fonts:

SELECT COUNT(1)
FROM `httparchive.response_bodies.2020_07_01_desktop`
WHERE url LIKE '%.woff'

Also seem to be including the WOFF bodies (explaining the growth in TB?), but not the JPG? We shouldn't be including binary bodies at all.

rviscomi commented 3 years ago

Good find. For example in the response_bodies for almanac.httparchive.org I'm seeing URLs like https://almanac.httparchive.org/static/fonts/Lato-Bold.woff2 and https://almanac.httparchive.org/static/images/home-hero.png. @pmeenan is this a WPT bug?

pmeenan commented 3 years ago

Probably a Chrome change that changed WPT's text-only filtering. Looking now.

pmeenan commented 3 years ago

Hmm, I'm having trouble reproducing it with almanac.httparchive.org. Any chance I can get a few pages that included WOFF bodies?

Wonder if maybe there's some sort of interaction with WPT and some of the new custom metrics in case any of them are doing fetches (I'll triple-check to make sure WPT doesn't grab bodies outside of the actual test)

rviscomi commented 3 years ago
SELECT * FROM `httparchive.response_bodies.2021_07_01_desktop` WHERE page = 'https://almanac.httparchive.org/'

Be aware this processes 15 TB.

page url
https://almanac.httparchive.org/ https://almanac.httparchive.org/
https://almanac.httparchive.org/ https://almanac.httparchive.org/en/2020/
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/css/normalize.css?v=112272e51c80ffe5bd01becd2ce7d656
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/css/almanac.css?v=98a5dc5255545dca5c3f3826dd3567bd
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/css/index.css?v=dd16269b4c1c3af01e8b635f00c16f69
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/js/almanac.js?v=c8c72daa5804a31cf1a59084b8745691
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/js/web-vitals.js?v=12282d9f0577af53119267cc8d81dc64
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/js/send-web-vitals.js?v=f176ee2628e8a2a549a6f5f3e122ee22
https://almanac.httparchive.org/ https://www.googletagmanager.com/gtag/js?id=UA-22381566-3
https://almanac.httparchive.org/ https://www.google-analytics.com/plugins/ua/linkid.js
https://almanac.httparchive.org/ https://www.google-analytics.com/j/collect?v=1&_v=j91&a=536456327&t=pageview&_s=1&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&_u=aGBAAUIhAAAAAC~&jid=904247328&gjid=386119546&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&_r=1&gtm=2ou7e0&tc=x&z=1146007784
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/fonts/Poppins-Light.woff2
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/fonts/Lato-Regular.woff2
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/fonts/Poppins-Bold.woff2
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/fonts/Lato-Black.woff2
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/fonts/Lato-Bold.woff2
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/images/home-hero.png
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/css/page.css?v=ca124dcff4fd4fa1c5edb0e23b77f839
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/images/character-markup.png
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/images/character-star.png
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/images/character-hat.png
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/fonts/Lato-Italic.woff2
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&ni=1&_s=2&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=Web%20Vitals&ea=FCP&el=v2-1626329306589-8113144612587&ev=1022&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=1268143968
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&_s=3&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=user&ea=print-mode&el=false&ev=0&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=1992814082
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&_s=4&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=user&ea=min-sheets-width&el=true&ev=1&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=1661527953
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&_s=5&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=user&ea=data-saver&el=not-enabled&ev=0&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=1725817895
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&_s=6&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=user&ea=connection-type&el=4g&ev=1&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=914507272
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&_s=7&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=user&ea=hi-res-canvas&el=supported&ev=1&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=1818432849
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&ni=1&_s=8&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=Web%20Vitals&ea=TTFB&el=v2-1626329306590-9863686315487&ev=633&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=1635327878
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/images/favicon.ico
https://almanac.httparchive.org/ https://www.google-analytics.com/analytics.js
pmeenan commented 3 years ago

Sorry, I meant pages other than the almanac that included woff or jpeg bodies. I'll see if I can write up a query.

tunetheweb commented 3 years ago

This query returned WOFF fonts with bodies:

SELECT *
FROM `httparchive.response_bodies.2021_07_01_desktop`
WHERE url LIKE '%.woff'

Can add a AND body IS NOT NULL at end if you want.

Weirdly when I ran the same for .jpg I got rows (which I shouldn’t) but the body column was empty (which is good at least), while for .woff it looked like binary WOFF data was in the body column.

tunetheweb commented 3 years ago

Here's an example: https://webpagetest.httparchive.org/result/210718_Dx12_23SR/1/details/#waterfall_view_step1

image

Also not repeatable in regular WPT, but then again only one of the fonts was captured so it could be intermittent? Then again, the same font body was also captured for Mobile HTTP Archive run: https://webpagetest.httparchive.org/result/210715_MxAT_N42X/1/details/#waterfall_view_step1 (request 49). Interestingly the waterfall is completely different between desktop and mobile but we still saw the issue.

rviscomi commented 3 years ago

I think I understand the difference. In the old Java pipeline it omitted responses that had no body:

https://github.com/HTTPArchive/bigquery/blob/9f5d97f1a435421fb930cb503989776a0b126998/dataflow/java/src/main/java/com/httparchive/dataflow/BigQueryImport.java#L329

In the new Python pipeline, anything without a body defaults to the empty string:

https://github.com/HTTPArchive/bigquery/blob/9f5d97f1a435421fb930cb503989776a0b126998/dataflow/python/bigquery_import.py#L163

So a potential fix would be something like this:

    body = request.get('response').get('content').get('text', None)

    if body == None:
      continue

We could clean up the BQ tables by deleting any row that has body='' although that might delete legitimate response bodies that exist but are empty.

pmeenan commented 3 years ago

Strange, the font example above actually comes back from chrome as a utf8 string and there is no content type on the response. I can exclude it by extension but I think a better way may be to use the 'sec-fetch-dest' request header to not store anything that is requested as a font, image, video, etc

pmeenan commented 3 years ago

Just rolled out the filtering to use the Sec-Fetch-Dest request header as an additional filter to keep images, fonts and video data out of the bodies.

rviscomi commented 3 years ago

Regenerating the July 2021 tables using the new pipeline code. The mobile table is running now and will be ready in ~17 hours. The desktop table will be another day.