Won't this process all rows but send null rows for the ones that don't match, since `get_response_bodies_a` returns null for half the rows (and similarly for `get_response_bodies_b`)?
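For illustration, here's a minimal Beam-style sketch of that concern, assuming the pipeline maps each HAR through those functions; `get_response_bodies_a` below is a hypothetical stand-in for the real function, and the `Filter` step is the kind of guard that would drop the null rows:

```python
import apache_beam as beam

def get_response_bodies_a(har):
    # Hypothetical stand-in: return a row for matching entries, None otherwise.
    if har.get('client') == 'a':
        return {'url': har['url'], 'body': har['body']}
    return None

with beam.Pipeline() as p:
    rows = (
        p
        | beam.Create([
            {'client': 'a', 'url': 'https://example.com/', 'body': '<html>'},
            {'client': 'b', 'url': 'https://example.org/', 'body': '<html>'},
        ])
        # A plain Map forwards the Nones, so half the output rows are null...
        | beam.Map(get_response_bodies_a)
        # ...unless they are explicitly dropped before writing to BigQuery.
        | beam.Filter(lambda row: row is not None)
    )
    rows | beam.Map(print)
```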
Compare this to Lighthouse, where it only runs for mobile:
Actually I think something else is going on here. It looks like it used to only return text rows in this table, but now returns all rows.
For example, this:

```sql
SELECT COUNT(1)
FROM `httparchive.response_bodies.2020_07_01_desktop`
WHERE url LIKE '%.jpg'
```
Returns 63,989 rows for 2020_07_01 (the 404s maybe?), and 78,429,806 for 2021_07_01.
Similarly for fonts:

```sql
SELECT COUNT(1)
FROM `httparchive.response_bodies.2020_07_01_desktop`
WHERE url LIKE '%.woff'
```
The 2021 tables also seem to include the WOFF bodies (explaining the growth in TB?), but not the JPG ones? We shouldn't be including binary bodies at all.
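To verify which extensions actually carry bodies, a diagnostic along these lines could work (my sketch, not from the thread; note that referencing `body` scans the multi-TB column, so it's expensive):

```python
from google.cloud import bigquery

# Count rows and non-empty bodies per file extension for the suspect types.
# Warning: referencing `body` scans the multi-TB column, so this is costly.
client = bigquery.Client()
sql = r"""
SELECT
  REGEXP_EXTRACT(url, r'\.(\w+)$') AS ext,
  COUNT(1) AS row_count,
  COUNTIF(body IS NOT NULL AND body != '') AS non_empty_bodies
FROM `httparchive.response_bodies.2021_07_01_desktop`
WHERE REGEXP_CONTAINS(url, r'\.(jpg|woff2?)$')
GROUP BY ext
ORDER BY row_count DESC
"""
for row in client.query(sql).result():
    print(row.ext, row.row_count, row.non_empty_bodies)
```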
Good find. For example, in the response_bodies for almanac.httparchive.org I'm seeing URLs like https://almanac.httparchive.org/static/fonts/Lato-Bold.woff2 and https://almanac.httparchive.org/static/images/home-hero.png. @pmeenan, is this a WPT bug?
Probably a Chrome change that affected WPT's text-only filtering. Looking now.
Hmm, I'm having trouble reproducing it with almanac.httparchive.org. Any chance I can get a few pages that included WOFF bodies?
I wonder if maybe there's some sort of interaction between WPT and some of the new custom metrics, in case any of them are doing fetches (I'll triple-check to make sure WPT doesn't grab bodies outside of the actual test).
```sql
SELECT *
FROM `httparchive.response_bodies.2021_07_01_desktop`
WHERE page = 'https://almanac.httparchive.org/'
```

Be aware this processes 15 TB.
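If you only need example URLs rather than the bodies themselves, a cheaper variant (my suggestion, not from the thread) is to avoid referencing the `body` column at all, since BigQuery bills by the columns a query scans:

```python
from google.cloud import bigquery

# Cheaper variant: `url` and `page` are tiny compared to `body`, so leaving
# `body` out of the query avoids scanning most of the 15 TB.
client = bigquery.Client()
sql = """
SELECT url
FROM `httparchive.response_bodies.2021_07_01_desktop`
WHERE page = 'https://almanac.httparchive.org/'
"""
urls = [row.url for row in client.query(sql).result()]
print(len(urls))
```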
Sorry, I meant pages other than the almanac that included woff or jpeg bodies. I'll see if I can write up a query.
This query returned WOFF fonts with bodies:

```sql
SELECT *
FROM `httparchive.response_bodies.2021_07_01_desktop`
WHERE url LIKE '%.woff'
```

You can add `AND body IS NOT NULL` at the end if you want.
Weirdly, when I ran the same for `.jpg` I got rows (which I shouldn't), but the `body` column was empty (which is good at least), while for `.woff` it looked like binary WOFF data was in the `body` column.
Here's an example: https://webpagetest.httparchive.org/result/210718_Dx12_23SR/1/details/#waterfall_view_step1
It's also not repeatable in regular WPT, but only one of the fonts was captured, so it could be intermittent? Then again, the same font body was also captured for the mobile HTTP Archive run: https://webpagetest.httparchive.org/result/210715_MxAT_N42X/1/details/#waterfall_view_step1 (request 49). Interestingly, the waterfall is completely different between desktop and mobile, but we still saw the issue.
I think I understand the difference. In the old Java pipeline it omitted responses that had no body:
In the new Python pipeline, anything without a body defaults to the empty string:
So a potential fix would be something like this:
```python
body = request.get('response', {}).get('content', {}).get('text')
if body is None:
    # Skip requests with no response body at all, matching the
    # old Java pipeline's behavior.
    continue
```
We could clean up the BQ tables by deleting any row that has `body = ''`, although that might delete legitimate response bodies that exist but are empty.
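As a sketch, that cleanup could look like the following (my illustration; the caveat above applies, since it would also drop legitimately empty bodies):

```python
from google.cloud import bigquery

# Sketch of the proposed cleanup: delete rows whose body is the empty
# string. Caveat: this also removes legitimately empty response bodies.
client = bigquery.Client()
sql = """
DELETE FROM `httparchive.response_bodies.2021_07_01_desktop`
WHERE body = ''
"""
client.query(sql).result()
```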
Strange: the font example above actually comes back from Chrome as a UTF-8 string, and there is no content type on the response. I can exclude it by extension, but I think a better way may be to use the `Sec-Fetch-Dest` request header to not store anything that is requested as a font, image, video, etc.
Just rolled out the filtering to use the `Sec-Fetch-Dest` request header as an additional filter to keep image, font, and video data out of the bodies.
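For illustration only (my sketch, not the actual WPT patch), the filter presumably works along these lines; the thread names font, image, and video, and `audio`/`track` are my guesses at the "etc":

```python
# Illustrative sketch, not the actual WPT code: skip body capture when the
# Sec-Fetch-Dest request header marks the resource as binary media.
# 'font', 'image', and 'video' come from the thread; 'audio' and 'track'
# are assumed members of the "etc".
BINARY_DESTS = {'font', 'image', 'video', 'audio', 'track'}

def should_store_body(request_headers):
    """Return True if this response body should be captured as text."""
    dest = request_headers.get('sec-fetch-dest', '').lower()
    return dest not in BINARY_DESTS

assert not should_store_body({'sec-fetch-dest': 'font'})
assert should_store_body({'sec-fetch-dest': 'document'})
```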
Regenerating the July 2021 tables using the new pipeline code. The mobile table is running now and will be ready in ~17 hours; the desktop table will take another day.
Now that we've got the first `response_bodies` data in several months, it's strange to see a steep increase in the number of rows per table despite the table size (TB) not growing by as much: https://datastudio.google.com/u/0/reporting/1jh_ScPlCIbSYTf2r2Y6EftqmX9SQy4Gn/page/5ike

Investigate the cause of the increased rows and deduplicate if needed. This table will be used by the 2021 Web Almanac, so it's important to make sure it doesn't introduce any data errors.
A couple of theories to start on: