I'm replacing the `requests` table with the contents of `requests3`, which was the third and most accurate representation of the summary+JSON data. I'm also including a new field, `requestId`, which is extracted from the JSON. There are 11 queries from 2019 that reference `requests3`; these should all be updated to refer to `requests` instead.
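For illustration, updating one of those 2019 queries is mostly a matter of swapping the table reference. A minimal sketch, assuming the `httparchive.almanac` dataset path, illustrative column names, and a `date` filter for the 2019 edition (none of which are confirmed here):

```sql
-- Before: a 2019 query referencing the old table name.
-- SELECT page, url FROM `httparchive.almanac.requests3`

-- After: point at `requests` instead. A date filter may be needed once the
-- table also carries the 2020 edition (column name and crawl date assumed).
SELECT
  page,
  url
FROM `httparchive.almanac.requests`
WHERE date = DATE '2019-07-01'
```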
I plan to regenerate the `response_bodies` tables for the 2020_08_01 crawl with a change to the Dataflow pipeline so that the `requestId` from the request is included with the response bodies. This will make it possible to accurately join requests with response bodies for the `summary_response_bodies` table; otherwise it's not possible to disambiguate responses to repeated/redundant requests.
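Once `requestId` is on both sides, the join is unambiguous even when the same URL is requested more than once. A minimal sketch of the kind of query this would enable, where everything except the `requestId` idea (table paths, column names, crawl dates) is an assumption about the eventual schema:

```sql
-- Hedged sketch: join request metadata to response bodies on requestId so
-- repeated/redundant requests for the same URL map to the right body.
SELECT
  r.page,
  r.url,
  r.requestId,
  b.body
FROM `httparchive.almanac.requests` AS r
JOIN `httparchive.response_bodies.2020_08_01_desktop` AS b
  ON r.page = b.page
 AND r.requestId = b.requestId
WHERE r.date = DATE '2020-08-01'
```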
And there are a couple of convenience tables that may or may not need to be updated, depending on 2020 usage:

- [ ] `manifests`
- [ ] `service_workers`
Was looking over the 2019 PWA queries and think it's highly likely this data will be asked for again in 2020, as these tables form the basis of most of the 2019 queries. I gather they are just subsets of the `request_bodies` tables, made to allow cheaper querying?
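If that's right, a convenience table like `manifests` would simply be a filtered, pre-materialized slice of the bodies data. A purely hypothetical sketch, with the source table, URL heuristic, and column names all assumed rather than taken from the actual pipeline:

```sql
-- Hypothetical sketch: materialize a small subset of response bodies so the
-- PWA queries don't have to scan the full bodies table.
CREATE OR REPLACE TABLE `httparchive.almanac.manifests` AS
SELECT
  page,
  url,
  body
FROM `httparchive.almanac.summary_response_bodies`
WHERE LOWER(url) LIKE '%manifest.json%'
   OR LOWER(url) LIKE '%.webmanifest%';
```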
The PWA authors confirmed they do want the same stats as last year, so let us know if it's possible to date-stamp those two tables and add this year's stats.
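Date-stamping could plausibly be done in place before appending the 2020 rows. A minimal sketch, assuming a `date` column and the 2019 crawl date (both assumptions, not an agreed schema):

```sql
-- Hedged sketch: add a date column to an existing convenience table and
-- back-fill the current rows as the 2019 edition so 2020 rows can be
-- appended alongside them.
ALTER TABLE `httparchive.almanac.manifests`
ADD COLUMN IF NOT EXISTS date DATE;

UPDATE `httparchive.almanac.manifests`
SET date = DATE '2019-07-01'
WHERE date IS NULL;
```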
Also see this comment:
> @rviscomi would it be possible to create an `almanac.response_bodies_scripts` table of just the initial HTML (in case of inline …
The 2020_08_01 HTTP Archive crawl completed yesterday and the tables are available on BigQuery. However, to facilitate Web Almanac analysis, we reorganize the data into the `almanac` dataset to make the results more efficient to query. @paulcalvano and I will be prepping this dataset with the 2020 results. The existing tables already contain 2019 data and they do not necessarily make that clear. We should continue to retain the 2019 data and alter the table schemas to add a new `date` field to distinguish the annual editions.
- [ ] `parsed_css`: add a `date` column
- [ ] `requests`: `summary_requests` metadata with `requests` payloads
- [ ] `summary_response_bodies`: `summary_requests` metadata with `response_bodies` blobs

There are also a couple of externally-sourced tables:

- `third_parties`
- `h2_prioritization_cdns` (a dated copy, `h2_prioritization_cdns_201909`, is in use by the 2019 HTTP/2 metric `20_07.sql`)

And there are a couple of convenience tables that may or may not need to be updated, depending on 2020 usage:

- [ ] `manifests`
- [ ] `service_workers`
I'd like to explore whether it's feasible to combine the request/response tables into a single table that contains the summary metadata, request payload, and response bodies. That way there would be no SQL joins for the analysts to contend with. The tables would be enormous, but AFAIK BigQuery only bills for the columns a query actually uses, so queries that don't require the bodies would be much cheaper. I'm not sure whether query performance would be worse, though.
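For what it's worth, the cost argument holds as long as queries project only the columns they need. A sketch against a purely hypothetical combined table (the table name and all columns are made up for illustration):

```sql
-- Hypothetical combined table: summary columns plus payload/body columns.
-- Because BigQuery's on-demand pricing bills for the columns a query reads,
-- this query never touches the large payload/body columns and stays cheap.
SELECT
  page,
  url,
  status
FROM `httparchive.almanac.requests_combined`  -- hypothetical name
WHERE date = DATE '2020-08-01'
  AND status = 200
```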