Generate 2020 almanac tables on BigQuery

rviscomi commented 4 years ago

The 2020_08_01 HTTP Archive crawl completed yesterday and the tables are available on BigQuery. However, to facilitate Web Almanac analysis, we reorganize the data into the almanac dataset to make the results more efficient to query.

@paulcalvano and I will be prepping this dataset with the 2020 results. The existing tables already contain 2019 data and they do not necessarily make that clear. We should continue to retain the 2019 data and alter the table schemas to add a new date field to distinguish the annual editions.
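
A minimal sketch of the date-field idea, using Python's built-in sqlite3 as a stand-in for BigQuery. The column names (`url`, `date`), the sample rows, and the 2019 date value are assumptions for illustration only; the real almanac schemas and BigQuery DDL will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for an existing almanac table that holds only 2019 rows.
conn.execute("CREATE TABLE requests (url TEXT)")
conn.execute("INSERT INTO requests (url) VALUES ('https://example.com/a.js')")

# Add the new date field, then backfill existing rows as the 2019 edition.
# The date value used for 2019 is an assumption.
conn.execute("ALTER TABLE requests ADD COLUMN date TEXT")
conn.execute("UPDATE requests SET date = '2019-07-01' WHERE date IS NULL")

# Append 2020 rows alongside the retained 2019 data.
conn.execute(
    "INSERT INTO requests (url, date) "
    "VALUES ('https://example.com/b.js', '2020-08-01')"
)

# Analysts filter on date to select an annual edition.
rows_2020 = conn.execute(
    "SELECT url FROM requests WHERE date = '2020-08-01'"
).fetchall()
counts_by_date = dict(
    conn.execute("SELECT date, COUNT(*) FROM requests GROUP BY date").fetchall()
)
```

In this sketch the 2019 rows stay in place and every query states which edition it wants through the `date` filter.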

[x] parsed_css
- Use Rework CSS to parse CSS bodies and save resulting JSON to the table
- Note: this table already contains a date column
- See this thread for more context behind the process to generate this table
- Parse both stylesheets and inline style blocks
[x] requests
- Combination of summary_requests metadata with requests payloads
- See https://github.com/HTTPArchive/almanac.httparchive.org/issues/180 for more context behind joining the tables
[x] summary_response_bodies
- Combination of summary_requests metadata with response_bodies blobs

There are also a couple of externally-sourced tables:

[x] third_parties
- See https://github.com/HTTPArchive/almanac.httparchive.org/issues/1061 for more context on how this data is sourced
[x] h2_prioritization_cdns
- The table is currently named h2_prioritization_cdns_201909 and is in use by the 2019 HTTP/2 metric 20_07.sql
- Not sure if this data will be useful again this year; if so, @paulcalvano may know how to regenerate it with the latest data
- The table schema should be altered to be partitioned by a date field so we can distinguish between annual versions

And there are a couple of convenience tables that may or may not need to be updated, depending on 2020 usage:

[x] manifests
[x] service_workers

I'd like to explore whether it's feasible to combine the request/response tables into a single table that contains the summary metadata, request payload, and response bodies. That way there would be no SQL joins to contend with for the analysts. The tables would be enormous but AFAIK BigQuery only bills for the columns used, so queries that don't require the bodies would be much cheaper. Not sure if performance is worse.

rviscomi commented 4 years ago

I'm replacing the requests table with the contents of requests3, which was the third and most accurate representation of the summary+JSON data. I'm also including a new field requestId which is extracted from the JSON. There are 11 queries from 2019 that reference requests3. These should all be updated to refer to requests instead.

I plan to regenerate the response_bodies tables for the 2020_08_01 crawl with a change to the Dataflow pipeline so that the requestId from the request is included with the response bodies. This will make it possible to accurately join requests with response bodies for the summary_response_bodies table. Otherwise it's not possible to disambiguate responses to repeated/redundant requests.
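
To illustrate why the requestId matters, here is a rough sketch using Python's built-in sqlite3 rather than BigQuery. The table layouts, the `page` and `body` columns, and the sample requestId values are hypothetical; only the `requestId` field name comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE requests (page TEXT, url TEXT, requestId TEXT);
CREATE TABLE response_bodies (page TEXT, url TEXT, requestId TEXT, body TEXT);
-- The same URL is requested twice on one page.
INSERT INTO requests VALUES
  ('https://example.com/', 'https://example.com/api', '1.1'),
  ('https://example.com/', 'https://example.com/api', '1.2');
INSERT INTO response_bodies VALUES
  ('https://example.com/', 'https://example.com/api', '1.1', 'first'),
  ('https://example.com/', 'https://example.com/api', '1.2', 'second');
""")

# Joining on page + url alone pairs every request with every response
# for that URL, so repeated requests cannot be told apart.
by_url = conn.execute("""
    SELECT r.requestId, b.body
    FROM requests r JOIN response_bodies b USING (page, url)
""").fetchall()

# Joining on page + requestId matches each request to its own response.
by_request_id = conn.execute("""
    SELECT r.requestId, b.body
    FROM requests r JOIN response_bodies b USING (page, requestId)
    ORDER BY r.requestId
""").fetchall()
```

With two identical requests, the URL-based join yields four rows while the requestId-based join yields two, one per request.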

tunetheweb commented 4 years ago

And there are a couple of convenience tables that may or may not need to be updated, depending on 2020 usage:

[ ] manifests

[ ] service_workers

I was looking over the 2019 PWA queries and think it's highly likely this data will be requested again for 2020, as these tables form the basis of most of the 2019 queries. I gather they are just subsets of the response_bodies tables, made to allow cheaper querying?

tunetheweb commented 4 years ago

The PWA authors confirmed they do want the same stats as last year, so let us know if it's possible to date-stamp those two tables and add this year's stats.

Also see this comment:

@rviscomi would it be possible to create an almanac.response_bodies_scripts table of just the initial HTML (in case of inline…

HTTPArchive / almanac.httparchive.org

Generate 2020 almanac tables on BigQuery #1258