HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0

Generate 2020 almanac tables on BigQuery #1258

Closed rviscomi closed 3 years ago

rviscomi commented 4 years ago

The 2020_08_01 HTTP Archive crawl completed yesterday and the tables are available on BigQuery. However, to facilitate Web Almanac analysis, we reorganize the data into the almanac dataset to make the results more efficient to query.

@paulcalvano and I will be prepping this dataset with the 2020 results. The existing tables already contain 2019 data and they do not necessarily make that clear. We should continue to retain the 2019 data and alter the table schemas to add a new date field to distinguish the annual editions.
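A minimal sketch of the schema change described above, using Python's built-in sqlite3 module as a local stand-in for BigQuery. The table name, column names, sample rows, and the 2019 crawl date are assumptions for illustration; the real almanac tables and the BigQuery statements needed to alter them may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for an existing almanac table that holds only 2019 rows.
conn.execute("CREATE TABLE requests (page TEXT, url TEXT)")
conn.execute(
    "INSERT INTO requests (page, url) VALUES (?, ?)",
    ("https://example.com/", "https://example.com/app.js"),
)

# Add a date column and backfill the existing rows with the 2019 edition's
# crawl date (the exact date used here is an assumption).
conn.execute("ALTER TABLE requests ADD COLUMN date TEXT")
conn.execute("UPDATE requests SET date = '2019-07-01' WHERE date IS NULL")

# Append rows from the 2020_08_01 crawl alongside the retained 2019 rows.
conn.execute(
    "INSERT INTO requests (page, url, date) VALUES (?, ?, ?)",
    ("https://example.com/", "https://example.com/app.v2.js", "2020-08-01"),
)

# Analysts can then select a single edition by filtering on the date column.
rows_2020 = conn.execute(
    "SELECT page, url FROM requests WHERE date = '2020-08-01'"
).fetchall()
rows_2019 = conn.execute(
    "SELECT page, url FROM requests WHERE date = '2019-07-01'"
).fetchall()
```

With a date column in place, 2019 queries would need a date filter added so they do not silently pick up 2020 rows.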

There are also a couple of externally-sourced tables:

And there are a couple of convenience tables that may or may not need to be updated, depending on 2020 usage:

  • [ ] manifests
  • [ ] service_workers

I'd like to explore whether it's feasible to combine the request/response tables into a single table that contains the summary metadata, request payload, and response bodies. That way there would be no SQL joins for the analysts to contend with. The table would be enormous, but AFAIK BigQuery only bills for the columns a query reads, so queries that don't need the bodies would be much cheaper than queries that do. I'm not sure whether query performance would be worse.
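To illustrate the cost reasoning, here is a rough sketch of a column-based billing model. It assumes, as the comment above does, that a query is charged only for the columns it reads. The column names and sizes are made up for illustration and are not measurements of any real table.

```python
# Hypothetical per-column storage sizes, in GB, for a combined table.
# These numbers are illustrative only, not real HTTP Archive figures.
COLUMN_SIZES_GB = {
    "url": 50,
    "summary": 400,
    "payload": 3000,
    "response_body": 20000,
}


def estimated_scan_gb(columns):
    """Estimate the data scanned as the total size of the referenced columns."""
    return sum(COLUMN_SIZES_GB[name] for name in columns)


# A metadata-only query never touches the large body column.
metadata_only = estimated_scan_gb(["url", "summary"])

# A query that inspects response bodies pays for that column as well.
with_bodies = estimated_scan_gb(["url", "response_body"])
```

Under this model the combined table costs no more than the separate tables for metadata-only queries, as long as analysts avoid selecting every column.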

rviscomi commented 4 years ago

I'm replacing the requests table with the contents of requests3, which was the third and most accurate representation of the summary+JSON data. I'm also including a new field requestId which is extracted from the JSON. There are 11 queries from 2019 that reference requests3. These should all be updated to refer to requests instead.
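As a rough sketch of what extracting requestId from the JSON could look like: the key name `_request_id` and the sample payload below are assumptions, since the thread does not say where in the JSON the ID lives.

```python
import json


def extract_request_id(payload_json):
    """Return the request ID stored in a request's JSON payload, or None."""
    payload = json.loads(payload_json)
    # `_request_id` is an assumed key name; adjust it to the real payload schema.
    return payload.get("_request_id")


# Hypothetical payload for a single request.
sample_payload = json.dumps(
    {"_request_id": "1234.5", "url": "https://example.com/app.js"}
)
request_id = extract_request_id(sample_payload)
```

Returning None for a missing key, rather than raising, keeps rows without an ID visible so they can be counted and investigated.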

I plan to regenerate the response_bodies tables for the 2020_08_01 crawl with a change to the Dataflow pipeline so that the requestId from the request is included with the response bodies. This will make it possible to accurately join requests with response bodies for the summary_response_bodies table. Otherwise it's not possible to disambiguate responses to repeated/redundant requests.
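The ambiguity can be reproduced on a small scale. The sketch below uses sqlite3 as a local stand-in for BigQuery, with hypothetical table layouts and rows: a page requests the same URL twice, so joining on URL pairs every request with every response, while joining on requestId yields exactly one body per request.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schemas and rows: one page requests the same URL twice.
conn.executescript("""
CREATE TABLE requests (page TEXT, url TEXT, requestId TEXT);
CREATE TABLE response_bodies (page TEXT, url TEXT, requestId TEXT, body TEXT);

INSERT INTO requests VALUES
  ('https://example.com/', 'https://example.com/api', 'r1'),
  ('https://example.com/', 'https://example.com/api', 'r2');

INSERT INTO response_bodies VALUES
  ('https://example.com/', 'https://example.com/api', 'r1', 'first'),
  ('https://example.com/', 'https://example.com/api', 'r2', 'second');
""")

# Joining on page + URL cannot tell the two requests apart, so each request
# matches both bodies.
by_url = conn.execute("""
    SELECT r.requestId, b.body
    FROM requests r
    JOIN response_bodies b ON r.page = b.page AND r.url = b.url
""").fetchall()

# Joining on page + requestId matches each request to its own body.
by_id = conn.execute("""
    SELECT r.requestId, b.body
    FROM requests r
    JOIN response_bodies b ON r.page = b.page AND r.requestId = b.requestId
    ORDER BY r.requestId
""").fetchall()
```

Here the URL join returns four rows for two requests, while the requestId join returns two.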

tunetheweb commented 4 years ago

And there are a couple of convenience tables that may or may not need to be updated, depending on 2020 usage:

  • [ ] manifests
  • [ ] service_workers

I was looking over the 2019 PWA queries and think it's highly likely this data will be requested again for 2020, as these two tables form the basis of most of the 2019 queries. I gather they are just subsets of the request_bodies tables, made to allow cheaper querying?

tunetheweb commented 4 years ago

The PWA authors confirmed they do want the same stats as last year, so let us know if it's possible to date-stamp those two tables and add this year's stats.

Also see this comment:

@rviscomi would it be possible to create an almanac.response_bodies_scripts table of just the initial HTML (in case of inline
