Closed foxdavidj closed 3 years ago
Question about what this chapter should be called given that HTTP/3 is now here (even if not quite officially signed off yet). Stick with HTTP/2? Change to HTTP? Should we rename the 2019 chapter (with redirect obviously) or leave as is? Probably best to wait until we've got an author/authors to let them help decide.
That's a great question, I'm not sure what the best name is. Agreed, let's see where the content planning takes us and keep the option open to rename the chapter if needed.
Safe to add you as a reviewer for this chapter, Barry?
My thoughts: frame it as HTTP, and break it down to an acceptable level - accepting that it gets very complex quickly if you look at HTTP/1.1 vs HTTP/2 (streams, prioritization, etc) vs HTTP/3 (QUIC transport, etc).
A lot of the HTTP semantics users & web developers interact with are consistent across versions, and we should start there, with sub-sections for what HTTP/2 and HTTP/3 bring (and why they exist).
Agree with that.
Still considering what to do on last year's chapter. Do we rename it? Gut feel is no, as it was very HTTP/2 focused (with a quick dip into HTTP/3 at the end), even if that does lead to a slight inconsistency in the naming across years.
I spent a good part of last year's HTTP/2 chapter giving the basics as I still think this is a fairly new (even if it was approaching its 5-year anniversary back then) and little-understood technology. I think it would be good to have a similar intro to HTTP/3 this year, and perhaps less on HTTP/2 (we can refer back to the previous year's chapter for that).
However, the main point of the Almanac IMHO is not to act as a reference for the technology (though some background is good, and necessary), but to look at its usage through the HTTP Archive and help explain that to readers. So we need to be conscious not to spend too much time on background/theory. I may have overdone it last year but, as I say, I think it was needed more so than for other chapters given how new the technology is and how niche the expertise is. And given HTTP/3 is even newer, maybe that need is still there this year?
Saying all that, I'm struggling to think what new stats to query for this chapter. But we'll worry about that once we've got authors and reviewers!
And on that subject I'm definitely up for reviewing this year. Can author too if we get really stuck but would prefer to hear from someone new if anyone volunteers! Either way, I'm definitely interested in following how this chapter progresses and will help in any way I can.
Hello everyone. I'd like to again be a reviewer for this chapter this year. I could also contribute text on HTTP/3 and QUIC concepts if we go that route.
My 2 cents would be that the almanac should indeed focus more on the practical use of the tech seen over the past year, as measured by the HTTP archive runs. From that perspective, there won't be much to discuss on HTTP/3 yet, as few servers and browsers offer it and it's not ready for prime time (though, by the end of the year, it might be a bit more widespread).
For this year, you could look at how many sites offer H3 by looking at the alt-svc headers though. You could also look at TLS 1.3 adoption for H2, as this is kind of related to QUIC (or at least could give an indication of how up-to-date backends are). You could also research coalescing (or at least certificate contents) a bit more, as this will stay highly relevant for QUIC (and 0-RTT!) as well (maybe Matt Hobbs could help with that? Given his in-depth waterfall discussion blog posts on this). Finally, an idea of the measured RTTs to the backends would be useful, as that's where QUIC/H3 will provide most benefits.
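To make the alt-svc idea concrete, here's a quick sketch of a check for sites advertising HTTP/3. The `Alt-Svc` header name and the `h3`/`h3-NN` draft tokens are real (RFC 7838 and the QUIC drafts); the regex only handles the common `token=value` shape.

```python
import re

def advertises_h3(alt_svc_value):
    """True if an Alt-Svc value offers any h3 version token (h3, h3-29, ...)."""
    if not alt_svc_value:
        return False
    # Values look like: h3-29=":443"; ma=86400, h2=":443"
    protocols = re.findall(r'([A-Za-z0-9-]+)=', alt_svc_value)
    return any(p == 'h3' or p.startswith('h3-') for p in protocols)
```

Running this over the crawl's response headers would give a rough count of H3-capable origins before any client actually negotiates QUIC.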
To actually test H3 down the line, the HTTP archive runs would have to be adapted to also try a (secondary) load over QUIC after the normal H2 (H1?) connection, which might be something to think about @rviscomi (and probably also needs support from @pmeenan, who's been talking about this on twitter a bit as well).
Hello everyone. I'd like to again be a reviewer for this chapter this year. I could also contribute text on HTTP/3 and QUIC concepts if we go that route.
Great stuff!
My 2 cents would be that the almanac should indeed focus more on the practical use of the tech seen over the past year, as measured by the HTTP archive runs. From that perspective, there won't be much to discuss on HTTP/3 yet, as few servers and browsers offer it and it's not ready for prime time (though, by the end of the year, it might be a bit more widespread).
Think you'd be surprised with CDNs starting to offer it. It does seem to be growing. Especially if you include gQUIC.
For this year, you could look at how many sites offer H3 by looking at the alt-svc headers though. You could also look at TLS 1.3 adoption for H2, as this is kind of related to QUIC (or at least could give an indication of how up-to-date backends are). You could also research coalescing (or at least certificate contents) a bit more, as this will stay highly relevant for QUIC (and 0-RTT!) as well
Yeah those are the sorts of things I tried to look at last year too. The new author would be well advised to look at the metrics we settled on last year and the discussions around that (#22 )
(maybe Matt Hobbs could help with that? Given his in-depth waterfall discussion blog posts on this).
Ping @nooshu
Finally, an idea of the measured RTTs to the backends would be useful, as that's where QUIC/H3 will provide most benefits.
To actually test H3 down the line, the HTTP archive runs would have to be adapted to also try a (secondary) load over QUIC after the normal H2 (H1?) connection, which might be something to think about @rviscomi (and probably also needs support from @pmeenan, who's been talking about this on twitter a bit as well).
Reminds me of this discussion on trying to measure impact of HTTP/2
I should be able to review this chapter.
I'm happy to help review this chapter.
(maybe Matt Hobbs could help with that? Given his in-depth waterfall discussion blog posts on this).
Thanks @rmarx, I'd be happy to help.
Some more thoughts on this chapter:
Last year we concentrated on HTTP/2, with a bit of a mention of HTTP/3. Probably should talk a lot about HTTP/3 this year even if usage might be low.
However last year I almost completely ignored the whole topic of the underlying HTTP semantics. Should we add some more of that this year?
For example, how many HTTP Headers are sent? And what size are they? What's the size of headers compared to bodies on requests and responses? Some headers (e.g. CSP) can be quite large and we're adding new headers like feature-policy and with structured headers this could grow over time. This is of course another benefit of HTTP/2 and HTTP/3 as it has header compression.
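The header-vs-body comparison could be sketched straight from HAR data. The `headersSize`/`bodySize` field names are from the HAR 1.2 format (where -1 means "unknown"); the file path is a placeholder.

```python
import json

def header_vs_body(har_path):
    """Return (url, header_bytes, body_bytes) for each response in a HAR file,
    skipping entries where either size is unknown (-1 per the HAR spec)."""
    with open(har_path) as f:
        har = json.load(f)
    rows = []
    for entry in har['log']['entries']:
        resp = entry['response']
        h = resp.get('headersSize', -1)
        b = resp.get('bodySize', -1)
        if h >= 0 and b >= 0:
            rows.append((entry['request']['url'], h, b))
    return rows
```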
What else could we consider along those lines for this year?
Do be aware that some of the other HTTP semantics are covered in other chapters:
I do agree it would be interesting to have a discussion on HTTP semantics and things like structured headers (and things that have been going wrong with their practical deployments, cc @yoavweiss).
However, like you say, several newer headers and their impact are discussed elsewhere and cutting-edge stuff like feature policy probably won't show up much this year. We then also should definitely re-name the chapter away from HTTP/2 imo.
As you know, I'm also highly skeptical about the practical impact of HPACK/QPACK for the normal web page loading use case. One area where you'd see improvements would be with large cookies, but I'm not sure if the current test setup is ideal for measuring those (given that European sites shouldn't be setting cookies on first visit (theoretically) and some high-impact cookies probably only come into play after login/shopping cart stuff). However, this could also be an excellent opportunity to prove me wrong on both counts :) It would probably also unearth some cool/disturbing outliers. Do the WPT results include sizes for compressed headers? If not, we might set up something to run the plaintexts through HPACK and QPACK libraries to compare etc.
However, like you say, several newer headers and their impact are discussed elsewhere and cutting-edge stuff like feature policy probably won't show up much this year. We then also should definitely re-name the chapter away from HTTP/2 imo.
Feature Policy was discussed in the security chapter, though annoyingly it didn't discuss actual adoption (very small: looks to be about 1,000 sites at most from the raw data), just which options were used when it was deployed. It's probably grown, but not by that much. Referrer Policy looks to be used a lot more. The point is that use of headers is growing and there is lots of innovation in this space.
As you know, I'm also highly skeptical about the practical impact of HPACK/QPACK for the normal web page loading use case. One area where you'd see improvements would be with large cookies, but I'm not sure if the current test setup is ideal for measuring those (given that European sites shouldn't be setting cookies on first visit (theoretically) and some high-impact cookies probably only come into play after login/shopping cart stuff). However, this could also be an excellent opportunity to prove me wrong on both counts :) It would probably also unearth some cool/disturbing outliers.
I dunno. Some CSP headers are pretty big! But they're on the response where the files are usually much bigger so maybe you're right.
Do the WPT results include sizes for compressed headers? If not, we might set up something to run the plaintexts through HPACK and QPACK libraries to compare etc.
Discussed last year and not easily available.
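Short of running a real HPACK library over the plaintexts, a crude stdlib-only estimate is possible by checking headers against the HPACK static table. To be clear, this is NOT a real encoder: the table subset below is copied from RFC 7541 Appendix A, fully matched entries are costed at 1 byte (an indexed field), and Huffman coding plus the dynamic table are ignored entirely, so it understates real savings.

```python
# Subset of the HPACK static table (RFC 7541 Appendix A).
STATIC_TABLE = {
    (':method', 'GET'), (':method', 'POST'), (':path', '/'),
    (':scheme', 'https'), (':status', '200'),
    ('accept-encoding', 'gzip, deflate'),
}

def estimate_hpack_size(headers):
    """Rough lower-level size estimate for a list of (name, value) pairs,
    names lowercased. Static-table hits cost 1 byte; everything else is
    costed as a raw literal (name + value + 2 bytes of framing)."""
    size = 0
    for name, value in headers:
        if (name, value) in STATIC_TABLE:
            size += 1  # single-byte indexed header field
        else:
            size += len(name) + len(value) + 2  # ignores Huffman coding
    return size
```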
Anyone on this thread interested in taking on the Author role? Or suggestions who could?
@elithrar not sure what role you were thinking of and whether you would be interested in authoring?
@bagder @dotjs , as last year's other reviewers any interest here? Or suggestions of Authors?
And @Lpardue any further suggestions on this after our chat the other week given your role on QUIC-WG?
I'm happy to work on this chapter again. Sounds like a few people are interested in providing some content. I'm happy to pull it all together and convince @LPardue to join in.
@MikeBishop - any interest in co-authoring this chapter?
@siyengar same question to you! 😀
@dotjs just want to confirm that you've reviewed the authoring commitment and the process works for you. Would love to have you as the lead author :)
Yes, I'd be happy to help, as author or reviewer.
I am willing and able to participate on authoring.
@dotjs just want to confirm that you've reviewed the authoring commitment and the process works for you. Would love to have you as the lead author :)
reviewed and looks fine to me
Hi. I would sign up for either chapter reviewer or analyst
@dotjs thank you for agreeing to be the lead author for the HTTP/2 chapter! As the lead, you'll be responsible for driving the content planning and writing phases in collaboration with your content team, which will consist of yourself as lead, any coauthors you choose as needed, peer reviewers, and data analysts.
The immediate next steps for this chapter are:
There's a ton of info in the top comment, so check that out and feel free to ping myself or @rviscomi with any questions!
@MikeBishop @LPardue @rmarx @ibnesayeed @pmeenan @Nooshu I've put you down as reviewers for now, and will leave it to @dotjs to reassign at their discretion
@gregorywolf Put you down as both a reviewer and analyst :)
With this massive line-up already signed up I can stand down this year.
Hey @dotjs, hope you had a great weekend.
As you know, we're trying to have the outline and metrics settled on by the end of the week so we have time to configure the Web Crawler to track everything you need. Anything you need from me to keep things moving forward?
Also, can you remind your team to properly add and credit themselves in your chapter's Google Doc?
Added myself as a reviewer. Know we have a lot of them, but feel I deserve my place having written last year's chapter 😀 @dotjs you gonna move some of the reviewers to co-authors? Or taking on the full task yourself?
@gregorywolf happy to help out with Analysis here if you need any help. And the awesome @pmeenan being on team HTTP/2 will undoubtedly help if we have any questions as to what the HTTP Archive crawl currently does (or can!) get!
Thanks all - Current thoughts are to use co-authors. If everyone who has expressed an interest can request edit access to the doc, we can start to plan the content there. Let's focus on any potentially interesting metrics/measurements that were not part of last year's run.
@rmarx @LPardue Keen on your thoughts on what interesting properties we can measure for QUIC/H3 etc. @pmeenan I'm personally interested in quantifying the impact of multiple domains/protocols on resource loading. This could include the impact of connection coalescence. Any thoughts on how we can quantify the 'thunderdome'? How often is H2 prioritisation even relevant?
@dotjs I'm not sure if it will be possible with bigquery but it might be possible with a script that crawls through the raw HAR files on GCS since the data includes chunk timings (and sizes), priority and connection info.
In theory you could check to see how often a higher priority response download is interrupted by chunks for a lower priority response (ignoring some small amount for headers). You could detect broken HTTP/2 prioritization when it happens on the same connection or cross-connection contention when it happens on a separate connection.
We'd have to noodle a bit to think of how that should be represented as a summary metric.
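A rough sketch of that interleaving check, over a hypothetical flattened view of the HAR chunk data (the real field names and priority encodings vary by WebPageTest version, so treat the input shape as an assumption):

```python
def count_priority_inversions(responses, chunks):
    """responses: {resp_id: (conn_id, priority, start_ms, end_ms)},
    where a lower priority number means more important.
    chunks: list of (resp_id, chunk_start_ms).
    Counts chunks of a lower-priority response delivered while a
    higher-priority response on the same connection is still in flight."""
    inversions = 0
    for resp_id, t in chunks:
        conn, prio, _, _ = responses[resp_id]
        for other_id, (oconn, oprio, ostart, oend) in responses.items():
            if other_id == resp_id or oconn != conn:
                continue  # only compare within the same connection
            if oprio < prio and ostart <= t < oend:
                inversions += 1
                break
    return inversions
```

Cross-connection contention would be the same check with the connection filter inverted; turning the count into a summary metric (e.g. inversions per page, or bytes delivered out of order) is the part that still needs noodling.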
@dotjs How is the outline coming along? Want to get that finished up by the end of the week so we have time to get the Web Crawler setup :)
Have a first pass at an outline. I'm still not sure about going with HTTP with sections for H2 and H3 as suggested by Matt. I've added some thoughts about other things to discuss with regard to HTTP e.g. semantics, DoH and websockets. Any reviewers/authors please add to the doc as I would like as many ideas on what other people would like to see in this chapter as possible. paging @gregorywolf , @Nooshu , @MikeBishop, @ibnesayeed , @bazzadp, @elithrar, @pmeenan and @LPardue
That's pretty comprehensive @dotjs ! Will rack my brains and see if I can think of anything else but can't at the mo...
Hello. I am coming up to speed with the project and specifically my task for the HTTP chapter as an analyst. My goal for this weekend is to finish reviewing all key material. I also want to look at all of the 2019 HTTP SQL queries and start retrofitting them to use on the sample data that @paulcalvano created. I am new to this process so PLEASE direct me as necessary. I look forward to working with the team.
@gregorywolf Here is a comment that might be useful: https://github.com/HTTPArchive/almanac.httparchive.org/issues/914#issuecomment-659205330
And since I’m on this chapter, I’ll update it specifically for this chapter 😀:
Start with the Analysts Guide and set up BigQuery (good guide on that by our very own @paulcalvano who's leading the Analyst team here on the Web Almanac). Also be aware this can be expensive but there's a generous free tier and Paul will provide credits beyond that for Almanac work. There are also sample tables which are much cheaper to query and it should be difficult to go beyond the free budget with those. Then join the #web-almanac Slack and Paul will invite you to the Analysts channel on that.
For this chapter, you can read last year's chapter and look at last year's SQL for this chapter (and the actual results it produced) - both of these are linked at the bottom of the chapter btw. Familiarise yourself with all this, then work with @dotjs and the reviewers to figure out what metrics you want to use this year and then convert them into queries. Would suggest reusing a lot of last year's queries but also adding some to give a fresh take. Liaise with the other Analysts and @paulcalvano if you have any questions on the data set and what's available. I can also help with this as I'm on this chapter, and similarly we're lucky to have @pmeenan the God of WebPageTest (which is what our crawler uses) on this chapter for any queries on what's possible or not.
We're planning to run the crawl for the 2020 dataset throughout August so the critical point is to quickly figure out and implement any custom metrics required for that crawl before it starts. Would hope there shouldn't be too many (if any) as there is quite a lot of detail in the current dataset and we didn't need any for the HTTP/2 chapter last year. Luckily this chapter deals mostly with the headers and metadata rather than stuff in the expensive bodies. Though that may change this year depending on what we want to query.
Hope that helps and gives you something to get started on!
Hi. Quick update. I have updated all of the HTTP 2019 SQL queries. I have not submitted a PR yet. Once the sample_data tables are completed/finalized, I will start testing to make sure the output looks as expected. At that time I will submit a PR. I would be interested to know if anyone has any ideas on what data would be interesting that is above and beyond what was extracted last year.
@gregorywolf check out the analyst workflow doc if you haven't already. It may be helpful to create the PR now as a draft, and use it to keep track of metrics already implemented vs those not yet implemented using a markdown checklist. (steps 4 and 5)
Do any of the 2020 queries require custom metrics? (querying the DOM at runtime)
There are some interesting ideas that may or may not require some digging into the HARs. @rviscomi Is there any precedent for this? For example I'm interested in measuring multiplexing concurrency, concurrent connections etc. @gregorywolf Happy to chat through the metrics whenever you are ready.
There are some interesting ideas that may or may not require some digging into the HARs. @rviscomi Is there any precedent for this?
Could you clarify? Not sure if you're asking if any chapter has looked at the HAR data before or only if this is new for the H2 chapter.
@gregorywolf Took a look over the chapter and it looks like we've got most if not all of the data you need. Can you double check though? Only got a little more time left to make changes to the Crawler to collect extra data
@rviscomi Hi. I just submitted a draft PR for the sql 2019 queries formatted to use the sample_data tables.
@dotjs I think talking live would be great. Let's communicate via Slack DM to coordinate.
@dotjs @gregorywolf for the two milestones overdue on July 27 could you check the boxes if:
Keeping the milestone checklist up to date helps us to see at a glance how all of the chapters are progressing. Thanks for helping us to stay on schedule!
I've updated the chapter metadata at the top of this issue to link to the public spreadsheet that will be used for this chapter's query results. The sheet serves 3 purposes:
Hi. I am very close to finalizing all of the queries for the chapter. @rviscomi has been kind enough to run all of my queries so I do not run into a BQ quota issue. I will provide another update once all of the newest results have been generated and transferred to the results Google Sheet
All of the query results are posted in the results sheet. I have created pivot tables for all of the tabs. Please take a look at the data and provide feedback. I made a decision to NOT filter out any key fields that contain blanks. I will leave the filtering to the author
What's the best venue to provide feedback? Here on the issue, comments in the results sheet, etc.?
Personally I’d prefer it here. Or at the very least an “FYI I’ve made a comment on tabs 1, 2 and 5” type comment on this issue.
First off, thanks for the work you've already put in. This is an immense amount of data to digest, and you've clearly put in a lot of work slicing it into interpretable chunks.
For all of these, the pivot tables you mentioned would be useful to slice things, but I'm not able to actually filter anything in the sheet itself; I'm wondering if that's because I don't have edit access to the sheet? But I can copy the sheet and add filter views, it looks like.
Here's my first pass through the different pages:
Chiming in to give a couple of unsolicited Sheets tips: don't hesitate to request edit access if it'd help you explore the data, and change the default notification settings from "Only Yours" to "All" to be emailed on all comments even if you're not explicitly mentioned.
@MikeBishop , I can answer some of these based on experience last year as author and person who came up with a lot of these stat requests, and investigations I did on some of the same questions on last years stats.
For all of these, the pivot tables you mentioned would be useful to slice things, but I'm not able to actually filter anything in the sheet itself; I'm wondering if that's because I don't have edit access to the sheet? But I can copy the sheet and add filter views, it looks like.
Could be. Could you request edit permission to see?
Here's my first pass through the different pages:
- Adoption of H2 tab: How do we interpret the blank outcome? I don't want to just discard nearly 4% of requests, but it's not clear that it directly maps to any of the other versions, since they are represented.
We had the same last year and investigation showed these to be mostly HTTP/1.1:
Annoyingly, there is a larger percentage where the protocol was not correctly tracked by the HTTP Archive crawl, particularly on desktop. Digging into this has shown various reasons, some of which can be explained and some of which can't. Based on spot checks, they mostly appear to be HTTP/1.1 requests and, assuming they are, desktop and mobile usage is similar.
It's a similar result this year - desktop is ~4% short of mobile and we have ~4% uncategorised.
Even better news is I spent some time on this afterwards (because it bugged me too!) and figured out why this is the case and fixed it - unfortunately too late for this year's Almanac crawl month (August), but we can look at October data to confirm this just before we go live. From the work on that fix we know the "protocol" is not always set for HTTP/1.1 and the parsing to try to pull it out from the request and response was broken. I'm pretty confident the vast majority is HTTP/1.1 and think we should assume this, explain it like I did last year, and quickly double check it after the October run to confirm.
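In query terms, the normalization being proposed amounts to something like this (the blank-means-HTTP/1.1 mapping is the spot-checked assumption described above, not a guarantee):

```python
def normalize_protocol(raw):
    """Map raw crawl protocol strings onto the buckets used in the chapter.
    ASSUMPTION: blank/missing values are treated as HTTP/1.1, per the
    spot checks; unknown values pass through unchanged."""
    value = (raw or '').strip().upper()
    mapping = {
        'HTTP/2': 'HTTP/2', 'H2': 'HTTP/2',
        'HTTP/1.1': 'HTTP/1.1', 'HTTP/1.0': 'HTTP/1.0',
        '': 'HTTP/1.1',  # untracked protocol, assumed HTTP/1.1
    }
    return mapping.get(value, value)
```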
Grouped by server:
- Same about the blanks, but it's more sensible here as some servers don't include that header.
- I wonder about spinning these two tabs together, to see whether there are trends of servers more or less likely to serve HTTP/2. I imagine that, exempting those which simply don't implement HTTP/2, it would turn into a statement about default-on vs. default-off.
Some interesting stats and discussion on that last year. @gregorywolf I added client to some of the pivot tables as the percentages weren't adding up correctly without it (unless Apache really is 95% of server usage 😁)
Alt-Svc headers: I think it would be more useful to break these down into what percentage offer certain things in Alt-Svc, rather than just the discrete header values. (Though I'm very surprised there are enough instances to gather any appreciable percentages on a specific value; when I did a similar query a few years ago, I found that "clear" was the only thing that had enough consistency for that.) For example:
- Percentage that are "clear", the only defined keyword for this header, which we can already see from this table
- Percentage that offer h2
- Percentage that offer various QUIC/H3 versions
- Percentage that refer to same/different host or port
- Distribution of ma values
- How many alternatives per protocol? How many different protocols?
- For the Upgrade header, I'd like the ability to filter those by HTTP/HTTPS. Upgrading to h2c is only supposed to be offered on clear-text connections, but a recent article pointed out that some servers that support it will still do the Upgrade within an HTTP/1.1 TLS connection (presumably because something else is terminating TLS and the server sees it as a clear-text connection).
That's why I'm a fan of giving the raw data and letting authors/reviewers slice and dice as they see it in the spreadsheet! Though can revert to SQL if easier once we know what we want. After digging into the data we should decide what stats are interesting and so what to include in the chapter and in what format.
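For slicing the raw Alt-Svc values into those breakdowns, a small parser along these lines might do. It follows RFC 7838 syntax (`clear` or a comma-separated list of `protocol-id="authority"; ma=seconds` alternatives) but only handles the common shapes, not every quoting edge case.

```python
def parse_alt_svc(value):
    """Parse an Alt-Svc header value into 'clear' or a list of dicts with
    protocol, authority, and (when present) the ma freshness lifetime."""
    if value.strip() == 'clear':
        return 'clear'
    alternatives = []
    for alt in value.split(','):
        parts = [p.strip() for p in alt.split(';')]
        proto, _, authority = parts[0].partition('=')
        entry = {'protocol': proto.strip(), 'authority': authority.strip('"')}
        for param in parts[1:]:
            k, _, v = param.partition('=')
            if k.strip() == 'ma':
                entry['ma'] = int(v)
        alternatives.append(entry)
    return alternatives
```

From the parsed entries it's then trivial to derive the percentages above: count `clear`, count entries offering h2 vs h3-* protocols, compare authorities against the serving host/port, and take a distribution over the `ma` values.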
- I'm more than a little surprised by the number of HTTP/2 connections returning the Upgrade header. That's... supposed to be illegal. Not feedback on the presentation of the data, just... interesting. Thanks for including that.
Again good discussion on this last year - which is where a lot of these queries came from. Will be interesting to see if it's better or worse than last year.
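Both Upgrade observations above boil down to a simple per-response check, sketched here (the header semantics are from RFC 7540; the input shape is assumed):

```python
def illegal_h2c_upgrade(url, response_headers):
    """Flag an 'Upgrade: h2c' offer on a response served over HTTPS.
    RFC 7540 only defines the h2c upgrade for cleartext connections, so a
    True result usually means a TLS-terminating proxy is hiding the real
    transport from the backend. response_headers: dict, lowercased names."""
    upgrade = response_headers.get('upgrade', '')
    offers_h2c = 'h2c' in [token.strip() for token in upgrade.split(',')]
    return url.startswith('https://') and offers_h2c
```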
- Percentage loaded over HTTP: Should I read this as percentage of resources on a page loaded over cleartext, given the protocol used for the base page?
Sorry, I don't understand your question or what you mean by cleartext. Is this the "percentage_of_resources_loaded_over_HTTP_by_version_per_site" tab? That's any HTTP version regardless of HTTPS status.
- TLS version by HTTP version: What does blank mean here? I assume that we're not considering cleartext HTTP/2, so it's not "no TLS" for that. The sampled QUIC versions are presumably using Google Crypto, so the advertisement of any TLS version is interesting, even though small.
Yes we should dig into this more. Suspect it's QUIC and TLS version is not being recorded correctly, but that's a guess. This is a new stat for this year btw so nothing to compare on this last year. There's a lot but Google does account for a lot of traffic when looking at request level (between Google Analytics, Ads and Marketing tags, YouTube, Google Fonts..etc.) so it's possible. Definitely one to dig into @gregorywolf .
- Percentage loaded over HTTP: Should I read this as percentage of resources on a page loaded over cleartext, given the protocol used for the base page?
Sorry, I don't understand your question or what you mean by cleartext. Is this the "percentage_of_resources_loaded_over_HTTP_by_version_per_site" tab? That's any HTTP version regardless of HTTPS status.
"Percentage of resources loaded over HTTP" as opposed to what? That is, where the number is less than 100% loaded over HTTP, what were the other resources loaded over? I could read this as HTTP vs. HTTPS, same versus different version used for subresources, network vs. cache, references to data:
URLs that don't hit the network, etc.
Or it's something totally different and I'm having a total mental disconnect figuring out what this query is measuring.
Ah, gotcha now. Yeah, I don't understand this stat either. I would expect each line to add up to 100%, so we have for example 30% HTTP/1.1 and 70% HTTP/2. @gregorywolf?
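For what it's worth, the computation I'd expect behind a stat like that, so each page sums to 100%, is something like this (the input shape is hypothetical, not what the query actually does):

```python
from collections import Counter

def protocol_share(requests_by_page):
    """requests_by_page: {page_url: [protocol, protocol, ...]}.
    Returns, per page, the percentage of its requests on each HTTP version;
    the percentages for any one page always total 100."""
    shares = {}
    for page, protocols in requests_by_page.items():
        counts = Counter(protocols)
        total = sum(counts.values())
        shares[page] = {proto: 100.0 * n / total for proto, n in counts.items()}
    return shares
```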
All. I have been away for a bunch of days and am just getting back on line. I will take a look at the above comments and comment in the next few days.
Part IV Chapter 22: HTTP/2
Content team
Content team lead: @dotjs
Welcome chapter contributors! You'll be using this issue throughout the chapter lifecycle to coordinate on the content planning, analysis, and writing stages.
The content team is made up of the following contributors:
New contributors: If you're interested in joining the content team for this chapter, just leave a comment below and the content team lead will loop you in.
Note: To ensure that you get notifications when tagged, you must be "watching" this repository.
Milestones
0. Form the content team
1. Plan content
2. Gather data
3. Validate results
4. Draft content
5. Publication