foxdavidj commented 4 years ago

Part IV Chapter 22: HTTP/2

Content team

Authors	Reviewers	Analysts	Draft	Queries	Results
@dotjs	@MikeBishop @LPardue @rmarx @ibnesayeed @pmeenan @Nooshu @gregorywolf @bazzadp	@gregorywolf	Doc	*.sql	Sheet

Content team lead: @dotjs

Welcome chapter contributors! You'll be using this issue throughout the chapter lifecycle to coordinate on the content planning, analysis, and writing stages.

The content team is made up of the following contributors:

New contributors: If you're interested in joining the content team for this chapter, just leave a comment below and the content team lead will loop you in.

Note: To ensure that you get notifications when tagged, you must be "watching" this repository.

Milestones

0. Form the content team

[x] Jul 6th: Project owners have selected an author to be the content team lead
[x] Jul 13th: The content team has at least one author, reviewer, and analyst (minimally viable team formed)

1. Plan content

[x] Jul 20th: The content team has completed the chapter outline in the draft doc
[x] Jul 27th: Analysts have triaged the feasibility of all proposed metrics

2. Gather data

[x] Jul 27th: Analysts have added all necessary custom metrics and drafted a PR to track query progress
Aug 1 - 31: August crawl
[x] Sep 7th: Analysts have queried all metrics and saved the output to the results sheet

3. Validate results

[x] Sep 14th: The content team has reviewed the results sheet

4. Draft content

[x] Nov 12th: Authors have completed the first draft in the doc
[x] Nov 26th: The content team has prototyped all data visualizations

5. Publication

[x] Nov 26th: The content team has reviewed the final draft, converted to markdown, and filed a PR to add it to the 2020 content directory
Dec 9th: Target launch date

gregorywolf commented 4 years ago

I have reviewed all of the comments and had a chance to speak live with @MikeBishop and @dotjs. The following is a summary of issues/comments.

adoption_of_http_2_by_site_and_requests @bazzadp You mentioned that you dug into a concerning issue and "fixed" something but it was not in time for the August crawl. Please outline what was fixed. Should the query be modified and rerun against the now posted September data?

count_of_h2_and_h3_sites_grouped_by_server I am acknowledging the change @bazzadp made to the associated results tab pivot table

detailed_alt_svc_headers I am going to see if I can modify the query to pull out the desired data outlined by @MikeBishop:

Percentage that are "clear", the only defined keyword for this header, which we can already see from this table
Percentage that offer h2
Percentage that offer various QUIC/H3 versions
Percentage that refer to same/different host or port
Distribution of ma values
How many alternatives per protocol? How many different protocols?

Once the revised data is posted in results tab I will create a pivot table that filters by HTTP/HTTPS

percentage_of_resources_loaded_over_HTTP_by_version_per_site Acknowledging the feedback that the resulting data is confusing and ambiguous. I will figure out how to make this data more useful

tls_adoption_by_http_version Going to dig into the data to figure out why we are getting blanks for TLS version for HTTP 0.9, 1.0, 1.1, and H2

Final item is that in speaking with @dotjs he expressed a desire to incorporate TCP connection information into the results in order to draw conclusions on efficiency of protocol usage based on available bandwidth. I will dig into this but initial feedback I have received is that this may not be possible.

@pmeenan Are you aware of any field or way that we could extract meaningful bandwidth usage out of the HA crawls? @dotjs Jump in an provide additional context as required

tunetheweb commented 4 years ago

http_2_by_site_and_requests @bazzadp You mentioned that you dug into a concerning issue and "fixed" something but it was not in time for the August crawl. Please outline what was fixed. Should the query be modified and rerun against the now posted September data?

As you're aware we used the protocol field as the reqHttpVersion and respHttpVersion often was blank or contained non-sensical values like (us:). However the protocol field was also often blank (and particularly for HTTP/1.1 messages on Desktop by looks of things) so that's not ideal either :-(

Anyway I discovered that the reqHttpVersion and respHttpVersion fields try to just parse the messages for an-HTTP/1 style message (e.g. GET / HTTP/1.1 request or HTTP/1.1 200 OK response). Obviously this was never going to work in an HTTP/2 world, but it also didn't work too well for HTTP/1 messages due to some other bugs in it (which also meant that HTTP/2's status: 200 pseudo-header was parsed as if it was an HTTP/1 response - hence where the us: value came from).

So basically I fixed the reqHttpVersion and respHttpVersion fields with this pull request to fix some of the logic and also fall back to the protocol field when that still doesn't work (e.g. HTTP/2). This should give us the best of both worlds and allow us to revert back to the reqHttpVersion and respHttpVersion fields with more confidence.

This will first be available in the October crawl (it wasn't in time for August for Web Almanac, nor September). As mentioned above, I would assume the blanks are HTTP/1.1 (they appeared to be from my investigations last year) and hen we can validate this assumption once we have the October data, which should be available just before publication.

I still don't know why the protocol field is sometimes blank, but this appears to be set by the browser so less in our control to fix.

gregorywolf commented 4 years ago

The original query percentage_of_resources_loaded_over_HTTP_by_version_per_site has been revised and renamed to average protcol requests per page. The intention of this query to answer the question on a given page what are the average number of resources loaded for a given HTTP version.

The original query tls_adoption_by_http_version has been revised and the new results are now contained in TLS versions per page (same-domain). I worked closely with @bazzadp and @tomvangoethem to minimize the NULL entries. More details can be viewed in #1344

The results tab detailed_alt_svc_headers has been modified by adding two additional columns, contains h3? and contains quic?. A new tab has been created called detailed_alt_svc_headers_pivot which contains two pivots tables of the resulting column data. The last item that was done is another tab was created called detailed_alt_svc_headers_unique which is an extraction of the 'ugrade column from detailed_alt_svc_headers tab. From here it is very easy to see the various components laid out in a readable format.

MikeBishop commented 4 years ago

Thanks, @gregorywolf. Reading through the results again, here are some updated comments and observations:

count_of_h2_sites_grouped_by_server: Given that the values sum to 100%, I believe this table is showing the distribution by server of the HTTP/2 traffic. That's an interesting metric, particularly to the extent that it differs from the traffic distribution of HTTP/1.1 or overall HTTP traffic. I think the original intent was the percentage of traffic to each server type which is using HTTP/2. We should be able to create those visualizations by combining this with the count_of_non_h2_sites_grouped_by_server tab.
average protocol requests per page: I'm still having trouble parsing this. At first, I thought this was saying that a typical page loads two thirds of its subresources over HTTP/2. But the numbers sum to greater than 100%, so that interpretation doesn't work. Based on the description, I'd be expecting something like a scatterplot or CDF of number of subresources per base page, with a separate graph per protocol used to load the base page.
detailed_alt_svc_headers_pivot et al.: This is a good start, and I think the remainder can be drawn from the data already in the page. Some refinements I'll work on adding if you don't mind me editing your post-processing formulas:
- I think the split between h3 and quic is a little murky, given that we have HTTP/3 over IETF QUIC (h3-29), HTTP/3 over Google QUIC versions (h3-t051, h3-q050), and non-IETF Google QUIC (quic with v= parameters), as well as HTTP/2 (h2). We may want to clarify with someone from Google (@ianswett, @DavidSchinazi?) that we're classifying these tokens correctly, and split them into three buckets rather than two.
- I'd also like to be able to slice by whether the target is same-host or different host, since support for cross-host Alt-Svc varies so much.
- I'd like to get a distribution of the max-age values; I'm assuming they have some common peaks. It would also be interesting to see whether the max-age values ever vary within an instance; that is, does everything advertised always have equal lifetime?
measure_number_of_tcp_connections_per_site: Assuming I'm reading the lower table correctly, the impact of a multiplexed transport on this metric is minor at the median, but very noticeable at both high and low percentiles. The upper table, however, appears to just be the sum of the data in the lower table rather than indicating equivalent data across all protocol versions. I don't think it needs to be rebuilt for that (we can just ignore it), but it's misleading.
adoption_of_http_2_by_site_and_requests (which I assume is more about requests than sites) and measure_of_all_http_versions_for_main_page_of_all_sites: The combination of these two is interesting. It says that roughly half of all main pages are over HTTP/2 (and hardly any QUIC; possibly an artifact of how the run proceeds), but two-thirds of requests are over HTTP/2. That suggests the average HTTP/2-enabled site makes more requests for subresources than an HTTP/1.1-enabled site. Also that some of these subresources are loaded over QUIC. This may tend to inflate the connection number per page, since an initial request may be over HTTP/2 and an Alt-Svc header causes a new QUIC connection to be used for subresources. A second load of the page would presumably run entirely over QUIC and use fewer connections.
number_of_h2_and_h3_pushed_resources_and_avg_bytes: Obviously, these are percentiles out of the subset of connections where pushes are non-zero, which is small. The fact the QUIC appears to push more aggressively is even more notable given the previous bullet showing that QUIC is almost never used on the base page in these runs. That means that requests for subresources are pushing other things. That's not generally how we expect push to work, which is interesting.
number_of_h2_and_h3_pushed_resources_and_bytes_by_content_type: I assume each of these are out of the subset of connections where at least one resource of the given type was pushed. That leads to some interesting looking curves. For example, if any XML was pushed, exactly one XML was pushed because it's 1 at 10th and 90th percentile with the same byte count. Logic suggests that there's probably a sample size of one there. I think it would be more interesting to take these out of all connections that use push, if we can draw it in a way that's not misleading. That is, of connections that use push, what are they pushing? Do some sites push all JS while other sites push a mix of types?

ianswett commented 4 years ago

h3-t051 is a variant of gQUIC using TLS 1.3 and h3-q050 is a version of gQUIC. Both versions use IETF QUIC invariant headers.

DavidSchinazi commented 4 years ago

To clarify, here are the Alt-Svc values currently supported by google.com: 1) IETF drafts of HTTP/3: h3-29, h3-27 2) HTTP over Google QUIC versions that use the IETF Alt-Svc format: h3-Q050, h3-Q046, h3-Q043, h3-T051, h3-T050 3) HTTP over Google QUIC versions that use the legacy Google Alt-Svc format: quic; v="46,43" (note that this advertises the same Google QUIC versions that are advertised by h3-Q046, h3-Q043 in the IETF format) (also note that this old format will be removed soon so there's not much need to discuss it apart from documenting history)

tunetheweb commented 4 years ago

count_of_h2_sites_grouped_by_server: Given that the values sum to 100%, I believe this table is showing the distribution by server of the HTTP/2 traffic. That's an interesting metric, particularly to the extent that it differs from the traffic distribution of HTTP/1.1 or overall HTTP traffic. I think the original intent was the percentage of traffic to each server type which is using HTTP/2.

Be careful with the word "traffic". The HTTP Archive has no concept of traffic and crawls all it's sites evening so www.google.com will get just as much weighting as barrystinysite.com (assuming that the site meets the minimum threshold to be included in CrUX and so HTTP Archive). Better to think of it as sites rather than traffic.

adoption_of_http_2_by_site_and_requests (which I assume is more about requests than sites) and measure_of_all_http_versions_for_main_page_of_all_sites: The combination of these two is interesting. It says that roughly half of all main pages are over HTTP/2 (and hardly any QUIC; possibly an artifact of how the run proceeds), but two-thirds of requests are over HTTP/2.

Surely it's unsurprising hardly any pages are loaded over QUIC since (until very, very recently) Chrome (Which the HTTP Archive crawler uses) only loaded pages over QUIC for Google owned properties and not other sites unless a command line flag is used? Except maybe a few origin trails - was that a thing for QUIC support? Google pages are relatively few when compared to the 6.5million pages we crawl. And I've even checked a few blogspot pages and app engines (as a Google owned properties but with potentially more domains) and they don't appear to be QUIC enabled yet.

In fact I got so curious what these pages are and queried all of the QUICs sites and was surprised to see loads of non-Google properties! I've added these as a new tab.

Does anyone know what are the criteria for Chrome used in August (when the crawl ran) to decide whether QUIC was used or not? As I say I thought it was only used on Google Properties to surprised by this.

Further investigation also shows some oddities in WebPageTest and how it decides whether a request is the main page - particularly for QUIC. Doesn't look entirely accurate to me for QUIC (much more accurate for the other protocols), which might explain why ANY main pages show QUIC as, as you sake @MikeBishop , we would have expected the first request to be TCP and only subsequent requests to be QUIC (except maybe for some Google properties if QUIC support is baked into the Chrome code?).

It says that roughly half of all main pages are over HTTP/2 (and hardly any QUIC; possibly an artifact of how the run proceeds), but two-thirds of requests are over HTTP/2. That suggests the average HTTP/2-enabled site makes more requests for subresources than an HTTP/1.1-enabled site. Also that some of these subresources are loaded over QUIC. This may tend to inflate the connection number per page, since an initial request may be over HTTP/2 and an Alt-Svc header causes a new QUIC connection to be used for subresources. A second load of the page would presumably run entirely over QUIC and use fewer connections.

I'm not sure I agree with that first sentence @MikeBishop - are you not considering the impact of third-party sub resources here? For example, if example.com loads over HTTP/1.1 but then uses Google Fonts or Google Analytics then it will have an HTTP/2 (or even QUIC request) for those two sub-resources so a measure of HTTP/1.1 and HTTP/2 is incomplete to make any assumptions here unless we include the home page protocol as well. It would be a similar story for sharded domains if example.com only supported HTTP/1.1 but assets.example.com supported HTTP/2.

ianswett commented 4 years ago

Google QUIC and IETF QUIC are both enabled based on Alt-Svc advertisement. There isn't currently and to my knowledge there has never been an explicit list of 'Google sites' for which QUIC is enabled, but disabled for other sites. Akamai has been supporting Google QUIC for a while and no special configuration was necessary to allow that.

gregorywolf commented 4 years ago

average protocol requests per page: I'm still having trouble parsing this. At first, I thought this was saying that a typical page loads two thirds of its subresources over HTTP/2. But the numbers sum to greater than 100%, so that interpretation doesn't work. Based on the description, I'd be expecting something like a scatterplot or CDF of number of subresources per base page, with a separate graph per protocol used to load the base page.

@MikeBishop In regards to the percentage exceeding 100% I think this is to be expected. Since the calculation is based on average I think the averages for each protocol will get skewed based on the large data size. With that said I will reevaluate the query and figure out how to tighten up the results.

tunetheweb commented 4 years ago

Been looking at this average protocol requests per page query at @gregorywolf 's request and think I understand why. I've submitted a pull request in #1368 to fix this, though it needs reviewing. In meantime I've added the data from that new query, in addition to Greg's, to the spreadsheet and it adds up to 100% (though we still have the null protocol requests we've discussed before).

I've also added a second query showing the percentile of sites using HTTP/2 or above and it makes interesting reading I think:

Did you know that less than 7% of sites make no HTTP/2 or QUIC requests at all? Guess the likes of popular third parties (e.g. Google Analytics, Google Fonts, Facebook/Twitter advertising tracking tags) all supporting HTTP/2 or above mean just about everyone (well 93% of sites), use at least a little of the new protocols.

And 10% of sites make only HTTP/2 or QUIC requests - with no HTTP/1.1 requests at all! Originally I thought that was quite high, but the more I think about it, the more I'm surprised it's not higher since we know about half of home pages are now served over HTTP/2 and you'd think that most popular third-parties would have adopted it by now. Still it's more than the 7% of HTTP/1 only sites 🙂

Interesting stats I thought anyway, but would like someone to double check my work to make sure I'd not made a mistake in this. @gregorywolf can you look over the new queries for a start and then will also hopefully get someone else on the analysts team to check too. Will let you all know if they are changed and when merged.

dotjs commented 4 years ago

Thanks @bazzadp. I was looking into this data last night and wondered what we could capture above and beyond the percentage of total requests over http/2. If 50% of first party HTML is now HTTP/2. How does that compare with last year? I like the percentiles concept which can show but as you mention it will reflect the common third party tags. @gregorywolf Do we have the data to show resource level distributions ? Interested to see the distribution for common static asset serving domains.

tunetheweb commented 4 years ago

I added the 2019 percentiles for requests by site to the sheet for comparison. Not too different truth be told, though numbers have gone up as expected.

Last year I looked at all home pages (about 36% of home pages were served over HTTP/2) and also HTTPS only since HTTP/2 is only support in browsers over HTTPS (about 55% of HTTPS home pages were served over HTTP/2). Looks like we tried to gather that again this year and looks to be 50% overall and 65% for HTTPS.

We could look at just domains matching the home page, however that will exclude shared assets domains (e.g. assets.example.com). Might be better looking the Third Party chapter for other ideas to quantify this?

gregorywolf commented 4 years ago

Hi. I have reviewed the changes made by @bazzadp and agree the new results look good

foxdavidj commented 3 years ago

@dotjs in case you missed it, we've adjusted the milestones to push the launch date back from November 9 to December 9. This gives all chapters exactly 7 weeks from now to wrap up the analysis, write a draft, get it reviewed, and submit it for publication. So the next milestone will be to complete the first draft by November 12.

However if you're still on schedule to be done by the original November 9 launch date we want you to know that this change doesn't mean your hard work was wasted, and that you'll get the privilege of being part of our "Early Access" launch.

Please see the link above for more info and reach out to @rviscomi or me if you have any questions or concerns about the timeline. We hope this change gives you a bit more breathing room to finish the chapter comfortably and we're excited to see it go live!

dotjs commented 3 years ago

Yes saw the note. Just had a very busy week or so so the extra time is useful. Will continue with the analysis and draft.

tunetheweb commented 3 years ago

Hi all

In a previous comment above, I'd commented on the fact that 4% of requests did not list the protocol . I'd mentioned that I'd identified one reason and submitted a fix to WebPageTest and the results would be available after the October crawl. It now looks like that crawl has finished so can share these results with you.

The results are in this sheet but will summarise them for you here.

We have basically three versions of the HTTP protocol:

protocol as reported by Chrome - this is what we used for our analysis but it's sometimes blank.
response protocol as parsed from HTTP/1.1 requests (e.g. 200 OK HTTP/1.1 response lines)
request protocol as parsed from HTTP/1.1 requests (e.g. GET / HTTP/1.1 request lines)

The bug was in processing the last two incorrectly meaning it included blank lines, and also bits of the HTTP/2 pseudo headers.

It is also possible to get slightly different versions if client requests HTTP/1.0 (or even HTTP/0.9) and gets a response back as HTTP/1.1. If we look at them in that order of precedence we got below in August crawl:

http_version	desktop	mobile
	3.95%	0.34%
1.1	0.00%	0.00%
: /	0.53%	0.01%
http/0.9	0.00%	0.00%
http/1.0	0.04%	0.03%
http/1.1	30.56%	34.09%
HTTP/2	63.70%	63.78%
http/2+quic/46	1.20%	1.70%
me:	0.00%
od:	0.00%	0.00%
ori	0.01%	0.00%
Grand Total	99.99%	99.95%

So here we see our problem as we're seeing 3.95% of desktop requests unclassified and also some rubbish (1.1, :/, me:, od:, ori - the later three being the incorrect parsing of the HTTP/2 :status, :method and :origin pseudo headers). As can be seen it affects Desktop more than Mobile for some reason. It was my opinion that that 3.95% was most likely HTTP/1.1 requests as then desktop and mobile would be roughly inline, but I wanted to confirm this.

The October crawl results are shown below:

http_version	desktop	mobile
	0.05%	0.07%
h3-Q050	0.95%	1.33%
http/0.9	0.00%	0.00%
http/1.0	0.03%	0.03%
http/1.1	33.28%	32.93%
HTTP/2	65.69%	65.62%
QUIC	0.01%	0.00%
Grand Total	100.01%	99.98%

So, pleasingly there are now very few unclassified results (0.05% for desktop and 0.07% for mobile) and mobile and desktop are very much inline. Mobile has a few more h3-Q050 results, which started rolling out in Chrome in October and a few less HTTP/2 results, but those h3-Q050 results most likely would have been HTTP/2 if it was not switched on at the time of the desktop crawl at which point they are very similar.

Looking at the underlying stats in the October sheet it looks like there is still some gibberish for the request_http_version whcih I'll see if I can fix, but as that's used last precedence it's only picked up for 1 site in each crawl (where it is correctly set!) so that can be ignored for now. Will try to fix it before next year's run!

So I think it's safe to say the unclassified 3.95% is mostly HTTP/1.1. And hopefully next year we'll not have this anomaly in our stats.

Let me know if you have any questions.

DavidSchinazi commented 3 years ago

@bazzadp wrote: Mobile has a few more h3-Q050 results, which started rolling out in Chrome in October

That's not quite right. Chrome rolled out h3-Q050 in June 2020. In October 2020, Chrome rolled out h3-29 in addition to h3-Q050. In other words:

from June 2019 to June 2020: Chrome used http/2+quic/46 (that version is sometimes also referred to as h3-Q046)
from June 2020 to October 2020: Chrome used h3-Q050
since October 2020: Chrome supports both h3-29 and h3-Q050 and uses the one that the server prefers

The above should apply equally to Desktop and Mobile.

tunetheweb commented 3 years ago

Ah sorry you’re right - difficult to keep up with all these version numbers! Then maybe the difference might be just due to the difference (and extra) sites mobile crawls? We crawled 16% more mobile sites than Desktop in October, and some of them are different, so that might explain it (e.g. if desktop sites are more corporate sites with less Google Analytics and Google AdWords... etc). Anyway I’m guessing now.

Still, I think the findings still stands that the missing 4% on desktop is mostly HTTP/1.1. You can see from filtering the October sheet on where protocol is blank and you see that 3.63% of the desktop requests fall into this category but with a response version of HTTP/1.1 based on parsing the response itself.

You agree?

Btw I also submitted that further fix to WPT to avoid the weird requests versions we still see an @pmeenan has kindly merged already. So should be in a much better state next year.

I do wonder why Chrome fails to set the version for these ~4% of HTTP/1.1 request in the protocol field though and so why WPT has to fall back to finding it by parsing the response? Might dig up some examples to find it and raise a bug with the Chrome team if I do figure it out. Unless anyone here has any ideas?

DavidSchinazi commented 3 years ago

That's definitely odd. Please do file a bug at https://crbug.com ideally with repro steps (such as an example URL that's causing issues) if possible

dotjs commented 3 years ago

@bazzadp . I took a look at the third parties chapter and I think there is some interesting info if we join against the third_parties table third_party AS ( SELECT category, domain FROM `httparchive.almanac.third_parties` WHERE date = '2020-08-01' ) I started to dig into the distributions of 1st vs 3rd party by protocol and category but as I am no longer an analyst it's no longer free to query. If someone could run a query joining with the protocol request count and possibly content type.

tunetheweb commented 3 years ago

@dotjs do you have the query you want to run to hand? Or didn't get that far?

gregorywolf commented 3 years ago

@dotjs Please provide some more detail about what your looking to see and I will get the query run and post the results

tunetheweb commented 3 years ago

@dotjs , after our discussion on slack, I stole some queries and adjusted them to include the % of protocol of HTTP/2 and QUIC and came up with the following two metrics: https://docs.google.com/spreadsheets/d/1op_UrJGo7CGRXWy5iK7-aQ1lHALEm4_8gkXM2huyvL0/edit?usp=sharing

Let me know if that's along the lines of what you are thinking and, if so, can work with @gregorywolf to add these queries to the report and run against the full data set (the results are on a 10k random sample set).

Or if there's some other way you'd rather see the data then let us know.

MikeBishop commented 3 years ago

I've updated the functions for identifying h3 (h3-\d+=) and Google QUIC ((quic|h3-[qt]\d{3})=) in the Alt-Svc page, as well as added two additional columns to identify cross-host entries (="[^":]+:) and to extract the max-age. I'm not trying to handle multiple max-ages used in the same header since it doesn't appear common from a cursory glance.

gregorywolf commented 3 years ago

@dotjs , after our discussion on slack, I stole some queries and adjusted them to include the % of protocol of HTTP/2 and QUIC and came up with the following two metrics: https://docs.google.com/spreadsheets/d/1op_UrJGo7CGRXWy5iK7-aQ1lHALEm4_8gkXM2huyvL0/edit?usp=sharing

Let me know if that's along the lines of what you are thinking and, if so, can work with @gregorywolf to add these queries to the report and run against the full data set (the results are on a 10k random sample set).

Or if there's some other way you'd rather see the data then let us know.

@dotjs Please take a look at the results that @bazzadp posted. If this data provides you the info your are looking for let me know and I will have the queries run against the full data set and posted in the Results sheet.

dotjs commented 3 years ago

@gregorywolf The results describe 1st party vs 3rd party. Is it more accurate to describe as 'known 3rd party' vs other requests (i.e first party and static hosts). Could you split the query for HTTP and non-HTTP sites i.e. I'm interested if anything is different for the 3rd party distributions and I think it will disambiguate the not known 3rd party. If possible I would like to plot a CDF of the distributions. The current data tells me that under 10% of sites have less than 50% of 3rd party requests over HTTP/2 and over half the sites have 95% or more. It might be interesting to look at some more points below 25%. The same comment applies to the breakdown by content type and category.

The other ask I have is join the HTTP/2 non HTTP/2 firstHTML data with a page rank. See https://github.com/HTTPArchive/almanac.httparchive.org/issues/1378 for further details. I think it will be interesting to show if the non-adoption of HTTP/2 is indeed in the long-tail.

gregorywolf commented 3 years ago

@dotjs I have read your request and I am not really following what you are requesting. Please elaborate. Thanks.

dotjs commented 3 years ago

@gregorywolf

Can I see more percentiles than [10,25,50,75,90] as the data for non-3rd party is [0%,0%,60%, 100%, 100%]. Only 1 data point that isn't 0 or 100. Otherwise it is hard to define shape of distribution.
1. I wanted this query split by firstHTML over TLS as I wanted to see if there was a significant difference in other assets possibly being served over TSL and hence maybe H2. For TLS origins it is also interesting to see how 3rd party vs non 3rd party H2 distributions look. This is more important than 1.
Ask for a join against a page rank. Just to see if all the larger sites have all migrated to H2 and it is the smaller sites on Apache/IIS and H1.

rviscomi commented 3 years ago

A couple of thoughts:

Most chapters standardize on [10,25,50,75,90] to summarize the distribution. If there's a particular value of interest, you could measure the % below that threshold. But for the sake of communicating the distribution to readers, be cautious with how deeply statistical you get, and simpler/fewer percentiles make the distribution easier to digest.

As for page rank, we don't have a reliable data source for that info, so it's not possible. Other chapters have been interested in this too but for consistency I'm discouraging its use.

tunetheweb commented 3 years ago

I've updated and rerun the queries in my test sheet against the full dataset and with 10% percentiles (plus 5% and 95%). However not sure it makes much differences - since third-party and CDN support of HTTP/2 is so high there tends to be a very big, very quick jump from 0% to 100%. So while I think there is a good reason to move away from the fewer number of "standard percentiles" most other chapters use, to see when that cut happens, I don't think you've going to see much of a spread here.

I'm not sure what value there is to split by firstHTML? By definition third party can't be firstHTML. And we know that most sites (especially third-party's which typically use a CDN) are served over HTTPS (though I admit that protocol-relative URLs are probably still common).

Ultimately we're now very late on in the day to be continually adding new metric requests. We had the chance to suggest metrics previously and these didn't come up so think we have to look at what we've got and seeing what we can use from them. While I want you to have the data you need to write this chapter, I'm just concerned that we can start going down rabbit holes here and continually add to the data.

We need to get these bits of SQL added to git and reviewed as part of that, if you are intending to use them and then copy the data to the proper results spreadsheet - it's entirely possible I've made a mistake in this SQL! - and that will take time and effort from the other almanac analysts. So I would strongly suggest calling it a day on the data we have and seeing if we have enough in that to write the chapter.

dotjs commented 3 years ago

Agreed thanks both for keeping me on track

tunetheweb commented 3 years ago

FYI queries have been merged into repo (looks like no mistakes!) - thanks @gregorywolf for submitting #1419 and for copying data to the real spreadsheet.

dotjs commented 3 years ago

The first draft is close enough for review. There are a few questions already raised regarding the HTTP_VERSION of HTTP/2+gQUIC and QUIC. Could @gregorywolf or @barrypollard please confirm the QUIC version in particular. There is a bit more work on H3 in practice and conclusion. I'll try and review that with Lucas tomorrow.

tunetheweb commented 3 years ago

Could @gregorywolf or @barrypollard please confirm the QUIC version in particular.

We observed two values in the August data:

HTTP/2+QUIC/46
QUIC

For the first, this comment further up from @DavidSchinazi says that http/2+quic/46 is sometimes also referred to as h3-Q046.

For the second I'm not sure what the version is. That's all that was reported. The alt-svc header has the following:

"name": "alt-svc",
"value": "h3-29=\":443\"; ma=2592000,h3-27=\":443\"; ma=2592000,h3-T050=\":443\"; ma=2592000,h3-Q050=\":443\"; ma=2592000,h3-Q046=\":443\"; ma=2592000,h3-Q043=\":443\"; ma=2592000,quic=\":443\"; ma=2592000; v=\"46,43\""

And you can see, if you scroll to the right, there is an alt-svc type of just quic. @DavidSchinazi any ideas what this is?

DavidSchinazi commented 3 years ago

quic was the old Alt-Svc name for QUIC, and it communicated the specific version of QUIC using the v parameter. In this case, quic=":443"; v="46,43" means the exact same thing as h3-Q046=":443", h3-Q043=":443". We deprecated that old quic format in version h3-Q047, so newer versions such as h3-Q050 no longer use it.

tunetheweb commented 3 years ago

OK so based on that both QUIC and HTTP/2+QUIC/46 is basically IETF QUIC? And probably draft 46. Any reason why we have two values for that?

Also it means that gQUIC is basically not measured - which ties in with last year when we didn't see gQUIC in our stats despite it being used, and saw it captured under HTTP/2.

If that all correct? Or am I misunderstanding this?

If so, @dotjs it seems like we should just merge those status under HTTP/3 (possibly with a caveat that it's not the final version of HTTP/3) and make a note that gQUIC is not measured separate from HTTP/2.

rmarx commented 3 years ago

I wouldn't say that Q046 is entirely the same as IETF QUIC (there's no such thing as draft 46), but that it's also not the original gQUIC anymore, but kind of an in-between version with gQUIC evolving to IETF QUIC over time (correct me if I'm wrong, @DavidSchinazi).

I don't think that type of nuance necessarily has to be conveyed here though and we can just name this HTTP/3 (though indeed mentioning that it's experimental versions of H3).

DavidSchinazi commented 3 years ago

Here's the history: Google QUIC was a project at Google providing an alternative to TLS/TCP. At the time, the mindset was that we would run HTTP/2 over QUIC. When QUIC was brought to the IETF, the group decided to make more changes to the HTTP/2-over-QUIC layer and, after those, decided to rename HTTP/2-over-QUIC to HTTP/3. At that time, the IETF decided that the Alt-Svc for the HTTP/3 RFC would be h3 and that the ALPN for IETF QUIC drafts would be h3-nn where nn is the draft number (e.g., the most widely deployed today is h3-29). After this, Google decided to rename Google QUIC versions to match the IETF format: so Google replaced http/2+quic/46 with h3-Q046 - they're still the same version of HTTP and of Google QUIC, it's just that it has a new name that's more consistent with IETF QUIC.

So, today, we have:

h3-nn (where nn is a number) is IETF QUIC draft nn -- today Google supports only h3-29.
h3-Q0nn or h3-T0nn(where nn is a number) is Google QUIC version nn-- today Google supports h3-Q043, h3-Q046, h3-Q050, and h3-T051.

tunetheweb commented 3 years ago

Ah ok so I did get it completely the wrong way about 😀 Thanks for explaining.

So we only have gQUIC for the time of the crawl and no IETF QUIC (am sure that would be different if we crawled now but we’re basing our data on the August crawl). And so we should just treat QUIC and HTTP/2+QUIC/46 both as the same and as gQUIC.

DavidSchinazi commented 3 years ago

Yes, anything that involves http/2+quic/nn or Alt-Svc: quic is guaranteed to be gQUIC

rviscomi commented 3 years ago

@dotjs @MikeBishop @LPardue @rmarx @ibnesayeed @pmeenan @Nooshu @gregorywolf @bazzadp @gregorywolf All: this chapter's draft is looking great, thank you all for your hard work! If all reviewers have already read it and left their feedback then we're in great shape to have it ready for the launch in two weeks. If not, please try to submit all of your feedback by the end of the week to keep us on schedule. Thanks!

ibnesayeed commented 3 years ago

The chapter looks in a very good shape. I have provided some feedback.

HTTPArchive / almanac.httparchive.org

HTTP/2 2020 #921

Part IV Chapter 22: HTTP/2

Content team

Milestones

0. Form the content team

1. Plan content

2. Gather data

3. Validate results

4. Draft content

5. Publication