HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0

Ranking of crawl data - a discussion and potential option #1378

Closed tunetheweb closed 3 years ago

tunetheweb commented 3 years ago

There have been a few requests for the ability to rank the data, to give some indication of which sites are popular or heavily used.

This has been discussed in the past by the HTTP Archive team and also in https://github.com/HTTPArchive/httparchive.org/issues/125 and https://github.com/HTTPArchive/legacy.httparchive.org/issues/75 and https://github.com/HTTPArchive/hosts/issues/1 but we haven't yet been able to come up with an easy solution to this.

As I understand it, the HTTP Archive used to use the Alexa top 1 million list to do the crawl, but then moved to using CrUX, which is a list of popular domains visited by Chrome users. While this satisfies the HTTP Archive's desire to get a list of popular sites to crawl, it does mean all sites are treated equally and there is no concept of ranking. So my tiny little blog gets the same relevance as Google.com or other much more visited sites in the HTTP Archive dataset.

Alternatives have been proposed, including Tranco, which is available in BigQuery already. However, previous investigations have shown these do not cover the HTTP Archive dataset well, with as many as 75% of our sites unranked.

@tomvangoethem and I (mostly Tom!) spent a good bit of time digging into the Tranco list and collected some data which we've shared in this Sheets document and now have a better understanding of the differences and also a proposal of how we could still use it.

Some points to note:

Digging into the data we see that for August 2020 (which is the data set we are using for the 2020 edition of the Web Almanac) we find the following:

Pretty similar, so that sounds promising!

People like the top million and it's often used in other research, so let's look at that first. Personally I don't think it's necessary to have a complete match - if we want to look at certain stats based on ranking then we should look at the top sites, as you end up with a lot of noise at the other end (plus the cut-off for CrUX is not that openly available or discussed), so the top 1M is a good place to start.

For the Tranco top 1M we have an exact match of 206,256 of our mobile sites and 197,073 of our desktop sites - meaning we only have coverage for about 20% of the Tranco top 1 million, which sounds pretty poor 😢

However, digging into the sites, we see that many of the missing "sites" in Tranco are in fact not sites people would visit but resources (as I hinted at above):

| rank | domain |
| --- | --- |
| 19 | doubleclick.net |
| 20 | googletagmanager.com |
| 24 | youtu.be |
| 35 | goo.gl |
| 41 | fbcdn.net |
| 46 | microsoftonline.com |
| 50 | google-analytics.com |
| 53 | ytimg.com |
| 73 | googlevideo.com |
| 84 | googleadservices.com |

So that explains some of why the Tranco sites are missing from the HTTP Archive.

Looking at it the other way around (which HTTP Archive sites are missing from Tranco), we see Tranco only covers about 16% of our sites, which is where we kind of stopped looking before. Finding which sites are missing is a little more complicated this way round, because we cannot order by a ranking, so we end up with a random list of sites/domains. However, we know Tranco is based on domains and the HTTP Archive is based on sites, so looking at the domains instead of an exact host match, we see a much better 58% of our domains are covered:
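To make the host-vs-domain distinction concrete, here is a minimal Python sketch. The `reg_domain` helper is a naive two-label approximation introduced purely for illustration; BigQuery's `NET.REG_DOMAIN` uses the full public suffix list, so real results differ for hosts like `foo.co.uk`:

```python
def reg_domain(host):
    """Naive registrable-domain approximation: keep the last two labels.
    (Illustrative only; NET.REG_DOMAIN consults the public suffix list.)"""
    return '.'.join(host.split('.')[-2:])

# Tranco lists bare domains; HTTP Archive crawls full hosts.
tranco_domains = {'example.com', 'wordpress.com'}
httparchive_hosts = {'www.example.com', 'myblog.wordpress.com'}

# Exact host match: both hosts miss, because Tranco has no www/subdomain entries.
exact = {h for h in httparchive_hosts if h in tranco_domains}

# Domain-level match: strip each host to its registrable domain first.
by_domain = {h for h in httparchive_hosts if reg_domain(h) in tranco_domains}

print(exact)      # set() - no exact matches
print(by_domain)  # both hosts match at the domain level
```

This is why exact host matching looks so much worse than domain matching: Tranco's entries are registrable domains, while every HTTP Archive entry is a concrete host.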

| category | total | pct |
| --- | --- | --- |
| DOMAINS_EXISTS | 3,256,670 | 58.22% |
| DOMAINS_DO_NOT_EXISTS | 2,336,972 | 41.78% |
| Totals | 5,593,642 | 100.00% |

That's still quite low, but given that Tranco includes many "sites" we don't, and that we have roughly the same number of sites, it's not surprising our long tail includes sites that Tranco doesn't.

What started all this discussion and investigation was that @tomvangoethem had come up with something a bit special, because for the Security chapter he wanted to know whether (as presumed) popular sites deploy these features more often, and what the drop-off is. So he looked at the Tranco top 1 million and attempted to match each domain to the HTTP Archive: if we have a crawl for the domain (e.g. https://example.com is in our site list), use that; if not, try the www version (https://www.example.com); and if that fails, pick a site at random on that domain from our dataset. This logic is basically implemented with this column in a subquery (the full query is given at the bottom):

ARRAY_AGG(NET.HOST(urlShort) ORDER BY IF(NET.HOST(urlShort) = NET.REG_DOMAIN(urlShort), 0, IF(NET.HOST(urlShort) = CONCAT('www.', NET.REG_DOMAIN(urlShort)), 1, 1+RAND())))[OFFSET(0)] AS host

I was a little concerned with the random factor (would it produce different results each time?), so we also came up with a version looking at only base or www matches.
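Both variants can be sketched in Python. This `pick_host` helper is hypothetical (not part of the actual pipeline) and just mirrors the SQL's ORDER BY keys, with an `allow_random` flag for the random-fallback version:

```python
import random

def pick_host(hosts, domain, allow_random=False):
    """Mirror the SQL ARRAY_AGG ... ORDER BY keys:
    bare domain -> 0, www variant -> 1, anything else -> 1 + RAND()."""
    def sort_key(host):
        if host == domain:
            return 0
        if host == 'www.' + domain:
            return 1
        return 1 + random.random()  # ties broken randomly, as in the SQL

    best = sorted(hosts, key=sort_key)[0]
    if not allow_random and best not in (domain, 'www.' + domain):
        return None  # deterministic variant: base or www matches only
    return best

hosts = ['shop.example.com', 'www.example.com', 'blog.example.com']
print(pick_host(hosts, 'example.com'))  # www.example.com
```

The deterministic variant simply refuses to answer when neither the base nor the www host was crawled, which is what trades coverage for repeatability.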

That leads to the following:

| Tranco top 1 million in HTTP Archive | mobile | desktop |
| --- | --- | --- |
| Exact match of host | 206,256 | 197,073 |
| Base host or www host | 487,488 | 509,230 |
| Base host or www host or random | 518,979 | 537,306 |
| Exact match of host (% of 1M) | 21% | 20% |
| Base host or www host (% of 1M) | 49% | 51% |
| Base host or www host or random (% of 1M) | 52% | 54% |
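As a quick sanity check, the percentage rows are just the counts divided by the million (plain Python arithmetic on the numbers above; the `matches` dict is only a convenience for this check):

```python
top_1m = 1_000_000
matches = {
    'exact':       {'mobile': 206_256, 'desktop': 197_073},
    'base_or_www': {'mobile': 487_488, 'desktop': 509_230},
    'plus_random': {'mobile': 518_979, 'desktop': 537_306},
}
for strategy, counts in matches.items():
    for client, n in counts.items():
        # .0% rounds to whole percentage points, matching the table
        print(f'{strategy:12s} {client:8s} {n / top_1m:.0%}')
```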

So now we're up to about a 50% match of the Tranco top 1 million - and above that even if we use the random logic for further matching (though as I say, I'm concerned that will lead to too much variability). Now while 50% is not the 100% we'd ideally have, it is still a lot of data, likely to lead to statistically significant insights, provided we can show it's reasonably representative and not missing any large segment compared to the rest of our dataset.

It's important to note that this is neither the Tranco top 1M, nor the HTTP Archive ~6M but is the best match we can get between the two - the TrancoHTTPArchive top 500k if you will 😀 Anyway, this could form the basis of a ranking query for those that wanted to look at insights in the top sites.

It is also important to realise that doing this will exclude subdomains. So while we may have many xxx.wordpress.com domains in the HTTP Archive dataset (25,367 desktop sites apparently!), we will only have one in this merged dataset (https://wordpress.com/).

Still, for many shared domains like that this may be OK, since the sites may have similar features (e.g. for the Security chapter, wordpress.com-hosted sites may not have much choice in enabling or disabling security features). Also, arguably, many of those shared-hosting sites will rank lower and so are over-represented in the unranked HTTP Archive dataset. For other chapters this squashing down may not be so useful (for the SEO chapter, for example, all wordpress.com sites will be radically different, so this proposal may be less useful).

However, on a similar note, large corporations or sites with many subdomains in the HTTP Archive dataset will also be reduced to one top-level site. For example, there would only be one google.com entry instead of the 388 entries we have in the full HTTP Archive dataset (maps.google.com, analytics.google.com... etc.), and those sites may all be radically different, so https://www.google.com may not be representative.

Anyway, we also dug further into the data to see if any large subdomains in the HTTP Archive dataset were missing from the Tranco list, which might skew results, but it doesn't look like it - there are some country-level subdomains with a lot of differences, but those are also well represented, so it looks like they are certainly not being ignored completely - which was one of my concerns.

We also looked at the top 1M in "bins" of 10,000 sites to see if there was good coverage - or would we have a lot of coverage in the first "bin" but patchy coverage as we dropped lower down the top 1M? That doesn't look to be the case, and while coverage does drop as you go further down the list, we still have a good level right the way down to bin 100. The first 10 bins are shown below, but see the sheet for the full set of 100.

| bin | desktop | mobile |
| --- | --- | --- |
| 1 | 89.67% | 85.01% |
| 2 | 81.14% | 76.81% |
| 3 | 73.75% | 69.67% |
| 4 | 72.07% | 68.21% |
| 5 | 69.68% | 65.43% |
| 6 | 63.39% | 59.25% |
| 7 | 65.48% | 62.09% |
| 8 | 57.57% | 55.21% |
| 9 | 65.54% | 62.60% |
| 10 | 66.20% | 63.60% |
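The binning itself is just `CEIL(rank / 10000)`, as in the query at the bottom. A small Python sketch of the same grouping, using synthetic match data (the real `matched_ranks` would come from the join query; the coverage curve here is invented purely for illustration):

```python
import random
from collections import Counter

# For positive integer ranks, (rank - 1) // BIN_SIZE + 1 is the same as
# CEIL(rank / BIN_SIZE), mirroring the SQL's bin column.
BIN_SIZE = 10_000
random.seed(42)

# Synthetic stand-in: pretend coverage thins from ~90% at the top of the
# list to ~50% at rank 100k (invented numbers, for illustration only).
matched_ranks = [r for r in range(1, 100_001)
                 if random.random() < 0.9 - 0.4 * (r / 100_000)]

per_bin = Counter((r - 1) // BIN_SIZE + 1 for r in matched_ranks)
for b in sorted(per_bin):
    print(f'bin {b:3d}: {per_bin[b] / BIN_SIZE:.1%} coverage')
```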

So this seems to have potential.

There are also other alternatives, such as grabbing the top 1 million sites that overlap and using that, if you really want to hit that million mark, and bucketing them into 10k segments. Or perhaps looking at the top 100, 1,000 or even 10,000 sites, which are highly likely to have a base domain or www domain in our crawl.

So, to sum up this long issue, below is some SQL that @tomvangoethem came up with that allows him to see Security Header adoption across the Tranco top 1 Million sites (though in reality it's only matching 468k of that million on mobile and 495k on desktop). The results confirm suspicions that more popular sites are deploying these security features, and also lets us see the drop as we go down the top 1M/top 500k.

What are people's thoughts on this methodology, and on it as an option for those that want ranking data? I think it needs to be used with caution, and with an understanding that it's not the same as the rest of the dataset, but I do think it's an interesting option for those that want it. As we know, our own dataset is not definitive either, covering only home pages, and people lean on CrUX data for that reason too, so this is another similar alternative.

#standardSQL
SELECT
  req.client,
  headername,
  CEIL(rank / 10000) AS bin,
  COUNT(0) AS total_pages,
  COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername))) AS freq,
  COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername))) / COUNT(0) AS pct
FROM
  `httparchive.almanac.requests` AS req
INNER JOIN (
  SELECT
    client,
    rank,
    ARRAY_AGG(NET.HOST(urlShort) ORDER BY IF(NET.HOST(urlShort) = NET.REG_DOMAIN(urlShort), 0, IF(NET.HOST(urlShort) = CONCAT('www.', NET.REG_DOMAIN(urlShort)), 1, 2)))[OFFSET(0)] AS host
  FROM
    `httparchive.almanac.requests` AS r
  INNER JOIN
    `tranco.daily.daily` AS t
  ON NET.REG_DOMAIN(r.urlShort) = t.domain
  WHERE
    t.date = '2020-08-01' AND
    r.date = '2020-08-01' AND
    NET.HOST(urlShort) = NET.HOST(page) AND
    firstHtml AND
    rank <= 1000000
  GROUP BY
    client,
    NET.REG_DOMAIN(urlShort),
    rank
) AS ranked_hosts ON NET.HOST(urlShort) = host AND ranked_hosts.client = req.client,
UNNEST(['Content-Security-Policy', 'Cross-Origin-Embedder-Policy', 'Cross-Origin-Opener-Policy', 'Cross-Origin-Resource-Policy', 'Expect-CT', 'Feature-Policy', 'Permissions-Policy', 'Referrer-Policy', 'Report-To', 'Strict-Transport-Security', 'X-Content-Type-Options', 'X-Frame-Options', 'X-XSS-Protection']) AS headername
WHERE
  date = '2020-08-01' AND
  NET.HOST(urlShort) = NET.HOST(page)  AND
  (host = NET.REG_DOMAIN(host) OR host = CONCAT('www.', NET.REG_DOMAIN(host))) AND
  firstHtml
GROUP BY
  client,
  headername,
  bin
ORDER BY
  client,
  headername,
  bin
rviscomi commented 3 years ago

I'm not opposed to using ranking data, but all of the options I've seen have really poor coverage. 58% coverage at the domain level is not great, so TBH it doesn't seem like Tranco is much better. Also, I would ideally like to see us vet a data source before analysis starts, to make it equally available to all chapters.

tunetheweb commented 3 years ago

Why do you feel we need full coverage?

We have a huge dataset in the HTTP Archive (6.5 million sites), but anything after the first X% of ranking is going to be very, very variable and difficult to depend on in terms of ranking. In fact I even think the top 1 million is too much for that. So it's really only the top end that is interesting, in which case 58% is more than sufficient IMHO - providing we can have confidence it's representative of our dataset, which is what this post is attempting to do.

So I don't think we should use ranking for most of our queries, and certainly not after you get past the first X% or X thousand sites - whatever that X might be. We've basically gone with 500,000 in this example, based on the intersection of the Tranco 1M and the HTTP Archive.

I still think having the full 6.5 million view of the HTTP Archive is useful for all of our stats, so I would definitely discourage overuse of ranking, or only looking at a subset of our dataset. However, what I do think the chapters want is the ability to qualify those stats by saying things like "but the uptake of feature Y is much greater in the more popular domains". Having some ranking data for those popular domains is required for that - even if it doesn't include the bottom half of our domains (which would be filtered out anyway when presenting results).

Agree we should vet the source, and ideally we'd get a source based on sites rather than domains, but that doesn't seem to exist. I was pretty cynical about this source based on previous discussions about its coverage (or lack thereof), and also because Tom is involved in Tranco. However, I've come round to it after some digging and think it is useful.

But I'm not wedded to Tranco, and I'd imagine any other dataset would be equally useful to be honest. It would be nice to pick a stat (e.g. the Security Header usage stat we've gone with here) and run it across multiple data sources to see if the results end up roughly equal for the top X sites/domains - which I'd hope they would. Call me hopelessly naive, but for the purposes I'm considering using it for, I think we just need a listing of the most popular domains/sites which is roughly accurate, rather than a definitive ranking of all 6.5 million of our sites in exact order. Perfection is the enemy of good and all that - as can be seen by the fact this has bounced around for a number of years without much progress.

tomvangoethem commented 3 years ago

Is there any particular reason why we need higher coverage? I would say that statistics-wise, 58% is a pretty high coverage. E.g. consider that you want to analyse a phenomenon that occurs in 10% of your population: when taking a random sample of 58% of that population, you will find the same results with a small margin of error. I wrote a quick simulation of this (a statistician would probably use a power analysis for this):

import random
from math import floor

# Simulate sampling 58% of a 1M population where 10% show the phenomenon.
population_size = 1000000
sample_size_percent = 58
phenomenon_prevalence_percent = 10
num_with_phenomenon = floor(population_size * phenomenon_prevalence_percent / 100)
sample_size = floor(population_size * sample_size_percent / 100)

# 1 = site exhibits the phenomenon, 0 = it does not.
x = [1] * num_with_phenomenon + [0] * (population_size - num_with_phenomenon)
assert len(x) == population_size
prevalence_all = sum(x) / population_size

# Draw 10 independent random samples and compare against the true prevalence.
for _ in range(10):
    random.shuffle(x)
    prevalence_sample = sum(x[:sample_size]) / sample_size
    relative_error = abs(prevalence_sample - prevalence_all) / prevalence_all
    print('All: %.3f%%\t- Sample: %.3f%%\t- Relative error: %.2f%%' % (prevalence_all * 100, prevalence_sample * 100, relative_error * 100))

Example output:

All: 10.000%    - Sample: 9.991%    - Relative error: 0.09%
All: 10.000%    - Sample: 10.002%   - Relative error: 0.02%
All: 10.000%    - Sample: 10.017%   - Relative error: 0.17%
All: 10.000%    - Sample: 9.977%    - Relative error: 0.23%
All: 10.000%    - Sample: 10.011%   - Relative error: 0.11%
All: 10.000%    - Sample: 9.997%    - Relative error: 0.03%
All: 10.000%    - Sample: 10.024%   - Relative error: 0.24%
All: 10.000%    - Sample: 9.960%    - Relative error: 0.40%
All: 10.000%    - Sample: 9.984%    - Relative error: 0.16%
All: 10.000%    - Sample: 10.023%   - Relative error: 0.23%
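The simulated errors can also be cross-checked analytically. Assuming the same unbiased simple random sample, the standard error of a sample proportion with the finite population correction gives roughly the spread seen above:

```python
from math import sqrt

N = 1_000_000   # population size (Tranco top 1M)
n = 580_000     # sample size (58% coverage)
p = 0.10        # assumed true prevalence of the phenomenon

# Standard error of a sample proportion, sampling without replacement,
# with the finite population correction (N - n) / (N - 1).
se = sqrt(p * (1 - p) / n * (N - n) / (N - 1))
print(f'standard error: {se * 100:.3f} percentage points')  # ~0.026
print(f'relative error at one sd: {se / p:.2%}')            # ~0.26%
```

The worst case in the run above (0.40% relative error) is about 1.5 standard deviations, so the simulation is well within what this formula predicts.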

For this run the highest error was 0.04 percentage points (relative error of 0.4%). Of course this is assuming an unbiased random sample, which might not be entirely the case when taking the Tranco+HTTPArchive intersection, but I believe it still gives a representative estimate. Given that the HTTP Archive dataset is based on just the landing page, this introduces a certain error, which I would estimate as significantly higher than this sampling error. There's been some recent research on this; to quote from the paper's conclusion:

We compared the landing page with a set of internal pages for 1000 web sites, constituting Hispar H1K, and discovered that landing pages differ substantially from internal pages. These differences have implications for virtually all web-perf. studies that examined simply the landing pages.

FWIW, there are also other factors that introduce a measurement error (compared to human browsing behaviour); some factors (e.g. IP address, location, OS) are explored in this paper.

In terms of vetting Tranco: given that the Tranco list is based on 3 other lists (Alexa, Majestic, Umbrella), each of which is not perfect, the resulting list will also not be perfect. Nevertheless, by how Tranco combines the lists, it provides some beneficial properties such as better stability over time. We did some extensive evaluations in the original paper as well as in a follow-up longitudinal analysis. One of the design choices of Tranco (in hindsight perhaps not such a good one) was to include all domains, not just web sites (as some prior research relied on popularity rankings for non-web studies). Therefore it is to be expected that there's only a limited overlap with the CrUX list. It also makes filtering on the CrUX list a sensible thing to do (we actually provide this as a filter for customized lists on the Tranco site). As such, an alternative option could be to take the 1M CrUX sites ordered by the Tranco ranking, which would give 100% coverage for that 1M list (example query below). Personally, I wouldn't go for this approach though; taking the Tranco top 1M is more in line with what other studies use.

SELECT
  domain,
  host,
  tranco_rank,
  ROW_NUMBER() OVER () AS filtered_rank
FROM (
  SELECT
    domain,
    ARRAY_AGG(NET.HOST(urlShort) ORDER BY IF(NET.HOST(urlShort) = NET.REG_DOMAIN(urlShort), 0, IF(NET.HOST(urlShort) = CONCAT('www.', NET.REG_DOMAIN(urlShort)), 1, 1+RAND())))[OFFSET(0)] AS host,
    tranco.rank AS tranco_rank,
  FROM
    `tranco.daily.daily` AS tranco
  INNER JOIN
    `httparchive.summary_pages.2020_08_01_desktop`
  ON
    NET.REG_DOMAIN(url) = domain
  WHERE
    date = "2020-08-01"
  GROUP BY
    domain,
    tranco.rank
  ORDER BY
    tranco_rank
  LIMIT
    1000000)

There are only a limited number of sources with ranking information; there is a general agreement (based on our analyses and academic peer-review) that Tranco improves the features of the individual lists it is composed of, so it wouldn't make much sense to use one of these lists in favour of Tranco. The only other list that I'm aware of that could be useful, is the Trexa top 100k list, which is a combination of Tranco + Alexa; so in some sense the regular Tranco list but where the Alexa list has a higher weight. Their analysis shows a large overlap with Tranco (~80% for the top 10k). Personally, I think that Tranco-filtered-on-CrUX would probably yield better/more representative results with regards to human web traffic (CrUX would filter out all the non-website domains).

tunetheweb commented 3 years ago

Closing. Will re-review for 2021.