HTTPArchive / httparchive.org

The HTTP Archive website hosted on App Engine
https://httparchive.org
Apache License 2.0

Investigate scope of URLs that redirect #197

Closed rviscomi closed 4 years ago

rviscomi commented 4 years ago

Context: 1 and 2

Our URLs come from the CrUX dataset based on real Chrome usage. It's possible that when we test these URLs, we're getting redirected because of a lack of authentication, geo-blocking, or other discrepancies from real user expectations. It's also possible that these origins are always supposed to redirect and maybe CrUX is mistakenly assigning the UX data from the canonical origin to the one that redirects.

Analyze the dataset for examples of base page URLs that redirect. It'd be good to understand how many initial URLs in our corpus are redirecting, if there are any patterns (eg lack of authentication), and what can be done to deduplicate results.

rviscomi commented 4 years ago

Here's a look at the top 10 HTTP status codes for desktop/mobile tests' initial request:

SELECT
  client,
  req AS total,
  status.value AS status,
  status.count,
  status.count / req AS pct
FROM (
  SELECT
    _TABLE_SUFFIX AS client,
    APPROX_TOP_COUNT(status, 10) AS status,
    COUNT(0) AS req
  FROM
    `httparchive.summary_requests.2020_02_01_*`
  WHERE
    firstReq
  GROUP BY
    client),
  UNNEST(status) AS status 
ORDER BY
  pct DESC

client total status count pct
desktop 3815604 200 OK 3373097 88.40%
mobile 5091698 200 OK 4293166 84.32%
mobile 5091698 302 Found 490910 9.64%
desktop 3815604 302 Found 218252 5.72%
mobile 5091698 301 Moved Permanently 284433 5.59%
desktop 3815604 301 Moved Permanently 206511 5.41%
mobile 5091698 307 Temporary Redirect 10744 0.21%
desktop 3815604 307 Temporary Redirect 7421 0.19%
desktop 3815604 204 No Content 4983 0.13%
desktop 3815604 303 See Other 4910 0.13%
mobile 5091698 303 See Other 6027 0.12%
mobile 5091698 204 No Content 5844 0.11%
desktop 3815604 308 Permanent Redirect 215 0.01%
mobile 5091698 308 Permanent Redirect 237 0.00%
mobile 5091698 0 236 0.00%
desktop 3815604 0 155 0.00%
desktop 3815604 206 Partial Content 36 0.00%
mobile 5091698 203 Non-Authoritative Information 46 0.00%
mobile 5091698 206 Partial Content 42 0.00%
desktop 3815604 203 Non-Authoritative Information 9 0.00%

I've manually added the name of the status code for clarification.
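
The status names above can also be looked up rather than typed by hand. This is a minimal sketch using Python's standard `http.HTTPStatus`; the helper name `status_label` is hypothetical, and unknown codes such as the `0` rows are passed through without a name.

```python
from http import HTTPStatus


def status_label(code):
    """Return e.g. '302 Found'; fall back to the bare code if it is not a known HTTP status."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes like 0 (no response recorded) have no standard reason phrase.
        return str(code)


for code in (200, 301, 302, 303, 307, 308, 0):
    print(status_label(code))
```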

A few high level observations:


Here's an analysis of the Location header in relation to the URL of the first request. It counts the number of pages that redirect to a Location on the same domain.

SELECT
  _TABLE_SUFFIX AS client,
  status,
  STARTS_WITH(resp_location, '/') OR NET.REG_DOMAIN(resp_location) = NET.REG_DOMAIN(url) AS same_domain_redirect,
  COUNT(0) AS count
FROM
  `httparchive.summary_requests.2020_02_01_*`
WHERE
  firstReq AND
  status IN (301, 302, 307, 308)
GROUP BY
  client,
  status,
  same_domain_redirect
HAVING
  same_domain_redirect IS NOT NULL
ORDER BY
  count DESC

client status same_domain_redirect count total pct
mobile 302 TRUE 453,342 490,910 92.35%
mobile 301 TRUE 248,106 284,433 87.23%
desktop 302 TRUE 191,159 218,252 87.59%
desktop 301 TRUE 181,387 206,511 87.83%
mobile 301 FALSE 31,973 284,433 11.24%
desktop 301 FALSE 22,045 206,511 10.67%
mobile 302 FALSE 15,985 490,910 3.26%
desktop 302 FALSE 13,092 218,252 6.00%
mobile 307 TRUE 10,214 10,744 95.07%
desktop 307 TRUE 6,987 7,421 94.15%
mobile 307 FALSE 507 10,744 4.72%
desktop 307 FALSE 418 7,421 5.63%
mobile 308 TRUE 213 237 89.87%
desktop 308 TRUE 200 215 93.02%
mobile 308 FALSE 22 237 9.28%
desktop 308 FALSE 13 215 6.05%

I've manually copied the "total" column from the previous results and calculated the "pct". So 92% of mobile 302 responses redirect to the same domain.
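
To sanity-check what the `same_domain_redirect` expression in the query above treats as "same domain", here is a rough Python mirror of it. It is only a sketch: `naive_reg_domain` is a hypothetical, crude stand-in for `NET.REG_DOMAIN` that keeps the last two host labels (so it is wrong for suffixes like `co.uk`), and the example URLs are made up.

```python
from urllib.parse import urlsplit


def naive_reg_domain(url):
    """Crude stand-in for NET.REG_DOMAIN: last two labels of the hostname, or None."""
    host = urlsplit(url).hostname
    if not host:
        return None
    return ".".join(host.split(".")[-2:])


def same_domain_redirect(url, location):
    """Mirror of: STARTS_WITH(resp_location, '/') OR REG_DOMAIN(location) = REG_DOMAIN(url)."""
    if location is None:
        return None  # no Location header: unclassified, like NULL in the query
    if location.startswith("/"):
        return True  # relative redirect on the same host
    loc_domain = naive_reg_domain(location)
    url_domain = naive_reg_domain(url)
    if loc_domain is None or url_domain is None:
        return None
    return loc_domain == url_domain


print(same_domain_redirect("http://example.com/", "/login"))                   # True
print(same_domain_redirect("http://example.com/", "https://www.example.com/"))  # True
print(same_domain_redirect("http://example.com/", "https://medium.com/x"))      # False
print(same_domain_redirect("http://example.com/", "//medium.com/x"))            # True
```

The last case is worth noting: a protocol-relative Location such as `//medium.com/x` also starts with `/`, so a prefix check like this counts it as a same-domain redirect even though it points elsewhere.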

Of the 284,433 mobile 301 (permanent) redirects, 87% point to the same domain. Desktop is similar.

In total, 90% of all first-request redirects point to the same domain. There may be a redirect chain that ends up on another domain, which isn't accounted for here. Summing the FALSE rows above, roughly 36K desktop pages and 48K mobile pages redirect to a different domain.
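
The manual "pct" column and the aggregate figures can be reproduced from the two result tables. The sketch below hardcodes the counts shown above; the names `ROWS`, `TOTALS`, and `pct` are just for illustration. Note that the TRUE and FALSE counts do not add up to the "total" for each status, presumably because rows whose classification was NULL were dropped by the HAVING clause, so the totals from the first table are used as the denominator.

```python
# (client, status, same_domain_redirect, count) copied from the results above
ROWS = [
    ("mobile", 302, True, 453_342), ("mobile", 301, True, 248_106),
    ("desktop", 302, True, 191_159), ("desktop", 301, True, 181_387),
    ("mobile", 301, False, 31_973), ("desktop", 301, False, 22_045),
    ("mobile", 302, False, 15_985), ("desktop", 302, False, 13_092),
    ("mobile", 307, True, 10_214), ("desktop", 307, True, 6_987),
    ("mobile", 307, False, 507), ("desktop", 307, False, 418),
    ("mobile", 308, True, 213), ("desktop", 308, True, 200),
    ("mobile", 308, False, 22), ("desktop", 308, False, 13),
]

# Totals per (client, status) copied from the first results table
TOTALS = {
    ("mobile", 302): 490_910, ("mobile", 301): 284_433,
    ("desktop", 302): 218_252, ("desktop", 301): 206_511,
    ("mobile", 307): 10_744, ("desktop", 307): 7_421,
    ("mobile", 308): 237, ("desktop", 308): 215,
}


def pct(client, status, same_domain):
    """Share of one (client, status) group's redirects with the given classification."""
    count = next(c for cl, st, sd, c in ROWS if (cl, st, sd) == (client, status, same_domain))
    return 100 * count / TOTALS[(client, status)]


# Share of all first-request redirects classified as same-domain
same_domain_total = sum(c for _, _, sd, c in ROWS if sd)
overall_same_domain_pct = 100 * same_domain_total / sum(TOTALS.values())

# Redirects to a different domain, per client
cross_domain = {
    client: sum(c for cl, _, sd, c in ROWS if cl == client and not sd)
    for client in ("desktop", "mobile")
}

print(f"{pct('mobile', 302, True):.2f}%")
print(f"{overall_same_domain_pct:.1f}%")
print(cross_domain)
```

With these inputs the cross-domain sums come out to 35,568 for desktop and 48,487 for mobile, and the overall same-domain share is just under 90%.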


Finally, here's a look at the domains that get redirected to:

SELECT
  client,
  status,
  redirect_domain,
  COUNT(0) AS count
FROM (
  SELECT
    _TABLE_SUFFIX AS client,
    status,
    NET.REG_DOMAIN(resp_location) AS redirect_domain,
    STARTS_WITH(resp_location, '/') OR NET.REG_DOMAIN(resp_location) = NET.REG_DOMAIN(url) AS same_domain_redirect
  FROM
    `httparchive.summary_requests.2020_02_01_*`
  WHERE
    firstReq AND
    status IN (301, 302, 307, 308))
WHERE
  NOT same_domain_redirect
GROUP BY
  client,
  status,
  redirect_domain
ORDER BY
  count DESC
LIMIT
  50

client status redirect_domain count
desktop 302 medium.com 1,221
mobile 302 medium.com 1,163
mobile 302 indapass.hu 764
mobile 301 jimdofree.com 755
mobile 301 listcrawler.eu 688
mobile 301 linkfire.com 613
desktop 302 indapass.hu 591
desktop 301 linkfire.com 520
mobile 302 google.com 467
desktop 302 google.com 386
mobile 302 elsevierhealth.com 302
desktop 302 elsevierhealth.com 301
mobile 302 clickfunnels.com 300
desktop 301 jimdofree.com 277
desktop 302 clickfunnels.com 268
desktop 302 stremanp.com 209
mobile 302 stremanp.com 208
mobile 302 roberat.com 157
mobile 301 google.com 149
mobile 307 vchecks.me 142
desktop 302 gitbook.com 131
mobile 301 tripadvisor.com 100
desktop 302 blogger.com 99
mobile 301 pornvida.com 95
mobile 302 w88in.com 93
mobile 302 note.com 90
desktop 302 note.com 88
desktop 302 onelogin.com 87
desktop 301 qodeinteractive.com 82
desktop 301 google.com 81
desktop 301 listcrawler.eu 74
mobile 302 engagingnetworks.net 73
mobile 302 timeweb.ru 71
desktop 302 engagingnetworks.net 70
desktop 301 unblockit.red 69
desktop 302 imodules.com 61
desktop 301 tripadvisor.com 60
mobile 302 vueher.com 59
mobile 302 booking.com 58
mobile 302 imodules.com 58
desktop 302 st-hatena.com 58
mobile 302 gitbook.com 55
mobile 301 unblockit.red 55
desktop 302 vueher.com 53
mobile 302 facebook.com 50
mobile 301 bongacams5.com 48
desktop 307 vchecks.me 47
desktop 302 booking.com 47
mobile 301 surveygizmo.com 47
desktop 301 surveygizmo.com 46

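Medium gets over 1K 302 (temporary) redirects on both desktop and mobile. Going by the table above, the top 301 (permanent) redirect locations are jimdofree.com, listcrawler.eu, and linkfire.com.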

There are also 100 mobile 301 redirects to tripadvisor.com. I looked into these, and firstReq seems to be misattributed: it is set on a request that is not actually the first one. The others seem OK.


So to sum up, roughly 12% (desktop) to 16% (mobile) of tests' initial requests get a status other than 200 OK. 301 Moved Permanently accounts for about 5% of initial requests, and only about 10% of those 301s redirect to another domain. The domains that do get redirected to most often seem to be aggregators that host lots of content.
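
For anyone who wants to check the summary numbers, here is a small sketch of the arithmetic. The counts are copied from the first results table (totals, 200 OK, 301) and from the FALSE 301 rows of the second table; the dictionary and function names are made up for illustration.

```python
# Counts copied from the first results table
FIRST_REQUESTS = {
    "desktop": {"total": 3_815_604, "ok": 3_373_097, "moved_permanently": 206_511},
    "mobile": {"total": 5_091_698, "ok": 4_293_166, "moved_permanently": 284_433},
}

# 301 redirects classified as pointing to a different domain (second results table)
CROSS_DOMAIN_301 = {"desktop": 22_045, "mobile": 31_973}


def summarize(client):
    """Return the three summary percentages for one client."""
    row = FIRST_REQUESTS[client]
    return {
        "non_200_pct": 100 * (row["total"] - row["ok"]) / row["total"],
        "301_pct": 100 * row["moved_permanently"] / row["total"],
        "cross_domain_pct_of_301": 100 * CROSS_DOMAIN_301[client] / row["moved_permanently"],
    }


for client in ("desktop", "mobile"):
    summary = summarize(client)
    print(client, {name: round(value, 1) for name, value in summary.items()})
```

Multiplying the last two percentages together suggests that cross-domain 301s make up well under 1% of all initial requests for either client.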