Closed by pmeenan 9 months ago
Are all cross-origin redirects necessarily bad? For example, if the only origin we have for example.com is http and during testing it redirects to https, I think it's worth keeping it. However, if we have both http and https versions, and one redirects to the other, I think it's better to dedupe and drop the one that redirects. Similarly, if foo.com redirects to bar.com, that might be ok if bar.com isn't already in the corpus.
If https://example.com/ got sufficient traffic, we would already have it in the corpus, and the https page's construction and stats aren't representative of the non-https case anyway. Are there any cases you can think of where there IS a redirect but we don't already have the redirect target in the corpus?
Yeah, I think it should either have enough traffic to be in the CrUX dataset independently anyway, or it's a redirect we should be ignoring.
The only example I can think of is if they migrate a site between the end of the month and the start of the crawl. But even then that should be a one-off hit, and the site would be back in next month under the correct URL.
Yeah, I think you're both right that the number of cases where this would matter is insignificantly small.
WITH cross_origin AS (
  SELECT
    httparchive.fn.GET_ORIGIN(page) AS origin
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-11-01' AND
    client = 'mobile' AND
    NOT is_root_page AND
    httparchive.fn.GET_ORIGIN(root_page) != httparchive.fn.GET_ORIGIN(page)
),

root_pages AS (
  SELECT
    httparchive.fn.GET_ORIGIN(root_page) AS origin
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-11-01' AND
    client = 'mobile' AND
    is_root_page
)

SELECT
  COUNTIF(root_pages.origin IS NOT NULL) AS origin_exists,
  COUNT(0) AS total,
  COUNTIF(root_pages.origin IS NOT NULL) / COUNT(0) AS pct
FROM
  cross_origin
LEFT JOIN
  root_pages
USING
  (origin)
This shows that there are 55,113 cross-origin redirects in the mobile dataset, using secondary pages whose origin differs from their root page's origin as a heuristic. Of those, 40,415 (73%) redirect to an origin that already exists in the corpus as a root page (google.com, etc.).
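To make the heuristic easier to follow, here is a rough Python sketch of the same counting logic on a handful of made-up rows. The `get_origin` helper is only a stand-in for `httparchive.fn.GET_ORIGIN` (its exact behavior is assumed, not checked), and `count_existing_targets` and the example URLs are hypothetical names used for illustration.

```python
from urllib.parse import urlsplit


def get_origin(url):
    """Assumed stand-in for httparchive.fn.GET_ORIGIN: scheme://host[:port]."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def count_existing_targets(rows):
    """rows are (page, root_page, is_root_page) tuples, like the table columns."""
    # Secondary pages on a different origin than their root page
    # (the proxy for "the root page redirected cross-origin").
    cross_origin = [
        get_origin(page)
        for page, root_page, is_root in rows
        if not is_root and get_origin(root_page) != get_origin(page)
    ]
    # Origins that are in the corpus in their own right.
    # A set is used here, assuming one root page per origin.
    root_origins = {
        get_origin(root_page) for _, root_page, is_root in rows if is_root
    }
    origin_exists = sum(origin in root_origins for origin in cross_origin)
    total = len(cross_origin)
    pct = origin_exists / total if total else 0.0
    return origin_exists, total, pct


# Made-up data: b.example redirects to a.example (already in the corpus),
# c.example redirects to d.example (not in the corpus).
rows = [
    ("https://a.example/", "https://a.example/", True),
    ("https://a.example/about", "https://a.example/", False),
    ("https://b.example/", "https://b.example/", True),
    ("https://a.example/login", "https://b.example/", False),
    ("https://c.example/", "https://c.example/", True),
    ("https://d.example/home", "https://c.example/", False),
]

origin_exists, total, pct = count_existing_targets(rows)
```

With these toy rows, two secondary pages count as cross-origin and one of them lands on an origin that is also a root page, which is the same shape as the 40,415 of 55,113 result above.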
This uses a new agent feature where the test result can be overridden by a custom metric named test_result to fail any test that ends up on a different origin from where the test started.
Sample test of https://www.google.com/ (successful)
Sample test of http://www.google.com/ (failed)
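The custom metric itself isn't shown in this thread, so the following is only a minimal Python sketch of the origin comparison it is described as performing, not the agent's actual code. The function names `origin_of` and `is_cross_origin_redirect` are hypothetical, and treating a missing port as the scheme's default port is an assumption about how origins should be compared.

```python
from urllib.parse import urlsplit

# Default ports, so that https://example.com/ and https://example.com:443/
# compare as the same origin (an assumption of this sketch).
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url):
    """Return the (scheme, host, port) triple that identifies a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def is_cross_origin_redirect(start_url, final_url):
    """True if the test ended on a different origin than it started on."""
    return origin_of(start_url) != origin_of(final_url)


# A scheme change alone is enough to change the origin.
http_to_https = is_cross_origin_redirect(
    "http://www.google.com/", "https://www.google.com/"
)
# Same origin, different path and query: not a cross-origin redirect.
same_origin = is_cross_origin_redirect(
    "https://www.google.com/", "https://www.google.com/search?q=test"
)
```

Under this comparison an http to https redirect counts as cross-origin, which would be consistent with the http://www.google.com/ sample test being marked as failed while the https one succeeds.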