HTTPArchive / custom-metrics

Custom metrics to use with WebPageTest agents
Apache License 2.0

Fail the test if the test redirected to a cross-origin page #102

Closed pmeenan closed 9 months ago

pmeenan commented 9 months ago

This uses a new agent feature that lets a custom metric named test_result override the test result. Here it is used to fail any test that ends up on a different origin from the one where the test started.

Sample test of https://www.google.com/ (successful) - here
Sample test of http://www.google.com/ (failed) - here
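To make the origin comparison concrete, here is a minimal Python sketch of the check described above. It is only an illustration: the real custom metric runs inside the WebPageTest agent, and the helper names `origin` and `is_cross_origin` are hypothetical, not part of this PR. It assumes an origin is the (scheme, host, port) triple with default ports filled in.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Default ports, so that https://example.com/ and https://example.com:443/ compare equal.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, str, Optional[int]]:
    """Return the (scheme, host, port) triple for a URL (hypothetical helper)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def is_cross_origin(start_url: str, final_url: str) -> bool:
    """True when the test ended on a different origin from where it started."""
    return origin(start_url) != origin(final_url)


# http -> https on the same host is still a different origin.
print(is_cross_origin("http://www.google.com/", "https://www.google.com/"))
# A path change on the same origin is not.
print(is_cross_origin("https://www.google.com/", "https://www.google.com/search"))
```

Under this definition the http://www.google.com/ sample counts as cross-origin because the scheme changes, which matches the failed result above.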

[Image: screenshot taken 2023-12-04 at 4:54:35 PM]

rviscomi commented 9 months ago

Are all cross-origin redirects necessarily bad? For example, if the only origin we have for example.com is http and during testing it redirects to https, I think it's worth keeping it. However, if we have both http and https versions, and one redirects to the other, I think it's better to dedupe and drop the one that redirects. Similarly, if foo.com redirects to bar.com, that might be ok if bar.com isn't already in the corpus.
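One possible reading of the rule proposed above, as a rough Python sketch: keep a redirected test only when the redirect target's origin is not already in the corpus. The function names and the idea of passing the corpus as a set of origin strings are assumptions made for illustration, not something the pipeline is known to do.

```python
from typing import Set
from urllib.parse import urlsplit


def origin_of(url: str) -> str:
    """Return 'scheme://host[:port]' for a URL (hypothetical helper)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}".lower()


def keep_redirected_test(start_url: str, final_url: str, corpus_origins: Set[str]) -> bool:
    """Keep same-origin results, and cross-origin ones only if the target is new."""
    start, final = origin_of(start_url), origin_of(final_url)
    if start == final:
        return True  # no cross-origin redirect happened
    # Cross-origin: drop it if the target would be tested anyway (dedupe).
    return final not in corpus_origins


# Only the http origin is in the corpus, so the redirected test is kept.
print(keep_redirected_test("http://example.com/", "https://example.com/",
                           {"http://example.com"}))
# Both versions are in the corpus, so the redirecting one is dropped.
print(keep_redirected_test("http://example.com/", "https://example.com/",
                           {"http://example.com", "https://example.com"}))
```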

pmeenan commented 9 months ago

We would already have https://example.com/ in the corpus if it got sufficient traffic, and the https page's construction and stats aren't representative of the non-https case. Are there any cases you can think of where there IS a redirect but we don't already have the redirect target in the corpus?

tunetheweb commented 9 months ago

Yeah, I think either it should have enough traffic to be in the CrUX dataset independently anyway, or it's a redirect we should be ignoring.

The only example I can think of is if they migrate a site between the end of the month and the crawl starting. But even then that should be a one-off hit, and the site would be back in again next month with the correct URL.

rviscomi commented 9 months ago

Yeah, I think you're both right that the number of cases where this would matter is insignificantly small.

-- Secondary pages whose origin differs from their root page's origin,
-- used as a heuristic for cross-origin redirects.
WITH cross_origin AS (
  SELECT
    httparchive.fn.GET_ORIGIN(page) AS origin
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-11-01' AND
    client = 'mobile' AND
    NOT is_root_page AND
    httparchive.fn.GET_ORIGIN(root_page) != httparchive.fn.GET_ORIGIN(page)
),

-- Origins of all root pages in the same crawl.
root_pages AS (
  SELECT
    httparchive.fn.GET_ORIGIN(root_page) AS origin
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-11-01' AND
    client = 'mobile' AND
    is_root_page
)

-- Share of cross-origin targets that already exist as root pages.
SELECT
  COUNTIF(root_pages.origin IS NOT NULL) AS origin_exists,
  COUNT(0) AS total,
  COUNTIF(root_pages.origin IS NOT NULL) / COUNT(0) AS pct
FROM
  cross_origin
LEFT JOIN
  root_pages
USING
  (origin)

This shows that there are 55,113 cross-origin redirects in the mobile dataset, using cross-origin secondary pages as a heuristic. Of those, 40,415 (73%) redirect to an origin that already exists in the corpus (google.com, etc.).
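For anyone who wants to sanity-check the numbers, the counting logic of the query can be mirrored in a few lines of Python. This is a sketch on made-up toy data, not the BigQuery tables. The function name is hypothetical, and it assumes each origin appears at most once among the root pages, so the left join does not duplicate rows.

```python
from typing import Iterable, List, Tuple


def share_already_in_corpus(cross_origin_origins: List[str],
                            root_origins: Iterable[str]) -> Tuple[int, int, float]:
    """Return (origin_exists, total, pct), mirroring the query's three output columns."""
    root_set = set(root_origins)
    total = len(cross_origin_origins)
    exists = sum(1 for o in cross_origin_origins if o in root_set)
    pct = exists / total if total else 0.0
    return exists, total, pct


# Toy data (made up): two of the four cross-origin targets are also root pages.
cross = ["https://www.google.com", "https://bar.example",
         "https://www.google.com", "https://baz.example"]
roots = ["https://www.google.com", "https://foo.example"]
print(share_already_in_corpus(cross, roots))

# The percentage reported in the thread, recomputed from the two counts.
reported_pct = 40415 / 55113
print(round(reported_pct, 2))
```

The last line recomputes the 73% figure from the two counts quoted above.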

github-actions[bot] commented 9 months ago
Custom metrics for https://almanac.httparchive.org/en/2022/ WPT test run results: http://webpagetest.httparchive.org/results.php?test=231205_K8_1