HTTPArchive / legacy.httparchive.org


Add per-metric error handling and fix other issues #230

Closed: GJFR closed this 3 years ago

GJFR commented 3 years ago

Unfortunately, the well-known.js custom metric did not do so well in the previous crawls due to errors: if one parseResponse call fails, the whole metric fails. To solve this, I have separated the error handling for each call.

The metric would also fail when the site's policy did not allow loading another page (e.g. because of CORS). This is now also handled for each parseResponse call. Since it is not possible to resolve the underlying cause of these errors, I just indicate that an error occurred while fetching a URL.
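Roughly, the pattern now looks like this (a simplified sketch, not the exact metric code; collectEndpoints and the urls list are just placeholder names):

// Simplified sketch of the per-call error handling. parseResponse(url) is the
// metric's existing helper and is not defined here; collectEndpoints and urls
// are placeholder names.
async function collectEndpoints(urls) {
  const results = {};
  for (const url of urls) {
    try {
      results[url] = await parseResponse(url);
    } catch (e) {
      // The underlying cause (CORS, timeout, ...) cannot be recovered here,
      // so only record that fetching this URL failed.
      results[url] = { error: true };
    }
  }
  return results;
}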

Lastly, I also fixed a capitalisation issue for string comparisons in /robots.txt.

I also see a lot of 'The user aborted a request.' errors in our crawl data. Does this have to do with fetchWithTimeout?

function fetchWithTimeout(url) {
  // Abort the request if it has not completed within 5 seconds.
  var controller = new AbortController();
  setTimeout(() => {controller.abort()}, 5000);
  return fetch(url, {signal: controller.signal});
}
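If so, the errors could at least be labelled by checking the error name (just a sketch, not part of this PR; fetchLabelled is a made-up wrapper):

// Sketch: distinguish timeout aborts from other fetch failures.
async function fetchLabelled(url) {
  try {
    return await fetchWithTimeout(url);
  } catch (e) {
    // When the AbortController fires, fetch rejects with a DOMException named
    // 'AbortError' ('The user aborted a request.' in Chromium).
    return { error: e.name === 'AbortError' ? 'timeout' : 'fetch_failed' };
  }
}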

WPT tests (focus on the errors in previous crawls)

GJFR commented 3 years ago

Would it be possible to merge this in time for the September crawl?

jroakes commented 3 years ago

Hi @GJFR. In a discussion with @VictorLeP on Slack, he pointed me to this PR. I have updated a PR for robots.txt extraction to capture some data for next year's Web Almanac.

I have finalized the PR here and included test cases: https://github.com/HTTPArchive/legacy.httparchive.org/pull/236

The current robots.txt code in the well-known extraction script has two issues:

1. It does not handle rule groups with multiple consecutive user-agent lines:

# Works for this
user-agent: Googlebot
disallow: /login/

# Does not work for this.  Will attribute the 'login' match to only Bingbot.
user-agent: Googlebot
user-agent: Bingbot
disallow: /login/

2. It will match, for example, /book-cataloging/ and genuine-authentic-nike-shoes as secure routes.

I could update my robots.txt script to have the following output. I have already addressed #1 in my PR. I think I can make #2 better, but probably not perfect.

    "by_useragent": {
      "*": {
        "allow": 1,
        "disallow": 45,
        "crawl_delay": 0,
        "noindex": 0,
        "well_known_disallows: 5,
        "other": 0
      },
      "baiduspider": {
        "allow": 0,
        "disallow": 163,
        "crawl_delay": 0,
        "noindex": 0,
        "well_known_disallows: 2,
        "other": 0
      },
      ...
      }
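For illustration, the aggregation could look roughly like this (a sketch only; parsedRules and its { userAgent, directive, value } shape are assumptions, not the actual PR code):

// Sketch: tally rules per user-agent from a parsed rule list.
function aggregateByUserAgent(parsedRules, wellKnownKeywords) {
  const byUserAgent = {};
  for (const rule of parsedRules) {
    const ua = rule.userAgent.toLowerCase();
    byUserAgent[ua] = byUserAgent[ua] || {
      allow: 0, disallow: 0, crawl_delay: 0,
      noindex: 0, well_known_disallows: 0, other: 0
    };
    const counts = byUserAgent[ua];
    switch (rule.directive) {
      case 'allow':
        counts.allow++;
        break;
      case 'disallow':
        counts.disallow++;
        // Count disallows whose path contains a security-related keyword.
        if (wellKnownKeywords.some(kw => rule.value.toLowerCase().includes(kw))) {
          counts.well_known_disallows++;
        }
        break;
      case 'crawl-delay':
        counts.crawl_delay++;
        break;
      case 'noindex':
        counts.noindex++;
        break;
      default:
        counts.other++;
    }
  }
  return byUserAgent;
}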

I wanted to get a POV from the Security Team on whether we should consolidate the two scripts or edit the well-known metric to correct some things.

GJFR commented 3 years ago

Hi @jroakes. Thanks for the heads-up. As for the issues you mentioned:

  1. This issue should indeed be corrected; I was not aware that the second example is a valid robots.txt configuration. Luckily, in the context of this year's Security chapter we can still use the data, since we're only interested in which endpoints are covered.

  2. This issue was also discussed in the original PR: https://github.com/HTTPArchive/legacy.httparchive.org/pull/219. It was decided to do the filtering at the querying stage, so that the data collected by the custom metric stays consistent over the years, because, as you mentioned, this kind of filtering is difficult to get perfect.
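In other words, the metric keeps the raw values, and a keyword filter is applied only at analysis time; purely to illustrate the idea (the real filtering happens in the analysis queries, and the keyword list here is illustrative only):

// Sketch: filter the stored disallow values by keyword at analysis time,
// so the collected data itself stays stable across years.
const SECURITY_KEYWORDS = ['login', 'signin', 'admin', 'auth', 'account', 'secure'];

function filterSecurityDisallows(collectedDisallows) {
  // collectedDisallows: array of raw disallow paths as stored by the metric.
  return collectedDisallows.filter(path =>
    SECURITY_KEYWORDS.some(kw => path.toLowerCase().includes(kw)));
}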

jroakes commented 3 years ago

Thanks @GJFR, I think you make a really good point for pt. 2 that I had not considered. FWIW, this is what I had tested, in case it is helpful. It solves pt. 1, and my solution for pt. 2 was to 1) include oauth in the keywords, and 2) exclude keyword matches that are directly preceded by a word character.

This also includes a regex that should match variant spacing around the directive, e.g.:

disallow : useragent
disallow:     useragent
    disallow:  useragent

But it doesn't clean each line first to handle commented values such as the following (I did do this in robots_txt.js):

disallow: /path/  #This is a secure route
// The test script itself ($('html').textContent is assumed to be the rendered
// robots.txt body once the browser has loaded /robots.txt):
let text = $('html').textContent;
let data = {'matched_disallows': {}};
// Keywords that suggest a security-sensitive route.
let KEYWORDS = [
  'login',
  'log-in',
  'signin',
  'sign-in',
  'admin',
  'auth',
  'oauth',
  'sso',
  'account',
  'secure'
];

// User-agents the current rule group applies to; a group may list several
// consecutive user-agent lines before its disallow rules.
let currUserAgents = [];
let last_rule = null;
// Matches the 'user-agent' / 'disallow' directive names, allowing optional
// whitespace before the colon.
const rule_rx = new RegExp(`(user-agent|disallow)(?=\\s*:)`, 'gi');
const kws = KEYWORDS.join('|');
// A keyword only counts when preceded by at least one non-word character,
// so '/login' matches but 'weblogin' does not.
const kw_rx = new RegExp(`[^\\w]+(${kws})`, 'gi');

for (let line of text.split('\n')) {

  let matched = line.toLowerCase().match(rule_rx);

  if (matched) {

    let rule = matched[0].trim();
    let value = line.slice(line.indexOf(':') + 1).trim();

    if (rule == 'user-agent') {

      // Consecutive user-agent lines belong to the same group; otherwise a
      // new group starts.
      if (last_rule == rule) {
        currUserAgents.push(value);
      } else {
        currUserAgents = [value];
      }

    } else if (rule == 'disallow') {

      if (value.match(kw_rx)) {

        // Attribute the matched disallow to every user-agent in the group.
        for (const currAgent of currUserAgents) {

          if (data['matched_disallows'][currAgent] === undefined) {
            data['matched_disallows'][currAgent] = [];
          }

          data['matched_disallows'][currAgent].push(value);
        }

      }
    }

    last_rule = rule;
  }

}

JSON.stringify(data);
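The comment cleaning mentioned above is not part of this script; a rough version of it (a sketch, not the actual robots_txt.js code) would be:

// Sketch: drop inline comments before matching, so a line like
// "disallow: /path/  #This is a secure route" becomes "disallow: /path/".
function stripComment(line) {
  const hashIndex = line.indexOf('#');
  return (hashIndex === -1 ? line : line.slice(0, hashIndex)).trim();
}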

Here is test output for LinkedIn: https://gist.github.com/jroakes/bb5ac67ee4584f58f5a7714d62605da1

I am not that great with JS, so I'm sharing this as a POC example only; it can certainly be improved.