HTTPArchive / legacy.httparchive.org

<<THIS REPOSITORY IS DEPRECATED>> The HTTP Archive provides information about website performance such as # of HTTP requests, use of gzip, and amount of JavaScript. This information is recorded over time revealing trends in how the Internet is performing. Built using Open Source software, the code and data are available to everyone allowing researchers large and small to work from a common base.
https://legacy.httparchive.org

Add a shorter timeout for fetches in custom metrics #192

Closed rviscomi closed 4 years ago

rviscomi commented 4 years ago

An asynchronous fetch in a custom metric could take ~30 seconds before timing out. Rather than wait for the promise to reject, race the fetch against a shorter timeout of ~5-10 seconds and resolve the promise at the sooner of the two async events.

This would help ensure that the custom metrics don't interfere as much with the overall crawl rate, as 30 seconds × 7 million URLs × 2 runs per client (desktop, mobile) definitely adds up.
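
As a rough illustration of how that adds up, the sketch below multiplies out the figures quoted above for a 30 second and a 5 second timeout. It assumes every URL hits the full timeout on every run, which overstates the real cost, and the function and constant names are made up for this example.

```javascript
// Worst-case time spent waiting on hanging fetches, per client.
// Assumption: every URL waits for the full timeout on every run.
const URLS = 7e6;            // ~7 million URLs, as quoted above
const RUNS_PER_CLIENT = 2;   // 2 runs per client, as quoted above
const SECONDS_PER_DAY = 86400;

function worstCaseWaitSeconds(timeoutSeconds) {
  return timeoutSeconds * URLS * RUNS_PER_CLIENT;
}

for (const timeout of [30, 5]) {
  const days = worstCaseWaitSeconds(timeout) / SECONDS_PER_DAY;
  console.log(`${timeout}s timeout: up to ${Math.round(days)} machine-days of waiting per client`);
}
```

Under these assumptions the upper bound drops from roughly 4,861 machine-days to roughly 810 when the timeout goes from 30 to 5 seconds.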

Here are the instances of fetch in the custom metrics: https://github.com/search?q=fetch+repo%3AHTTPArchive%2Flegacy.httparchive.org+path%3Acustom_metrics&type=Code&ref=advsearch&l=&l=

Tiggerito commented 4 years ago

I'm happy to reduce it for the robotstxt one. If it takes 5 seconds to return a simple text file, I'd classify that as an error on its own.

rviscomi commented 4 years ago

Thanks @Tiggerito. Are you able to take all 3 files? Should be the same pattern in each.

Tiggerito commented 4 years ago

> Thanks @Tiggerito. Are you able to take all 3 files? Should be the same pattern in each.

I'd have to research how to do it. This looks like a neat solution that could be put in a shared place. I think I saw that we can include js files?

https://www.lowmess.com/blog/fetch-with-timeout/

I could test it with my metric first?

rviscomi commented 4 years ago

Here's a prototype of the JS I had in mind:

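// Simulate a slow fetch (30 s) and a shorter timeout (5 s); whichever settles first wins.
const slowFetch = new Promise(resolve => setTimeout(resolve, 30000, 'fetch'));
const timeout = new Promise(resolve => setTimeout(resolve, 5000, 'timeout'));
Promise.race([slowFetch, timeout]).then(value => console.log(value)); // logs 'timeout'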

Shouldn't need to include external JS to do it.

Tiggerito commented 4 years ago

I tested using this (from the article I referenced) in WebPageTest and it worked well:

const fetchWithTimeout = (uri, options = {}, time = 5000) => {
  const controller = new AbortController()
  const config = { ...options, signal: controller.signal }
  setTimeout(() => {
    controller.abort()
  }, time)
  return fetch(uri, config)
    .then((response) => {
      if (!response.ok) {
        throw new Error(`${response.status}: ${response.statusText}`)
      }
      return response
    })
    .catch((error) => {
      if (error.name === 'AbortError') {
        throw new Error('Response timed out')
      }
      throw new Error(error.message)
    })
}

If I set a small timeout it returns:

{"message":"Response timed out","error":{}}

Which we could easily alter. Do we have a standard thing to return when custom metrics fail?

One advantage of this pattern is that it cancels the request on timeout, so no risk of having forgotten requests continuing to be processed.

It's also easy to plug in. Add the code and change the fetch(url) to a fetchWithTimeout(url), and it works.
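
To illustrate the cancellation point made above, here is a minimal sketch that contrasts a plain Promise.race timeout with an AbortController timeout. It uses `fakeFetch`, a hypothetical stand-in for fetch invented for this example (not a real API), so it can run without a network; `raceOnly` and `withAbort` are likewise illustrative names.

```javascript
// Hypothetical stand-in for fetch: resolves after `ms` unless its signal aborts first.
function fakeFetch(ms, signal) {
  const state = { aborted: false };
  const promise = new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve('response'), ms);
    if (signal) {
      signal.addEventListener('abort', () => {
        clearTimeout(timer);            // the pending work is actually stopped
        state.aborted = true;
        const err = new Error('aborted');
        err.name = 'AbortError';
        reject(err);
      });
    }
  });
  return { promise, state };
}

// Race only: the timeout can win, but nothing tells the request to stop.
function raceOnly(ms, limit) {
  const request = fakeFetch(ms);
  const timeout = new Promise(resolve => setTimeout(resolve, limit, 'timeout'));
  return { result: Promise.race([request.promise, timeout]), state: request.state };
}

// Abort: the timer aborts the controller, which rejects and cancels the request.
function withAbort(ms, limit) {
  const controller = new AbortController();
  setTimeout(() => controller.abort(), limit);
  const request = fakeFetch(ms, controller.signal);
  return { result: request.promise, state: request.state };
}
```

With a slow request and a short limit, `raceOnly` resolves to 'timeout' while the request keeps running in the background, whereas `withAbort` rejects with an AbortError and the request is stopped.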

rviscomi commented 4 years ago

Well not to play favorites (I'm totally playing favorites 😁) but the Promise approach can also be implemented as a fetchWithTimeout function and is much simpler:

function fetchWithTimeout(url) {
  var network = fetch(url);
  var timeout = new Promise(resolve => setTimeout(resolve, 5000, 'timeout'));
  return Promise.race([network, timeout]).then(r => {
    if (r == 'timeout') return Promise.reject(r);
    return r;
  });
}

Tiggerito commented 4 years ago

Now I understand promises more 😀

I'll raise your simplification:

function fetchWithTimeout(url) {
  var controller = new AbortController();
  setTimeout(() => {controller.abort()}, 5000);
  return fetch(url, {signal: controller.signal});
}

rviscomi commented 4 years ago

Hey @Tiggerito sorry for the delay, your function LGTM. Are you able to apply that to each fetch instance? Hoping to get this in today before the October crawl starts.
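
For illustration, applying the helper at one fetch call might look like the sketch below. This is a variation under stated assumptions, not the code that was merged: the `ms` parameter, the timer cleanup, the `getRobotsTxt` name, and the shape of the returned object are all made up for this example.

```javascript
// Variation on the helper from this thread. Assumptions: the delay is a parameter
// (default 5000 ms) and the timer is cleared once the fetch settles.
function fetchWithTimeout(url, ms = 5000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  return fetch(url, { signal: controller.signal })
    .finally(() => clearTimeout(timer));
}

// Hypothetical call site: the function name and result shape are illustrative only.
function getRobotsTxt(url = '/robots.txt', ms = 5000) {
  return fetchWithTimeout(url, ms)
    .then(response => response.text())
    .then(text => ({ length: text.length }))
    .catch(error => ({
      // fetch rejects with an AbortError when the controller aborts the request
      error: error.name === 'AbortError' ? 'timed out' : error.message
    }));
}
```

Catching the rejection keeps a timeout from surfacing as an unhandled error, and the call site can still tell a timeout apart from other failures.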

Tiggerito commented 4 years ago

Looks like today is an HTTP Archive day. Will get onto it.

Tiggerito commented 4 years ago

Testing the code now.

third-parties.js contains a fetch, but it is auto-generated code built by bin/library-detector.js using what looks like another repository. The fetch appears to be used in relation to the serviceWorker, so it's not a trivial one to alter.

Only thing I can think of is to update the builder to include code that intercepts the fetch. Something like:

let originalFetch = fetch;

fetch = function(url, options = {}) {
  var controller = new AbortController();
  setTimeout(() => {controller.abort()}, 5000);
  // Copy the options so the caller's object is not mutated; default to {} when omitted.
  return originalFetch(url, {...options, signal: controller.signal});
}

rviscomi commented 4 years ago

These should be the only custom metrics with fetch: https://github.com/search?q=fetch+repo%3AHTTPArchive%2Flegacy.httparchive.org+path%3Acustom_metrics&type=Code&ref=advsearch&l=&l=

The code that generates the third parties script uses fetch, but it's not part of the custom metric code itself.

Tiggerito commented 4 years ago

Cool. Working on the last one now: sass.

rviscomi commented 4 years ago

Synced the HA server with the changes in #193 so this should take effect in the October crawl starting tomorrow. Thank you again for hopping on this @Tiggerito 🙏