apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.33k stars 657 forks source link

use AbortController / AbortSignal for async code #1102

Closed pocesar closed 2 years ago

pocesar commented 3 years ago

Describe the feature Use AbortController when executing two promises usually a timeout with a promise that might never finish and/or finish unpredictably in the future. This might also help with Node 15+ forcefully crashing the process on unhandled promises exceptions, since when the 'stuck' promise decides to work, the state is already different and can cause undefined behavior.

Motivation As we are moving to Node 15+, it's time to properly address the race conditions that happen when using async code. At the moment, it either piles promises that never complete or just hope they don't execute, leads to very hard to debug racing conditions or stuck runs, the RequestQueue issue being a recurrent one (negative pending requests), before it's too late 😅 (I see you #1100).

Constraints Will require refactoring around some code that relies on dumb timeouts or intervals.

B4nan commented 2 years ago

Made some progress on this, here is proposed way to handle aborting via AbortController that leverages AsyncLocalStorage to bare the abort controller instances inside the handler.

const storage = new AsyncLocalStorage();

export function tryCancel() {
  const { signal } = storage.getStore().cancelTask;

  if (signal.aborted) {
    // cancel the promise by throwing, this error will be ignored if the promise timed out already
    throw new Error();
  }
}

async function timeout(time, errorMessage) {
  try {
    await setTimeout(time, undefined, { signal: storage.getStore().cancelTimeout.signal });
    storage.getStore().cancelTask.abort();
  } catch {
    // ignore rejections (task finished and timeout has been cancelled)
    return;
  }

  throw new Error(errorMessage); // timeout error that will be shown
}

async function addTimeoutToPromise(handler, time, error) {
  const cancelTimeout = new AbortController();
  const cancelTask = new AbortController();

  const wrap = async () => {
    try {
      await handler();
    } finally {
      storage.getStore().cancelTimeout.abort();
    }
  };

  await new Promise((resolve, reject) => {
    storage.run({ cancelTimeout, cancelTask }, () => {
      Promise.race([
        timeout(time, error),
        wrap(),
      ]).then(resolve, reject);
    });
  });
}

In practice we would need to only add the tryCancel() calls after each await in the handler, rest should be handled automatically.

async function doSomething() {
  await setTimeout(200); // do something async 
  tryCancel(); // check if we are aborted

  await setTimeout(200); // do something async 
  tryCancel(); // check if we are aborted

  await setTimeout(200); // do something async 
  tryCancel(); // check if we are aborted

  await setTimeout(200); // do something async 
  tryCancel(); // check if we are aborted
}

await addTimeoutToPromise(
  // we need to pass callbacks, not raw promises, so they are executed after the ALS context is created
  () => doSomething(),
  1000,
  'timed out',
);

If all we care about is cancelling the handler. Sometimes we might want to do some cleanup, so the tryCancel could also have a callback parameter.

Testing script here. Will try to incorporate this into the crawlers now.

edit: it needs few smaller improvements, but looks like it is working! (testing with the repro from #1216). also we might need to actually return instead of throwing via tryCancel

B4nan commented 2 years ago

Actually throwing works fine, just need to catch that particular error in the wrap handler. This will need to be handled in the browser pool too, so we can cancel as early as possible. We will need a shared ALS instance (for both SDK and browser pool, and possibly other packages once we have it split in the monorepo), so will make this a new package in the apify-shared repo first.

Got it hacked locally and with that repro, it does not even try to open the browser due to the timeout (and keeps retrying without any other error logs), but it literally means we need to do the tryCancel() after every single await on given path.

B4nan commented 2 years ago

https://github.com/apify/apify-shared-js/pull/267

mnmkng commented 2 years ago

// we need to pass callbacks, not raw promises, so they are executed after the ALS context is created

Instead of changing this function, I would create a new one addTimeoutToAsyncFunction or something like that. It's good that we'll do this. I think @pocesar found an issue with the timeouts that the promise can get rejected in between the time where it's created and when the handlers are attached to it inside addTimeoutToPromise leading to unhandled rejections.

mnmkng commented 2 years ago

Btw, this would be a nice blog post. I don't think there were a lot of good resources about this when we first started looking into it.

B4nan commented 2 years ago

It will be a new package, @apify/timeout, I won't be changing the existing implementation of addTimeoutToPromise. I kinda like that name, if we are to pick something else to reduce confusion with the existing method, I would maybe go with something less verbose, like just addTimeout(). addTimeoutToAsyncFunction sounds quite verbose :D ~Btw the handler does not need to be async, the signature requires it but it will work fine with a regular function too.~ (edit: but that would obviously never allow timeouting this way)

Agreed on the blog post, definitely stuff worth sharing, weird that this combo is not mentioned in any articles about abort controller, I guess it's too cutting edge.