apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
14.85k stars 618 forks

Add support for Bun runtime - Issue with `browser-pool` and `memory-storage` packages #2046

Open JuroOravec opened 1 year ago

JuroOravec commented 1 year ago

Which package is the feature request for? If unsure which one to select, leave blank

None

Feature

Hi, I'm cross-posting the notes from https://github.com/apify/proxy-chain/issues/521 (below), where I tried to run an Apify/Crawlee Playwright scraper in the Bun runtime.

TL;DR - Currently it's not possible to run Apify/Crawlee scrapers with Bun. Crawlee relies on (at least) 2 Node.js APIs that Bun doesn't support yet, and there is (at least) 1 error on the Playwright side.


  1. There was an error in @crawlee/browser-pool/proxy-server.js on the line

    server.server.unref();

    I looked into it. The unref here refers to http.Server.unref. For some reason, this isn't defined in Bun, and this seems to be a genuine error on their side (it's not even mentioned in their docs).

  2. Out of curiosity, I just commented out that line to see if I could get the crawler to work. It printed the initial log with system info:

    INFO  System info 
    {"apifyVersion":"3.1.4","apifyClientVersion":"2.7.1","crawleeVersion":"3.3.1","osType":"Darwin","nodeVersion":"v18.15.0"}

    However, the run still ended in an error. Here, promises_1.opendir refers to fs.promises.opendir (node:fs). Unfortunately, none of the opendir functions are currently defined in Bun (fs.opendirSync, fs.opendir, fs.promises.opendir).

    ERROR (0, promises_1.opendir) is not a function. (In '(0, promises_1.opendir)(keyValueStoreDir)', '(0, promises_1.opendir)' is undefined)
      TypeError: (0, promises_1.opendir) is not a function. (In '(0, promises_1.opendir)(keyValueStoreDir)', '(0, promises_1.opendir)' is undefined)
          at <anonymous> (/Users/presenter/repos/apify-actor-facebook/node_modules/@crawlee/memory-storage/cache-helpers.js:110:25)

  3. I managed to start a Playwright crawler in Bun with the following changes to the Apify packages:

    • I commented out the server.server.unref(); line in @crawlee/browser-pool/proxy-server.js
    • I replaced fs.promises.opendir(dirName) with fs.promises.readdir(dirName, { withFileTypes: true }) in @crawlee/memory-storage/cache-helpers.js
      • NOTE: The good thing is that, with the withFileTypes: true option, both calls give you Dirent objects to iterate over: opendir resolves to a Dir (an async iterable of Dirent), and readdir resolves to an array of Dirent. The bad thing is that, from my understanding, opendir yields the entries one by one as they are read, whereas readdir resolves only once all entries have been read. So replacing opendir with readdir might add extra waiting time for large directories.
  4. With the changes from step 3, I managed to start a Playwright crawler, up to the point where the Playwright command was executed. After that, there is an issue on the Playwright side with child_process.spawn. You can find more about that issue here:
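
For reference, the two Crawlee-side workarounds from step 3 could be sketched roughly as below, using feature detection instead of commenting code out. This is only a sketch and not the actual Crawlee code: the helper names unrefIfSupported and iterateDirents are hypothetical, and it assumes the missing functions are simply undefined in the runtime (as the errors above suggest).

```javascript
const fs = require('node:fs');
const http = require('node:http');

// Hypothetical helper (not part of Crawlee): call unref() only when the
// runtime actually provides it, instead of commenting the call out.
function unrefIfSupported(server) {
    if (typeof server.unref === 'function') {
        server.unref();
        return true;
    }
    return false;
}

// Hypothetical helper (not part of Crawlee): yield Dirent objects, using the
// streaming opendir when it exists and falling back to readdir otherwise.
async function* iterateDirents(dirName) {
    if (typeof fs.promises.opendir === 'function') {
        // Dir is an async iterable of Dirent; for await closes it when done.
        const dir = await fs.promises.opendir(dirName);
        for await (const entry of dir) {
            yield entry;
        }
        return;
    }
    // Fallback: resolves only after the whole directory has been read.
    const entries = await fs.promises.readdir(dirName, { withFileTypes: true });
    for (const entry of entries) {
        yield entry;
    }
}

// Small demo: list the current directory and unref a not-yet-listening server.
async function main() {
    for await (const entry of iterateDirents('.')) {
        console.log(entry.isDirectory() ? 'dir ' : 'file', entry.name);
    }
    console.log('unref supported:', unrefIfSupported(http.createServer()));
}

main().catch(console.error);
```

Callers iterate with for await in both cases, so the readdir fallback changes only when the entries become available, not how they are consumed.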

Motivation

Make Crawlee scrapers more performant by using the Bun runtime instead of Node.

Ideal solution or implementation, and any additional constraints

Be able to run Crawlee scrapers with Bun. However, Bun is still experimental, so this is a slow-burner.

Alternative solutions or implementations

No response

Other context

No response

B4nan commented 1 year ago

We'd definitely want to support Bun (as well as Deno) at some point, but as you already pointed out, it will be mostly about them providing the missing APIs rather than us changing something.

Also, keep in mind that the speed difference will most probably not be measurable when it comes to the actual scraping - the slowness comes from the network traffic (doing requests) and proxy usage, not from slow JavaScript execution.

colinhacks commented 11 months ago

As noted, this is an issue with the Bun runtime, feel free to close @B4nan 👍

This can be tracked here: https://github.com/oven-sh/bun/issues/5606

danielgwilson commented 10 months ago

+1