apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Support bundling #2208

Open jo-sip opened 10 months ago

jo-sip commented 10 months ago

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/cheerio (CheerioCrawler)

Feature

I've been trying to deploy a scraper on AWS Lambda by bundling with esbuild, but I keep running into module errors: https://github.com/apify/crawlee/discussions/2199

I've tried importing from @crawlee/cheerio; since no browser is used, I figured this would be quite doable, but there have been challenges. I've also tried a few different ways of turning persistStorage off, as I thought that could be related to the header-generator error below.

E.g.

import { Configuration } from "@crawlee/basic";
import { CheerioCrawler } from "@crawlee/cheerio";
import type { APIGatewayProxyResult } from "aws-lambda";

const startUrls = ["https://www.google.com"];

export const handler = async (): Promise<APIGatewayProxyResult> => {
  Configuration.set("persistStorage", false);

  const crawler = new CheerioCrawler(
    {
      // requestHandler: router,
      maxRequestsPerCrawl: 1,
    },
    new Configuration({
      persistStorage: false,
    }),
  );

  await crawler.run(startUrls);

  return {
    statusCode: 200,
    body: "Success",
  };
};

npx esbuild ./src/aws/lib/lambda/scraper.handler.ts --bundle --outfile=./src/aws/lib/lambda/scraper.js --platform=node --keep-names

Run the bundled file in Node:

INFO  CheerioCrawler: Starting the crawler.
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. ENOENT: no such file or directory, open '[path_to_data_files]/headers-order.json'
Error: ENOENT: no such file or directory, open '[path_to_data_files]/headers-order.json' {"id":"[request_id]","url":"https://www.google.com","retryCount":1}
[Scraper_File]:[Line_Number]
      return gotScraping2(...args);
             ^
TypeError: gotScraping2 is not a function
    at gotScraping ([Scraper_File]:[Line_Number_at_function])
    at async [Scraper_File]:[Line_Number_at_async_call]

The culprit for this one seems to be the lazily loaded gotScraping, but I don't know why the headers file is being read at all. Minified, the bundle is ~2.1 MB, which isn't bad (better than a layer!).

Motivation

Ideal solution or implementation, and any additional constraints

It would make deployment much easier for consumers of crawlee, especially on serverless infra. I'm not an expert at bundling or writing JS libraries, but I don't think it's too far off from being bundleable. Lazy loading seems to be one issue, and it would also help to clearly document which packages should be imported from: importing from 'crawlee' rather than '@crawlee/cheerio' pulls in puppeteer deps, jsdom, and jquery (see link in the description).

E.g.

▲ [WARNING] "puppeteer/package.json" should be marked as external for use with "require.resolve" [require-resolve-not-external]

    node_modules/@crawlee/puppeteer/internals/utils/puppeteer_utils.js:190:37:
      190 │     const jsonPath = require.resolve('puppeteer/package.json');
          ╵                                      ~~~~~~~~~~~~~~~~~~~~~~~~

▲ [WARNING] "jquery" should be marked as external for use with "require.resolve" [require-resolve-not-external]

    node_modules/@crawlee/playwright/internals/utils/playwright-utils.js:35:35:
      35 │ const jqueryPath = require.resolve('jquery');
         ╵                                    ~~~~~~~~

▲ [WARNING] "./xhr-sync-worker.js" should be marked as external for use with "require.resolve" [require-resolve-not-external]

    node_modules/jsdom/lib/jsdom/living/xhr/XMLHttpRequest-impl.js:31:57:
      31 │ const syncWorkerFile = require.resolve ? require.resolve("./xhr-sync-worker.js") : null;
         ╵                                                          ~~~~~~~~~~~~~~~~~~~~~~

I believe these aren't needed for Cheerio.
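
One way to quiet these warnings when only CheerioCrawler is used might be to mark the browser-oriented packages as external. This is a rough sketch, not something the crawlee docs prescribe; the external list is just the packages the warnings above complain about, and it only works if they really are unused at runtime:

// build.ts - hypothetical esbuild config mirroring the CLI flags above;
// "external" packages are left out of the bundle entirely
import { build } from "esbuild";

await build({
  entryPoints: ["./src/aws/lib/lambda/scraper.handler.ts"],
  outfile: "./src/aws/lib/lambda/scraper.js",
  bundle: true,
  platform: "node",
  keepNames: true,
  external: ["puppeteer", "playwright", "jsdom", "jquery"],
});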

Alternative solutions or implementations

No response

Other context

No response

mmhanda commented 8 months ago

Hi, has anyone solved this problem? It still persists even with the latest version; anything would really help to get around this issue. The code is simple, but I keep getting the headers-order.json file-not-found error.

export const collectLinksActions = async (url: string) => {
  const crawler = new CheerioCrawler({
    async requestHandler({ $, request, enqueueLinks, pushData }) {
      const title = $('title').text();
      console.log(`The title of "${request.url}" is: ${title}.`);
    },
  });

  try {
    await crawler.run(['https://www.assemblyai.com']);
  } catch (error) {
    console.log({ error });
  }
};

INFO  CheerioCrawler: Starting the crawler.
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json'
  Error: ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json' {"id":"0pAwnIXW4YUDzrS","url":"https://www.assemblyai.com","retryCount":1}
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json'
  Error: ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json' {"id":"0pAwnIXW4YUDzrS","url":"https://www.assemblyai.com","retryCount":2}
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json'
  Error: ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json' {"id":"0pAwnIXW4YUDzrS","url":"https://www.assemblyai.com","retryCount":3}
ERROR CheerioCrawler: Request failed and reached maximum retries. Error: ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json'
    at Object.openSync (node:fs:581:18)
    at readFileSync (node:fs:457:35)
    at new HeaderGenerator (webpack-internal:///(rsc)/./node_modules/header-generator/header-generator.js:98:62)
    at eval (webpack-internal:///(rsc)/./node_modules/got-scraping/dist/index.js:168:17)
    at (rsc)/./node_modules/got-scraping/dist/index.js (/home/mhanda/true-assistant/.next/server/vendor-chunks/got-scraping.js:20:1)
    at Function.__webpack_require__ (/home/mhanda/true-assistant/.next/server/webpack-runtime.js:33:43)
    at async CheerioCrawler._requestFunction (webpack-internal:///(rsc)/./node_modules/@crawlee/http/internals/http-crawler.js:437:33)
    at async wrap (webpack-internal:///(rsc)/./node_modules/@apify/timeout/cjs/index.cjs:61:27)
  {"id":"0pAwnIXW4YUDzrS","url":"https://www.assemblyai.com","method":"GET","uniqueKey":"https://www.assemblyai.com"}
INFO  CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  CheerioCrawler: Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":4,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":5,"requestTotalDurationMillis":4,"requestsTotal":1,"crawlerRuntimeMillis":10160}
INFO  CheerioCrawler: Error analysis: {"totalErrors":1,"uniqueErrors":1,"mostCommonErrors":["1x: ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json' (rsc)"]}
INFO  CheerioCrawler: Finished! Total 1 requests: 0 succeeded, 1 failed. {"terminal":true}
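
A possible workaround for the Next.js case (an untested sketch; the option name varies between Next.js versions) is to tell Next.js not to bundle crawlee on the server, so header-generator keeps resolving its data files from node_modules:

// next.config.mjs - hypothetical config; on Next 13/14 the option lives under
// experimental.serverComponentsExternalPackages, newer versions expose a
// top-level serverExternalPackages instead
/** @type {import('next').NextConfig} */
const nextConfig = {
  experimental: {
    serverComponentsExternalPackages: ["crawlee", "header-generator", "got-scraping"],
  },
};

export default nextConfig;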

roll commented 7 months ago

Hi, here is my experience building for Cloudflare Workers (with CheerioCrawler):

build.ts

import { build } from "esbuild"
import { polyfillNode } from "esbuild-plugin-polyfill-node"

await build({
  entryPoints: ["main.ts"],
  bundle: true,
  outfile: "build/main.js",
  plugins: [polyfillNode({ polyfills: { crypto: true, fs: true } })],
  platform: "browser",
  external: ["jsdom", "node:path/win32", "@puppeteer", "puppeteer-core"],
})

Build errors:

✘ [ERROR] No matching export in "../node_modules/@jspm/core/nodelibs/browser/events.js" for import "errorMonitor"

    ../node_modules/@szmarczak/http-timer/dist/source/index.js:1:9:
      1 │ import { errorMonitor } from 'events';
        ╵          ~~~~~~~~~~~~

✘ [ERROR] No matching export in "../node_modules/esbuild-plugin-polyfill-node/polyfills/empty.js" for import "V4MAPPED"

    ../node_modules/cacheable-lookup/source/index.js:2:1:
      2 │   V4MAPPED,
        ╵   ~~~~~~~~

✘ [ERROR] No matching export in "../node_modules/esbuild-plugin-polyfill-node/polyfills/empty.js" for import "ADDRCONFIG"

    ../node_modules/cacheable-lookup/source/index.js:3:1:
      3 │   ADDRCONFIG,
        ╵   ~~~~~~~~~~

✘ [ERROR] No matching export in "../node_modules/esbuild-plugin-polyfill-node/polyfills/empty.js" for import "ALL"

    ../node_modules/cacheable-lookup/source/index.js:4:1:
      4 │   ALL,
        ╵   ~~~

✘ [ERROR] No matching export in "../node_modules/esbuild-plugin-polyfill-node/polyfills/empty.js" for import "promises"

    ../node_modules/cacheable-lookup/source/index.js:5:1:
      5 │   promises as dnsPromises,
        ╵   ~~~~~~~~

✘ [ERROR] No matching export in "../node_modules/esbuild-plugin-polyfill-node/polyfills/empty.js" for import "lookup"

    ../node_modules/cacheable-lookup/source/index.js:6:1:
      6 │   lookup as dnsLookup
        ╵   ~~~~~~

✘ [ERROR] No matching export in "../node_modules/esbuild-plugin-polyfill-node/polyfills/empty.js" for import "checkServerIdentity"

    ../node_modules/got-scraping/node_modules/got/dist/source/core/options.js:3:9:
      3 │ import { checkServerIdentity } from 'node:tls';
        ╵          ~~~~~~~~~~~~~~~~~~~

✘ [ERROR] No matching export in "../node_modules/@jspm/core/nodelibs/browser/http.js" for import "ServerResponse"

    ../node_modules/got-scraping/node_modules/got/dist/source/core/index.js:4:15:
      4 │ import http, { ServerResponse } from 'node:http';
        ╵                ~~~~~~~~~~~~~~

✘ [ERROR] No matching export in "../node_modules/@jspm/core/nodelibs/browser/http.js" for import "OutgoingMessage"

    ../node_modules/got-scraping/dist/index.js:17:9:
      17 │ import { OutgoingMessage } from "node:http";
         ╵          ~~~~~~~~~~~~~~~

I'm not really experienced with esbuild bundling, so I'm curious whether there is some fatal blocker or whether these are just small obstacles that can be overcome to make crawlee fully bundleable.

B4nan commented 7 months ago

platform: "browser",

I guess this by itself won't help, but crawlee is a Node.js framework; you can't run it in a browser, so compiling it with platform: 'browser' seems just wrong.

Cloudflare Workers

Cloudflare workers don't run on native Node.js, so that's another possible problem. See https://developers.cloudflare.com/workers/runtime-apis/nodejs/

roll commented 7 months ago

@B4nan Thanks! So the idea here is that esbuild needs to generate a single script that can run in browser/edge/worker environments. Since it uses esbuild-plugin-polyfill-node, it's meant to completely polyfill all the required Node APIs. It doesn't work at the moment, but I'm not sure whether there is some critical, non-polyfillable dependency (i.e. a reason why it conceptually can't work on the edge).

crawlee is certainly built for Node, but the same is true of many other libraries that nowadays run in edge environments via polyfills and compat layers. Cloudflare recently removed script duration limitations and changed the pricing model to charge only for compute time, which feels great for crawling tasks, so I tried to migrate our project there (not successfully atm, though).

BTW, Cloudflare terminology is quite confusing. They have two modes:

I guess the (2) option won't ever work with crawlee, but I hope the (1) one still might at some point.

PS: Regarding our pipeline migration to Cloudflare, I think crawlee is now the only dependency we can't make work, as we managed to get Postgres, logging, etc. running there. Yeah, polars is also cumbersome.

roll commented 6 months ago

BTW, my initial comment was a bit off-topic, as it probably needs a separate (long-shot) feature request issue, something like "Support @crawlee/cheerio bundling for browser/edge environments".

Regarding @crawlee/cheerio being bundleable for the Node runtime, we made it work with the following setup:

package.json

  "scripts": {
    "build": "npm run build:bundle && npm run build:copy",
    "build:bundle": "vite-node build.ts",
    # header-generator uses runtime file reading (see the initial issue)
    "build:copy": "cp -r node_modules/header-generator/data_files -t build",
  },
"dependencies": {
  # downgraded because got-scraping@4 is dynamically imported (see the initial issue)
  "@crawlee/cheerio": "3.5.8" 
}

app.js

const crawler = new CheerioCrawler(
  { /* ... */ },
  new Configuration({
    persistStorage: false,
  }),
);

build.ts

import { build } from "esbuild"

await build({
  entryPoints: ["functions/*.ts"],
  outdir: "build",
  external: ["@azure/functions-core"],
  inject: ["shim.ts"],
  format: "esm",
  bundle: true,
  minify: true,
  keepNames: true,
  platform: "node",
  target: "node20",
  logLevel: "info",
})

shim.ts

import { createRequire } from "node:module"
import path from "node:path"
import url from "node:url"

// Recreate the CommonJS globals (require, __filename, __dirname) that some
// bundled dependencies still expect, since the bundle is emitted as ESM.
globalThis.require = createRequire(import.meta.url)
globalThis.__filename = url.fileURLToPath(import.meta.url)
globalThis.__dirname = path.dirname(__filename)

With this or a similar setup, crawlee builds into a single script and runs, e.g., on Azure Functions.

KiaFathi commented 3 months ago

+1 Running into a similar issue with Next.js

WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. ENOENT: no such file or directory, open '{MY_APP}/.next/server/vendor-chunks/data_files/headers-order.json'
adamscybot commented 2 months ago

I'm running into this. Using cheerio crawler and bundling for node.

I have errors coming from the playwright modules (missing jquery, etc.), which I don't use, but they are imported because I have to import some things from crawlee and that pulls in everything.

The problem would be greatly reduced if:

  1. "sideEffects": false was used throughout crawlee packages. Esbuild picks up on this.
  2. Crawlee was using native ESM modules. Much to my surprise the "ESM" support is actually just a wrapper that calls the commonjs modules.

With these two things true, we'd have actual bundle shaking and be able to at least solve this incrementally for whatever is broken for each crawler bit by bit, and not in one go.

As for a short-term fix, I'm probably going to use pnpm patch to strip out the stuff I don't need.
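
For reference, the pnpm patch workflow is roughly the following (the package name and version here are illustrative; the actual edit is whatever unused requires you want to remove):

# open a writable copy of the package; pnpm prints a temp directory to edit
pnpm patch @crawlee/playwright@3.8.2
# edit the files in that directory, then persist the changes as a patch file
pnpm patch-commit /path/printed/by/the/previous/command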

Patrick-Erichsen commented 1 month ago

+1 on this, getting similar errors with jquery while using @crawlee/puppeteer.

I attempted the steps here: https://github.com/apify/crawlee/issues/2208#issuecomment-1987270051

This, along with some other manual copies for jquery and a few other deps, still did not resolve my bundling issues. I ended up with a cryptic error looking for a utils.js I couldn't track down.

Knight-H commented 1 month ago

+1 I also have the same problem when using esbuild.

However, for anyone using AWS CDK and esbuild to deploy their crawlee Lambda, you can use the bundling.nodeModules option to avoid bundling crawlee altogether (I know it's not a real solution, but I hope it helps someone avoid the frustration I went through).

For example:

import { aws_lambda as lambda, aws_lambda_nodejs as nodejs } from "aws-cdk-lib";

const crawleeFn = new nodejs.NodejsFunction(this, "CrawleeLambda", {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: "lambdaHandler",
  entry: "./../src/main.ts",
  bundling: {
    // install crawlee into the asset's node_modules instead of bundling it
    nodeModules: ["crawlee"],
    // modules excluded from the bundle entirely
    externalModules: ["@aws-sdk/*", "aws-lambda", "stream"],
  },
});
rafaelslopes1 commented 3 days ago

+1 on this, getting similar errors with jquery while using @crawlee/puppeteer.

I tried the steps here: #2208 (comment)

This, along with some other manual copies for jquery and a few other deps, still did not resolve my bundling issues. I ended up with a cryptic error looking for a utils.js I couldn't track down.

I'm also having a problem with jquery when trying to run my application on AWS Lambda.