jo-sip opened this issue 10 months ago
Hi, has anyone managed to solve this problem? It still persists even with the latest version, and anything that helps me get around this issue would be really appreciated. The code is simple, but I keep getting an error that the headers-order.json file was not found:
import { CheerioCrawler } from 'crawlee';

export const collectLinksActions = async (url: string) => {
    const crawler = new CheerioCrawler({
        async requestHandler({ $, request, enqueueLinks, pushData }) {
            const title = $('title').text();
            console.log(`The title of "${request.url}" is: ${title}.`);
        },
    });

    try {
        await crawler.run(['https://www.assemblyai.com']);
    } catch (error) {
        console.log({ error });
    }
};
INFO  CheerioCrawler: Starting the crawler.
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json'
Error: ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json' {"id":"0pAwnIXW4YUDzrS","url":"https://www.assemblyai.com","retryCount":1}
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json'
Error: ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json' {"id":"0pAwnIXW4YUDzrS","url":"https://www.assemblyai.com","retryCount":2}
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json'
Error: ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json' {"id":"0pAwnIXW4YUDzrS","url":"https://www.assemblyai.com","retryCount":3}
ERROR CheerioCrawler: Request failed and reached maximum retries.
Error: ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json'
    at Object.openSync (node:fs:581:18)
    at readFileSync (node:fs:457:35)
    at new HeaderGenerator (webpack-internal:///(rsc)/./node_modules/header-generator/header-generator.js:98:62)
    at eval (webpack-internal:///(rsc)/./node_modules/got-scraping/dist/index.js:168:17)
    at (rsc)/./node_modules/got-scraping/dist/index.js (/home/mhanda/true-assistant/.next/server/vendor-chunks/got-scraping.js:20:1)
    at Function.__webpack_require__ (/home/mhanda/true-assistant/.next/server/webpack-runtime.js:33:43)
    at async CheerioCrawler._requestFunction (webpack-internal:///(rsc)/./node_modules/@crawlee/http/internals/http-crawler.js:437:33)
    at async wrap (webpack-internal:///(rsc)/./node_modules/@apify/timeout/cjs/index.cjs:61:27)
{"id":"0pAwnIXW4YUDzrS","url":"https://www.assemblyai.com","method":"GET","uniqueKey":"https://www.assemblyai.com"}
INFO  CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  CheerioCrawler: Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":4,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":5,"requestTotalDurationMillis":4,"requestsTotal":1,"crawlerRuntimeMillis":10160}
INFO  CheerioCrawler: Error analysis: {"totalErrors":1,"uniqueErrors":1,"mostCommonErrors":["1x: ENOENT: no such file or directory, open '/home/mhanda/true-assistant/.next/server/vendor-chunks/data_files/headers-order.json' (rsc)"]}
INFO  CheerioCrawler: Finished! Total 1 requests: 0 succeeded, 1 failed. {"terminal":true}
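The stack trace shows header-generator reading data_files/headers-order.json from disk at runtime, which breaks once webpack relocates the module into .next/server/vendor-chunks. For the Next.js case, a common workaround is to keep crawlee and its dependencies out of the server bundle so the file is resolved from the real node_modules location instead. A minimal sketch, assuming Next.js 13/14 where the option lives under experimental.serverComponentsExternalPackages (newer versions rename it to serverExternalPackages):

// next.config.js — workaround sketch, not from the thread
/** @type {import('next').NextConfig} */
const nextConfig = {
    experimental: {
        // Load these packages from node_modules at runtime instead of
        // bundling them, so header-generator can find its data_files.
        serverComponentsExternalPackages: ['crawlee', 'header-generator', 'got-scraping'],
    },
};

module.exports = nextConfig;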
Hi, here is my experience building for Cloudflare Workers (with CheerioCrawler):
build.ts
import { build } from "esbuild"
import { polyfillNode } from "esbuild-plugin-polyfill-node"

await build({
    entryPoints: ["main.ts"],
    bundle: true,
    outfile: "build/main.js",
    plugins: [polyfillNode({ polyfills: { crypto: true, fs: true } })],
    platform: "browser",
    external: ["jsdom", "node:path/win32", "@puppeteer", "puppeteer-core"],
})
Build errors:
✘ [ERROR] No matching export in "../node_modules/@jspm/core/nodelibs/browser/events.js" for import "errorMonitor"
../node_modules/@szmarczak/http-timer/dist/source/index.js:1:9:
1 │ import { errorMonitor } from 'events';
╵ ~~~~~~~~~~~~
✘ [ERROR] No matching export in "../node_modules/esbuild-plugin-polyfill-node/polyfills/empty.js" for import "V4MAPPED"
../node_modules/cacheable-lookup/source/index.js:2:1:
2 │ V4MAPPED,
╵ ~~~~~~~~
✘ [ERROR] No matching export in "../node_modules/esbuild-plugin-polyfill-node/polyfills/empty.js" for import "ADDRCONFIG"
../node_modules/cacheable-lookup/source/index.js:3:1:
3 │ ADDRCONFIG,
╵ ~~~~~~~~~~
✘ [ERROR] No matching export in "../node_modules/esbuild-plugin-polyfill-node/polyfills/empty.js" for import "ALL"
../node_modules/cacheable-lookup/source/index.js:4:1:
4 │ ALL,
╵ ~~~
✘ [ERROR] No matching export in "../node_modules/esbuild-plugin-polyfill-node/polyfills/empty.js" for import "promises"
../node_modules/cacheable-lookup/source/index.js:5:1:
5 │ promises as dnsPromises,
╵ ~~~~~~~~
✘ [ERROR] No matching export in "../node_modules/esbuild-plugin-polyfill-node/polyfills/empty.js" for import "lookup"
../node_modules/cacheable-lookup/source/index.js:6:1:
6 │ lookup as dnsLookup
╵ ~~~~~~
✘ [ERROR] No matching export in "../node_modules/esbuild-plugin-polyfill-node/polyfills/empty.js" for import "checkServerIdentity"
../node_modules/got-scraping/node_modules/got/dist/source/core/options.js:3:9:
3 │ import { checkServerIdentity } from 'node:tls';
╵ ~~~~~~~~~~~~~~~~~~~
✘ [ERROR] No matching export in "../node_modules/@jspm/core/nodelibs/browser/http.js" for import "ServerResponse"
../node_modules/got-scraping/node_modules/got/dist/source/core/index.js:4:15:
4 │ import http, { ServerResponse } from 'node:http';
╵ ~~~~~~~~~~~~~~
✘ [ERROR] No matching export in "../node_modules/@jspm/core/nodelibs/browser/http.js" for import "OutgoingMessage"
../node_modules/got-scraping/dist/index.js:17:9:
17 │ import { OutgoingMessage } from "node:http";
╵ ~~~~~~~~~~~~~~~
I'm not really experienced with esbuild bundling, so I'm curious whether there is some fatal blocker here, or whether these are just small obstacles that can be overcome to make crawlee fully bundleable.
platform: "browser",
I guess this itself won't help, but crawlee is a Node.js framework, you can't run it in a browser, so compiling it with platform: 'browser' seems just wrong.
Cloudflare Workers
Cloudflare Workers don't run on native Node.js, so that's another possible problem. See https://developers.cloudflare.com/workers/runtime-apis/nodejs/
@B4nan Thanks! So the idea here is that esbuild needs to generate a single script that can run in browser/edge/worker environments. Since it uses esbuild-plugin-polyfill-node, it's meant to completely polyfill all the required Node APIs. It doesn't work at the moment, but I'm not sure if there is a critical non-polyfillable dependency (i.e. a reason why it conceptually can't work on the edge).

crawlee is for sure built for Node, but the same is true of many other libraries that nowadays run in edge environments via polyfills and compat layers. Recently, Cloudflare removed the script duration limitations and changed the pricing model to charge only for compute time, which feels really great for crawling tasks, so I tried to migrate our project there (not successfully so far, though).
BTW, Cloudflare terminology is quite confusing. They have two modes:

(1) node_compat -- they polyfill the Node API at build time (similar to what I shared above).
(2) compatibility_flags "nodejs_compat" -- they provide some very limited Node API at runtime.

Option (2) will probably never work with crawlee, I guess, but I hope option (1) still might at some point.
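For reference, a sketch of how the two modes are selected in wrangler.toml (flag names as documented by Cloudflare; node_compat is the older build-time option and may be deprecated in newer Wrangler versions):

# wrangler.toml -- sketch of the two Node-compat modes
# (1) build-time polyfills injected by Wrangler's bundler
node_compat = true

# (2) a limited set of Node APIs provided by the Workers runtime
# compatibility_flags = ["nodejs_compat"]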
PS. Regarding our pipeline migration to Cloudflare, I think crawlee is now the only dependency we can't make work, as we managed to get Postgres, logging, etc. running there. Yeah, polars is also cumbersome...
BTW, my initial comment was a little bit off-topic, as it probably deserves a separate (long-shot) feature request issue, something like "Support @crawlee/cheerio bundling for browser/edge environments".
Regarding @crawlee/cheerio being bundleable for the Node runtime, we made it work with the following setup:
package.json
"scripts": {
"build": "npm run build:bundle && npm run build:copy",
"build:bundle": "vite-node build.ts",
# header-generator uses runtime file reading (see the initial issue)
"build:copy": "cp -r node_modules/header-generator/data_files -t build",
},
"dependencies": {
# downgraded because got-scraping@4 is dynamically imported (see the initial issue)
"@crawlee/cheerio": "3.5.8"
}
app.js
const crawler = new CheerioCrawler(
    { /* ... */ },
    new Configuration({
        // avoid persisting storage to disk (useful on read-only serverless filesystems)
        persistStorage: false,
    }),
)
build.ts
import { build } from "esbuild"

await build({
    entryPoints: ["functions/*.ts"],
    outdir: "build",
    external: ["@azure/functions-core"],
    inject: ["shim.ts"],
    format: "esm",
    bundle: true,
    minify: true,
    keepNames: true,
    platform: "node",
    target: "node20",
    logLevel: "info",
})
shim.ts
import { createRequire } from "node:module"
import path from "node:path"
import url from "node:url"

// Recreate the CommonJS globals (require, __filename, __dirname) that some
// dependencies still expect, since the bundle is emitted as ESM.
globalThis.require = createRequire(import.meta.url)
globalThis.__filename = url.fileURLToPath(import.meta.url)
globalThis.__dirname = path.dirname(globalThis.__filename)
With this or a similar setup, crawlee is successfully bundled into a single script and runs, e.g., on Azure Functions.
+1 Running into a similar issue with Next.js:
WARN CheerioCrawler: Reclaiming failed request back to the list or queue. ENOENT: no such file or directory, open '{MY_APP}/.next/server/vendor-chunks/data_files/headers-order.json'
I'm running into this too, using CheerioCrawler and bundling for Node.

I have errors coming from the Playwright modules (missing jquery, etc.), which I don't use. But they get imported because I have to import some things from crawlee, and that brings in everything.

The problem would be greatly reduced if "sideEffects": false was used throughout the crawlee packages; esbuild picks up on this. With that in place, we'd have actual tree shaking and would be able to fix whatever is broken for each crawler incrementally, bit by bit, and not in one go.
As for a short-term fix, I'm probably gonna use pnpm patch to get rid of the stuff I don't need.
+1 on this, getting similar errors with jquery while using @crawlee/puppeteer.

I attempted the steps here: https://github.com/apify/crawlee/issues/2208#issuecomment-1987270051

This, along with some other manual copies for jquery and a few other deps, still did not resolve my bundling issues. I ended up with a cryptic error looking for a utils.js I couldn't track down.
+1 I also have the same problem when using esbuild.

However, for anyone using AWS CDK and esbuild to deploy their crawlee Lambda, you can use the bundling.nodeModules option to not bundle crawlee at all (I know it's not a solution, but I hope it helps someone avoid the frustration I went through).
For example:
import { aws_lambda as lambda, aws_lambda_nodejs as nodejs } from "aws-cdk-lib";

const crawleeFn = new nodejs.NodejsFunction(this, "CrawleeLambda", {
    runtime: lambda.Runtime.NODEJS_18_X,
    handler: "lambdaHandler",
    entry: "./../src/main.ts",
    bundling: {
        // install crawlee into node_modules instead of bundling it
        nodeModules: ["crawlee"],
        externalModules: ["@aws-sdk/*", "aws-lambda", "stream"],
    },
});
I'm also having a problem with jquery when trying to run my application on AWS Lambda.
Which package is the feature request for? If unsure which one to select, leave blank
@crawlee/cheerio (CheerioCrawler)
Feature
I've been trying to deploy a scraper on AWS Lambda by bundling with esbuild, but I keep running into errors with modules: https://github.com/apify/crawlee/discussions/2199

I've tried importing from @crawlee/cheerio, and since a browser isn't used, I figured this would be quite doable, but there have been challenges. I've also tried a few different permutations of turning persistStorage off, as I thought that could be related to the header-generator error below. E.g.:

npx esbuild ./src/aws/lib/lambda/scraper.handler.ts --bundle --outfile=./src/aws/lib/lambda/scraper.js --platform=node --keep-names

then run the file in Node.

The culprit for this one seems to be the lazily loaded got-scraping, but I don't know why the headers file is being read at all. Minified, the bundle is ~2.1 MB, which isn't bad; better than a layer!
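A minimal sketch of the workaround that worked above for Azure Functions, adapted to this Lambda setup: bundle with esbuild, then copy header-generator's data_files next to the output so the runtime file read succeeds. The entry and output paths here are placeholders; only the copy step mirrors what the thread describes.

// build.ts — sketch; paths are hypothetical
import { build } from "esbuild";
import { cpSync } from "node:fs";

await build({
    entryPoints: ["src/aws/lib/lambda/scraper.handler.ts"],
    outfile: "dist/scraper.js",
    bundle: true,
    platform: "node",
    keepNames: true,
});

// header-generator reads data_files/headers-order.json at runtime,
// so ship the directory alongside the bundle.
cpSync("node_modules/header-generator/data_files", "dist/data_files", {
    recursive: true,
});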
Motivation
It would make deployment way easier for consumers of crawlee, especially on serverless infra.
Ideal solution or implementation, and any additional constraints
I'm not an expert at bundling or writing JS libs, but I don't think it's too far off from being bundleable. Lazy loading seems to be an issue, as well as clearly labelling which packages things should be imported from (i.e. importing from 'crawlee' vs '@crawlee/cheerio' pulls in the puppeteer deps, jsdom, and jquery; see the link in the description). These aren't needed for Cheerio, I believe.
Alternative solutions or implementations
No response
Other context
No response