apify / crawlee

Crawlee is a web scraping and browser automation library for Node.js for building reliable crawlers in JavaScript and TypeScript. It extracts data for AI, LLMs, RAG, or GPTs; downloads HTML, PDF, JPG, PNG, and other files from websites; works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP; supports both headful and headless mode; and rotates proxies.
https://crawlee.dev
Apache License 2.0

HttpCrawler fails request without retries on 403 response without any Content-Type #1994

Closed: mvolfik closed this issue 10 months ago

mvolfik commented 1 year ago

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Issue description

Run the example code. The endpoint returns 403 with just

HTTP/2 403 
content-length: 0

Crawlee fills in the default content type application/octet-stream and then fails the request at https://github.com/apify/crawlee/blob/d453f9c6d224b67b2b2f99e7d4dc85b3ca71129b/packages/http-crawler/src/internals/http-crawler.ts#L744-L747:

ERROR HttpCrawler: Request failed and reached maximum retries. Error: Resource https://ftp.mvolfik.com/403-no-content-type served Content-Type application/octet-stream, but only text/html, text/xml, application/xhtml+xml, application/xml, application/json are allowed. Skipping resource.
    at HttpCrawler._abortDownloadOfBody (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:541:19)
    at HttpCrawler.postNavigationHooks (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:242:45)
    at HttpCrawler._executeHooks (/tmp/amogus/node_modules/@crawlee/basic/internals/basic-crawler.js:900:23)
    at HttpCrawler._handleNavigation (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:337:20)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async HttpCrawler._runRequestHandler (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:287:13)
    at async wrap (/tmp/amogus/node_modules/@apify/timeout/index.js:52:21) {"id":"ey5j1V00zpDYPcb","url":"https://ftp.mvolfik.com/403-no-content-type","method":"GET","uniqueKey":"https://ftp.mvolfik.com/403-no-content-type"}
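
For local testing, a response like this can be reproduced with a minimal Node server (an illustrative setup; the actual endpoint's configuration is unknown, and HTTP/1.1 is used here instead of HTTP/2, which makes no difference for the Content-Type handling):

import { createServer } from "node:http";

// Respond to every request with a 403, an empty body, and no
// Content-Type header, mirroring the response shown above.
createServer((req, res) => {
  res.writeHead(403, { "content-length": "0" });
  res.end();
}).listen(8080);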

Expected behavior: perform a standard request retry on the 403 response.

Code sample

import { HttpCrawler } from "@crawlee/http";

// A no-op requestHandler is enough: the failure happens during
// navigation, before the handler is ever called.
const crawler = new HttpCrawler({ requestHandler() {} });
await crawler.run(["https://ftp.mvolfik.com/403-no-content-type"]);
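
Until the retry behavior changes, a possible workaround is to whitelist the fallback MIME type and fail such requests manually so they are retried. A sketch, assuming additionalMimeTypes and the usual throw-to-retry behavior of the request handler interact with the retry logic as expected:

import { HttpCrawler } from "@crawlee/http";

const crawler = new HttpCrawler({
  // Accept the fallback MIME type so the Content-Type check does not
  // abort the request before the handler runs.
  additionalMimeTypes: ["application/octet-stream"],
  requestHandler({ response }) {
    // Throwing marks the request as failed, so Crawlee retries it
    // (up to maxRequestRetries) instead of skipping it outright.
    if (response.statusCode === 403) {
      throw new Error(`Blocked with status ${response.statusCode}, retrying`);
    }
  },
});
await crawler.run(["https://ftp.mvolfik.com/403-no-content-type"]);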

Package version

3.4.1

Node.js version

18.16.1

Operating system

Linux

Apify platform

I have tested this on the next release

No response

Other context

No response

foxt451 commented 1 year ago

Discussed this with @mvolfik; here is his message: "Also, what currently happens when we get a 403 response with a disallowed content type? For example, suppose some server returned all 403 blocked responses as image/jpeg, which isn't allowed in the crawler, but retrying the request with a new proxy would yield a 200 with HTML as usual. This bug report might actually apply to that scenario as well, not sure."

So the goal, I guess, is to still block unsupported response types, but to retry them first, because the block might be temporary.
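
One possible direction, sketched below with illustrative names (this is not the actual Crawlee source): make the Content-Type mismatch fatal only for successful responses, and keep it retryable for error statuses such as 403.

import { NonRetryableError } from "crawlee";

// Illustrative sketch of the proposed check, not the real
// _abortDownloadOfBody implementation.
function checkContentType(statusCode: number, contentType: string, allowedTypes: string[]) {
  if (allowedTypes.includes(contentType)) return;
  if (statusCode >= 200 && statusCode < 300) {
    // A genuinely unsupported resource: retrying cannot help.
    throw new NonRetryableError(`Unsupported Content-Type ${contentType}. Skipping resource.`);
  }
  // A blocked or error response: a retry with a new session/proxy may
  // return a supported type, so leave the request retryable.
  throw new Error(`Got ${statusCode} with Content-Type ${contentType}; will retry.`);
}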

mvolfik commented 10 months ago

Just ran into this again with Yelp. @foxt451, did you do any work on this, or can I take over?

foxt451 commented 10 months ago

> Just ran into this again with Yelp. @foxt451, did you do any work on this, or can I take over?

Hi, nope, not that I remember. You can take it.