apify / crawlee

Crawlee is a web scraping and browser automation library for Node.js for building reliable crawlers in JavaScript and TypeScript. It extracts data for AI, LLMs, RAG, or GPTs; downloads HTML, PDF, JPG, PNG, and other files from websites; works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP; supports both headful and headless mode; and rotates proxies.
https://crawlee.dev
Apache License 2.0

HttpCrawler fails request without retries on 403 response without any Content-Type #1994

Closed: mvolfik closed this issue 10 months ago

mvolfik commented 1 year ago

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Issue description

Run the example code. The endpoint returns 403 with just

HTTP/2 403 
content-length: 0

Crawlee fills in the default content type application/octet-stream and then fails the request at https://github.com/apify/crawlee/blob/d453f9c6d224b67b2b2f99e7d4dc85b3ca71129b/packages/http-crawler/src/internals/http-crawler.ts#L744-L747:

ERROR HttpCrawler: Request failed and reached maximum retries. Error: Resource https://ftp.mvolfik.com/403-no-content-type served Content-Type application/octet-stream, but only text/html, text/xml, application/xhtml+xml, application/xml, application/json are allowed. Skipping resource.
    at HttpCrawler._abortDownloadOfBody (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:541:19)
    at HttpCrawler.postNavigationHooks (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:242:45)
    at HttpCrawler._executeHooks (/tmp/amogus/node_modules/@crawlee/basic/internals/basic-crawler.js:900:23)
    at HttpCrawler._handleNavigation (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:337:20)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async HttpCrawler._runRequestHandler (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:287:13)
    at async wrap (/tmp/amogus/node_modules/@apify/timeout/index.js:52:21) {"id":"ey5j1V00zpDYPcb","url":"https://ftp.mvolfik.com/403-no-content-type","method":"GET","uniqueKey":"https://ftp.mvolfik.com/403-no-content-type"}
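
For local testing, a response like this can be reproduced with a minimal Node server (an illustrative setup; the actual endpoint's configuration is unknown, and HTTP/1.1 is used here instead of HTTP/2, which makes no difference for the Content-Type handling):

import { createServer } from "node:http";

// Respond to every request with a 403, an empty body, and no
// Content-Type header, mirroring the response shown above.
createServer((req, res) => {
  res.writeHead(403, { "content-length": "0" });
  res.end();
}).listen(8080);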

Expected behavior: perform a standard request retry on the 403 response.

Code sample

import { HttpCrawler } from "@crawlee/http";

// A no-op requestHandler is enough: the failure happens during
// navigation, before the handler is ever called.
const crawler = new HttpCrawler({ requestHandler() {} });
await crawler.run(["https://ftp.mvolfik.com/403-no-content-type"]);
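
Until the retry behavior changes, a possible workaround is to whitelist the fallback MIME type and fail such requests manually so they are retried. A sketch, assuming additionalMimeTypes and the usual throw-to-retry behavior of the request handler interact with the retry logic as expected:

import { HttpCrawler } from "@crawlee/http";

const crawler = new HttpCrawler({
  // Accept the fallback MIME type so the Content-Type check does not
  // abort the request before the handler runs.
  additionalMimeTypes: ["application/octet-stream"],
  requestHandler({ response }) {
    // Throwing marks the request as failed, so Crawlee retries it
    // (up to maxRequestRetries) instead of skipping it outright.
    if (response.statusCode === 403) {
      throw new Error(`Blocked with status ${response.statusCode}, retrying`);
    }
  },
});
await crawler.run(["https://ftp.mvolfik.com/403-no-content-type"]);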

Package version

3.4.1

Node.js version

18.16.1

Operating system

Linux

Apify platform

I have tested this on the next release

No response

Other context

No response

foxt451 commented 1 year ago

Discussed this with @mvolfik; here is his message: "Also, what currently happens when we get a 403 response with a disallowed content type? For example, suppose some server returned all 403 blocked responses as image/jpeg, which isn't allowed in the crawler, but retrying the request with a new proxy would yield a 200 with HTML as usual. This bug report might actually apply to that scenario as well, not sure."

So the goal, I guess, is to still block unsupported response types, but to retry them first, because the block might be temporary.
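
One possible direction, sketched below with illustrative names (this is not the actual Crawlee source): make the Content-Type mismatch fatal only for successful responses, and keep it retryable for error statuses such as 403.

import { NonRetryableError } from "crawlee";

// Illustrative sketch of the proposed check, not the real
// _abortDownloadOfBody implementation.
function checkContentType(statusCode: number, contentType: string, allowedTypes: string[]) {
  if (allowedTypes.includes(contentType)) return;
  if (statusCode >= 200 && statusCode < 300) {
    // A genuinely unsupported resource: retrying cannot help.
    throw new NonRetryableError(`Unsupported Content-Type ${contentType}. Skipping resource.`);
  }
  // A blocked or error response: a retry with a new session/proxy may
  // return a supported type, so leave the request retryable.
  throw new Error(`Got ${statusCode} with Content-Type ${contentType}; will retry.`);
}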

mvolfik commented 10 months ago

Just ran into this again with Yelp. @foxt451, did you do any work on this, or can I take over?

foxt451 commented 10 months ago

> Just ran into this again with Yelp. @foxt451, did you do any work on this, or can I take over?

Hi, nope, not that I remember. You can take it.