apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.31k stars, 654 forks

Request headers not set in HTTP crawler #2108

Closed: teammakdi closed this issue 1 year ago

teammakdi commented 1 year ago

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Issue description

Request headers are not being set in the HTTP crawler even when useHeaderGenerator is set to true.

I'm not sure whether I'm missing some configuration.

Code sample

    const { HttpCrawler, Configuration } = require('crawlee')

    const config = Configuration.getGlobalConfig()
    config.set('persistStorage', false)

    const crawler = new HttpCrawler({
        requestHandlerTimeoutSecs: 30,
        async requestHandler ({ request, body, response }) {
            // prints {}
            console.log(JSON.stringify(request.headers))
        },
        preNavigationHooks: [
            async (crawlingContext, gotOptions) => {
                gotOptions.timeout = {
                    request: 10000
                }
                gotOptions.useHeaderGenerator = true
                gotOptions.headerGeneratorOptions = {
                    browsers: ['chrome'],
                    devices: ['desktop'],
                    operatingSystems: ['linux'],
                    locales: ['en-US', 'en']
                }
            }
        ],

        failedRequestHandler: async ({ request, enqueueLinks }, err) => {
            console.error(err)
        },
        errorHandler: async ({ request, response }, err) => {
            console.error(err)
        }
    })

    crawler.run([
        'https://google.com'
    ])

Package version

3.5.4

Node.js version

18.17.0

Operating system

Mac OS

Apify platform

I have tested this on the next release

No response

Other context

The above code is just a proof of concept and is not used in production environments.

B4nan commented 1 year ago

Duplicate of #1964. If you want to access the generated headers, use response.request instead.
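
The suggestion to read from response.request can be sketched roughly as follows. This is a minimal illustration, not the library's documented method: it assumes a got-style response in which the headers that were actually sent sit under response.request.options.headers (that exact path is an assumption and is not confirmed in this thread). getSentHeaders and fakeResponse are hypothetical names used only for the example.

```javascript
// Hypothetical helper: returns the headers attached to the outgoing request,
// or an empty object when the response does not expose them.
// Assumption: got-style shape, response.request.options.headers.
function getSentHeaders (response) {
    return response?.request?.options?.headers ?? {}
}

// Stand-in for a real response object, used only to illustrate the shape.
const fakeResponse = {
    request: { options: { headers: { 'user-agent': 'example-agent' } } }
}

// prints {"user-agent":"example-agent"}
console.log(JSON.stringify(getSentHeaders(fakeResponse)))

// prints {} because no request information is available
console.log(JSON.stringify(getSentHeaders({})))
```

Inside the requestHandler from the code sample, this would presumably mean logging getSentHeaders(response) rather than request.headers, since request.headers only holds headers set explicitly on the request.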

teammakdi commented 1 year ago

I created a simple web proxy to check whether the headers are being sent, and could verify that they are being set.

Thanks @B4nan
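
One way to run a similar check is a small local server that echoes back whatever headers it receives. The sketch below is not the proxy used in this thread; it swaps in a plain echo server built on Node's built-in http module. echoHeaders and createHeaderEchoServer are hypothetical names, and port 3000 in the usage comment is an arbitrary choice.

```javascript
const http = require('node:http')

// Request listener that replies with the headers it received, as JSON.
function echoHeaders (req, res) {
    const body = JSON.stringify(req.headers, null, 2)
    res.writeHead(200, { 'content-type': 'application/json' })
    res.end(body)
}

// Hypothetical helper: builds the echo server without starting it.
function createHeaderEchoServer () {
    return http.createServer(echoHeaders)
}

// Usage (not executed here):
//   createHeaderEchoServer().listen(3000)
// Then point the crawler at http://localhost:3000 and inspect the body
// it receives to see which headers were sent.
```

Because the server reports the headers as they arrive over the wire, it checks what was sent independently of what the crawler's request object exposes.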