apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Support for crawling from secondary IP address #2409

Open teammakdi opened 7 months ago

teammakdi commented 7 months ago

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Feature

Hi, I see that both HttpCrawler and PuppeteerCrawler support ProxyConfiguration, which requires an HTTP proxy server. However, my use case is to crawl from a secondary IP address present on the machine.

Motivation

Raw axios supports requesting from a secondary IP address present on the machine, via an https.Agent bound to a local address. Example:


```javascript
const https = require('https');
const axios = require('axios');

// Bind outgoing connections to a secondary IP address on this machine
const httpsAgent = new https.Agent({
    localAddress: 'x.x.x.x',
    localPort: xxxx
});

axios.get('https://api.ipify.org', {
  httpsAgent
})
.then(response => {
  console.log('HTTPS Agent: ', response.data); // prints the secondary IP address
})
.catch(err => {
  console.error(err);
});
```

I was wondering whether this would be possible with Crawlee's HttpCrawler, i.e. with the got library. I'm not sure whether it would be feasible with PuppeteerCrawler.

Ideal solution or implementation, and any additional constraints

-

Alternative solutions or implementations

No response

Other context

No response

teammakdi commented 6 months ago

For HttpCrawler, this was relatively easy:

```javascript
preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        gotOptions.localAddress = secondaryIpAddress;
    }
]
```

Setting gotOptions.localAddress works.

Still looking for a solution for PuppeteerCrawler.

I was able to work around it with Squid by running an HTTP proxy server, but I was hoping for a direct secondary-IP-based approach.