Support for crawling from secondary IP address

apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Apache License 2.0

15.5k stars, 665 forks

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Feature

Hi, I see that both HttpCrawler and PuppeteerCrawler support ProxyConfiguration, which requires an HTTP proxy server. However, my use case is to crawl from a secondary IP address on the machine rather than through a proxy.

Motivation

Plain axios supports sending requests from a secondary IP address present on the machine. Example:


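const https = require('node:https');
const axios = require('axios');

// Bind outgoing sockets to a secondary IP address (and optionally a fixed local port).
const httpsAgent = new https.Agent({
    localAddress: 'x.x.x.x',
    localPort: xxxx,
});

axios.get('https://api.ipify.org', { httpsAgent })
    .then((response) => {
        console.log('HTTPS Agent: ', response.data); // prints the secondary IP address
    })
    .catch((err) => {
        console.error(err);
    });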

I was wondering whether this would be possible with Crawlee's HttpCrawler, i.e. through the got library it uses. I'm not sure whether it would be feasible with PuppeteerCrawler.
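For reference, the underlying mechanism is not specific to axios: `localAddress` is a standard Node.js socket option that an `http.Agent` or `https.Agent` passes through to every connection it opens. Below is a minimal sketch using only Node's built-in `http` module. The helper name `getFromLocalAddress` is made up for illustration, and whether Crawlee's HttpCrawler exposes an equivalent option is exactly the open question here, so this shows the general technique rather than a Crawlee API.

```javascript
const http = require('node:http');

// Hypothetical helper (not part of Crawlee, got, or axios): sends a GET request
// whose outgoing socket is bound to the given local IP address.
function getFromLocalAddress(url, localAddress) {
    // Agent options are merged into the socket connection options,
    // so `localAddress` selects the source interface for each request.
    const agent = new http.Agent({ localAddress });

    return new Promise((resolve, reject) => {
        const req = http.get(url, { agent }, (res) => {
            let body = '';
            res.setEncoding('utf8');
            res.on('data', (chunk) => { body += chunk; });
            res.on('end', () => {
                agent.destroy(); // release any sockets held by this one-off agent
                resolve(body);
            });
        });
        req.on('error', (err) => {
            agent.destroy();
            reject(err);
        });
    });
}
```

Calling `getFromLocalAddress('http://example.com/', 'x.x.x.x')` with a real secondary address in place of `x.x.x.x` would send the request from that address. For HTTPS targets, the same options apply to `https.Agent` and `https.get`.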
