apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

`useIncognitoPages` doesn't rotate fingerprints #2310

Open mnmkng opened 7 months ago

mnmkng commented 7 months ago

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

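If you run the code below with `useIncognitoPages: true`, every request is sent with the same browser fingerprint (the same user agent). If you comment out `useIncognitoPages` and uncomment `maxOpenPagesPerBrowser: 1` (one page per browser), each request gets a different user agent.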

Code sample

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    browserPoolOptions: {
        useFingerprints: true,
        // maxOpenPagesPerBrowser: 1,
    },
    launchContext: {
        useIncognitoPages: true,
    },
    preNavigationHooks: [
        async ({ page }) => {
            // Print the headers of the first request the page sends,
            // so the user agent in use is visible for each URL.
            page.once('request', async (req) => {
                try {
                    const headers = await req.allHeaders();
                    console.dir(headers);
                } catch (e) {
                    console.log('req inspection failed');
                }
            });
        },
    ],
    requestHandler: async ({ page, log }) => {
        const text = await page.innerText('pre');
        log.info(text);
    },
});

await crawler.run([
    'https://api.ipify.org?format=json&a',
    'https://api.ipify.org?format=json&b',
    'https://api.ipify.org?format=json&c',
    'https://api.ipify.org?format=json&d',
    'https://api.ipify.org?format=json&e',
    'https://api.ipify.org?format=json&f',
]);

Package version

3.7.2

Node.js version

18

Operating system

MacOS


barjin commented 6 months ago

Seems like a sign of a much larger underlying issue:

New sessions / fingerprints / proxyUrls are generated only on a browser launch.
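To make that claim concrete, here is a toy model of the behavior. It is only a sketch: `ToyBrowser`, `newIncognitoPage`, and the two `crawl...` helpers are hypothetical names made up for illustration and are not Crawlee's or Playwright's API. The only assumption it encodes is the one stated above, that a fingerprint is chosen once per browser launch.

```javascript
// Toy model only: hypothetical classes, NOT Crawlee's implementation.
let launches = 0;

class ToyBrowser {
    constructor() {
        // Assumption from the comment above: the fingerprint is
        // picked once, at browser launch time.
        launches += 1;
        this.fingerprint = `fingerprint-${launches}`;
    }

    // An incognito context isolates cookies and storage, but in this
    // model it inherits the fingerprint chosen at launch.
    newIncognitoPage() {
        return { fingerprint: this.fingerprint };
    }
}

// One browser, one incognito page per request.
function crawlWithIncognitoPages(requestCount) {
    const browser = new ToyBrowser();
    return Array.from({ length: requestCount }, () => browser.newIncognitoPage().fingerprint);
}

// A fresh browser for every request.
function crawlWithOneBrowserPerRequest(requestCount) {
    return Array.from({ length: requestCount }, () => new ToyBrowser().newIncognitoPage().fingerprint);
}

const incognito = crawlWithIncognitoPages(6);
const perBrowser = crawlWithOneBrowserPerRequest(6);

console.log(new Set(incognito).size);  // 1 distinct fingerprint
console.log(new Set(perBrowser).size); // 6 distinct fingerprints
```

In this model, six requests through incognito pages share one fingerprint, while six requests through six browsers get six, which matches what the repro in the original report shows.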

The following snippet doesn't rotate the fingerprints correctly: all requests are made with a single session. This is because useIncognitoPages was written with Playwright contexts in mind. We relied on the "newPage() creates a separate environment" invariant, so all the pages/contexts are opened in one browser, and that browser is launched only once.

sessionPoolOptions: {
    sessionOptions: {
        maxUsageCount: 1,
    },
},
launchContext: {
    useIncognitoPages: true,
},

The following snippet rotates the fingerprints correctly:

sessionPoolOptions: {
    sessionOptions: {
        maxUsageCount: 1,
    },
},
launchContext: {
    useIncognitoPages: false,
},

This works because an "expired" session throws away the whole browser instance, so the next page has to launch a whole new browser (compare maxOpenPagesPerBrowser, which has the same effect). It is very expensive, though: launching and closing a context 100 times in one browser takes ~3.9 seconds, while launching and closing a browser 100 times takes 40 seconds.
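For a rough sense of scale, the two timings above can be turned into per-cycle costs. This is plain arithmetic on the quoted numbers, nothing was re-measured; the variable names are mine.

```javascript
// Timings quoted above, each for 100 launch/close cycles.
const cycles = 100;
const contextTotalMs = 3900;  // ~3.9 s: 100 contexts in one browser
const browserTotalMs = 40000; // 40 s: 100 separate browsers

const perContextMs = contextTotalMs / cycles; // 39 ms per context
const perBrowserMs = browserTotalMs / cycles; // 400 ms per browser
const slowdown = browserTotalMs / contextTotalMs;

console.log(perContextMs, perBrowserMs, slowdown.toFixed(1)); // 39 400 10.3
```

So rotating by relaunching the browser costs roughly ten times more per request than opening a new context would.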

The entire browser-pool and session rotation logic is quite convoluted and worth a total rewrite.