mnmkng opened this issue 2 years ago
Another thing to consider is that `anonymizeProxy` starts a separate server with each anonymization. This is problematic because we could eventually run out of ports, so we must not forget to close the proxy when the context is closed. It might also be worth checking the performance overhead of starting a server for each context; if it's too large, we could use the built-in username and password options of Playwright. Puppeteer does not support this, so no optimizations there.
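The lifecycle concern above can be sketched as follows. This is a minimal runnable mock, not real integration code: `proxyChain` stands in for the real `proxy-chain` package (`anonymizeProxy`/`closeAnonymizedProxy`), and the browser object only mimics Playwright's `newContext`, so the pairing pattern is visible on its own.

```js
// Mock of proxy-chain's API: each anonymization opens one local server (port).
let nextPort = 20000;
const openServers = new Set();

const proxyChain = {
    async anonymizeProxy(upstreamUrl) {
        const url = `http://127.0.0.1:${nextPort++}`;
        openServers.add(url);
        return url;
    },
    async closeAnonymizedProxy(url) {
        openServers.delete(url);
    },
};

// Pair each context with its own anonymized proxy and close both together,
// so a closed context can never leak its local proxy server (and its port).
async function createProxiedContext(browser, upstreamUrl) {
    const anonymizedUrl = await proxyChain.anonymizeProxy(upstreamUrl);
    const context = await browser.newContext({ proxy: { server: anonymizedUrl } });
    const originalClose = context.close.bind(context);
    context.close = async () => {
        await originalClose();
        await proxyChain.closeAnonymizedProxy(anonymizedUrl);
    };
    return context;
}
```

The important part is only the wrapping of `close`: whatever creates the context must also own the teardown of its proxy server.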
> Puppeteer allows creation of incognito context with `browser.createIncognitoBrowserContext(options)`. The default is persisted.

This is not the case anymore: https://github.com/apify/browser-pool/pull/51

> Therefore I suggest renaming `pageOptions` to `contextOptions` wherever we use them and use those to set the proxy per page.

:+1: I wonder if there are cases where multiple pages with a single context are required 🤔 (doubt it)
> Puppeteer does not support this, so no optimizations there.

We can use `page.authenticate`.
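The `page.authenticate` route can be sketched like this. The two helpers below are hypothetical (not part of Puppeteer); they only show how an authenticated upstream proxy URL splits into a credential-free `--proxy-server` launch argument plus credentials for `page.authenticate()`:

```js
// Chromium's --proxy-server flag does not accept credentials in the URL,
// which is the whole reason proxy-chain's local server exists. With
// Puppeteer we can instead strip the credentials for the launch arg...
function buildProxyLaunchArg(proxyUrl) {
    return `--proxy-server=${new URL(proxyUrl).origin}`;
}

// ...and feed them to page.authenticate() on each new page.
function extractProxyCredentials(proxyUrl) {
    const { username, password } = new URL(proxyUrl);
    return {
        username: decodeURIComponent(username),
        password: decodeURIComponent(password),
    };
}
```

With real Puppeteer this would be roughly `puppeteer.launch({ args: [buildProxyLaunchArg(url)] })` followed by `await page.authenticate(extractProxyCredentials(url))` on every new page.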
> I wonder if there are cases where multiple pages with a single context are required 🤔 (doubt it)

Yeah, not in crawlers I guess.
> We can use `page.authenticate`.

Yeah we can try, but if you do, please do some perf tests.
> Yeah we can try, but if you do, please do some perf tests.

Will do. I'm pretty sure it'll be faster, since there's no need to create the server and there will be less TCP overhead.
Also there is a slight issue with Puppeteer: for non-incognito contexts, it must be launched with `--proxy-server`; there's no other way currently.
Then we'll have to live with it. I'm fine if the Puppeteer implementation lags behind Playwright a bit. Playwright is better and people should use that in most cases.
Merged https://github.com/apify/browser-pool/pull/53, however passing `contextOptions` isn't compatible with TypeScript yet (need to cast as `any`), as it expects `newPage` to be exactly like the original Puppeteer `newPage` function.
TypeScript support in https://github.com/apify/browser-pool/pull/54
It's now possible to do proxy per page via hooks (`browser-pool@3.0.3`): https://github.com/apify/browser-pool/blob/c345766a7fabacd98bfaad1a5c13f4a3b0f8af12/test/browser-pool.test.ts#L412-L480
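A distilled version of what the linked test does might look like this. The hook signature `(pageId, browserController, pageOptions)` follows browser-pool's `prePageCreateHooks`; the rest is a plain-object mock so the rotation logic runs on its own:

```js
// Rotate through a pool of proxies: each new page (i.e. each new incognito
// context) gets the next proxy injected into its creation options.
const proxyPool = [
    'http://proxy-1.example:8000',
    'http://proxy-2.example:8000',
];
let pageCounter = 0;

// Assumed hook shape: mutate pageOptions before the page/context is created.
const rotateProxyHook = (pageId, browserController, pageOptions) => {
    if (pageOptions) {
        pageOptions.proxy = { server: proxyPool[pageCounter++ % proxyPool.length] };
    }
};
```

In real code the hook goes into `browserPoolOptions.prePageCreateHooks` and the injected `proxy` option is then consumed by Playwright when the incognito context is created.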
Playwright Chrome on Windows requires a global proxy server to be present even if it won't be used 🤷🏼‍♂️ So we have to use `proxy-chain` for this one. It's just one server per entire app.
> implement this on the SDK level. The crawlers should use `prePageCreate` hooks to set new proxy to the context.

How should this look in particular? Can we just expose `prePageCreateHooks`?
I would like the following code to return 6 different IPs. Now it works the same with `useIncognitoPages: true|false`: it always uses one browser and one IP.

In SDK v3 (~January 2022), we would like to switch `useIncognitoPages: true` on as the default. That will help with a lot of issues and will make it work exactly the same as `CheerioCrawler`, which also uses a different proxy for each request (unless a session is provided).
```js
const Apify = require('apify');

Apify.main(async () => {
    const requestList = await Apify.openRequestList(null, [
        'https://api.apify.com/v2/browser-info?a=1',
        'https://api.apify.com/v2/browser-info?a=2',
        'https://api.apify.com/v2/browser-info?a=3',
        'https://api.apify.com/v2/browser-info?a=4',
        'https://api.apify.com/v2/browser-info?a=5',
        'https://api.apify.com/v2/browser-info?a=6',
    ]);
    const proxyConfiguration = await Apify.createProxyConfiguration();
    const ips = [];
    const crawler = new Apify.PlaywrightCrawler({
        requestList,
        proxyConfiguration,
        launchContext: {
            useIncognitoPages: true,
        },
        handlePageFunction: async ({ page }) => {
            const el = await page.$('pre');
            const json = await el.textContent();
            const { clientIp } = JSON.parse(json);
            ips.push(clientIp);
        },
    });
    await crawler.run();
    console.log('Used IPs:');
    console.dir(ips);
});
```
Implementation-wise, the `BrowserCrawler` now ignores the `session` received from `BasicCrawler` and instead manages its own sessions, because we could not start a new browser for each `Request`. But now we only need to start a new context, which should enable us to unify the implementation across all crawlers.
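The unification could be sketched like this; the per-session context cache below is hypothetical (none of these names come from the SDK), and only illustrates the idea that rotating the session rotates the proxy without launching a new browser:

```js
// One incognito context per session: a new session means a new context
// (and thus a new proxy), not a new browser.
const contextsBySession = new Map();

async function getContextForSession(browser, session) {
    if (!contextsBySession.has(session.id)) {
        contextsBySession.set(session.id, await browser.newContext({
            proxy: { server: session.proxyUrl },
        }));
    }
    return contextsBySession.get(session.id);
}
```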
I'm happy to have a call with you any time to explain the details if you want.
Btw, `prePageCreateHooks` are already exposed, you can use them like this:
```js
const crawler = new Apify.PlaywrightCrawler({
    // ...
    browserPoolOptions: {
        prePageCreateHooks: [
            async () => { /* hook */ },
        ],
    },
});
```
Having to launch a new browser to switch a proxy is a huge performance overhead and might bring other issues. Setting a proxy per page/context is now available in both Playwright and Puppeteer.
While looking into this feature I collected the following info:

- We use `pageOptions` in code and examples, but in reality those `pageOptions` are `contextOptions`. Neither Puppeteer nor Playwright accept any options in `context.newPage()`.
- Puppeteer allows creation of an incognito context with `browser.createIncognitoBrowserContext(options)`. The default is persisted.
- Playwright accepts context options in `browser.newContext(contextOptions)`, or `browser.newPage(contextOptions)` as a shortcut. We can opt in to persisted contexts when launching the browser.
- We have a `useIncognitoPages` top-level option, which sets the type of context.

Now, if I'm not wrong, we have some kind of options for both libraries when creating contexts, and in neither of the libs are there any kind of page options. Therefore I suggest renaming `pageOptions` to `contextOptions` wherever we use them and using those to set the proxy per page.

We also already have `prePageCreate` hooks which expose the ~~`pageOptions`~~ `contextOptions`, so we should be able to use those to set proxies per context.

In the end it looks like we only need to clarify the docs here and implement this on the SDK level. The crawlers should use `prePageCreate` hooks to set a new proxy on the context.

Does it all make sense? cc @szmarczak @B4nan
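The proposed flow can be sketched end to end: build the `contextOptions`, run the `prePageCreate` hooks over them (where the SDK would set the session's proxy), and only then create the context. The `runPrePageCreateHooks` helper and the example hook are hypothetical; only the `contextOptions`/`prePageCreate` names come from the post:

```js
// Run every prePageCreate hook over the contextOptions before the
// context is created, so hooks can set a per-context proxy.
async function runPrePageCreateHooks(hooks, contextOptions) {
    for (const hook of hooks) {
        await hook(contextOptions);
    }
    return contextOptions;
}

// Example hook the SDK could register for a given session.
const setSessionProxyHook = async (contextOptions) => {
    contextOptions.proxy = { server: 'http://session-proxy.example:8000' };
};
```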