mnmkng opened this issue 2 years ago
Another thing to consider is that `anonymizeProxy` starts a separate server with each anonymization. This is problematic because we could eventually run out of ports, so we must not forget to close the proxy when the context is closed. It might also be worth checking the performance overhead of starting a server for each context; if it's too large, we could use the built-in username and password options of Playwright. Puppeteer does not support this, so no optimizations there.
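The lifecycle concern above can be sketched as follows. This is a minimal runnable mock, not real integration code: `proxyChain` stands in for the real `proxy-chain` package (`anonymizeProxy`/`closeAnonymizedProxy`), and the browser object only mimics Playwright's `newContext`, so the pairing pattern is visible on its own.

```js
// Mock of proxy-chain's API: each anonymization opens one local server (port).
let nextPort = 20000;
const openServers = new Set();

const proxyChain = {
    async anonymizeProxy(upstreamUrl) {
        const url = `http://127.0.0.1:${nextPort++}`;
        openServers.add(url);
        return url;
    },
    async closeAnonymizedProxy(url) {
        openServers.delete(url);
    },
};

// Pair each context with its own anonymized proxy and close both together,
// so a closed context can never leak its local proxy server (and its port).
async function createProxiedContext(browser, upstreamUrl) {
    const anonymizedUrl = await proxyChain.anonymizeProxy(upstreamUrl);
    const context = await browser.newContext({ proxy: { server: anonymizedUrl } });
    const originalClose = context.close.bind(context);
    context.close = async () => {
        await originalClose();
        await proxyChain.closeAnonymizedProxy(anonymizedUrl);
    };
    return context;
}
```

The important part is only the wrapping of `close`: whatever creates the context must also own the teardown of its proxy server.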
> Puppeteer allows creation of incognito context with `browser.createIncognitoBrowserContext(options)`. The default is persisted.

This is not the case anymore: https://github.com/apify/browser-pool/pull/51

> Therefore I suggest renaming `pageOptions` to `contextOptions` wherever we use them and use those to set the proxy per page.

:+1: I wonder if there are cases where multiple pages with a single context are required 🤔 (doubt it)
> Puppeteer does not support this, so no optimizations there.

We can use `page.authenticate`.
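The `page.authenticate` route can be sketched like this. The two helpers below are hypothetical (not part of Puppeteer); they only show how an authenticated upstream proxy URL splits into a credential-free `--proxy-server` launch argument plus credentials for `page.authenticate()`:

```js
// Chromium's --proxy-server flag does not accept credentials in the URL,
// which is the whole reason proxy-chain's local server exists. With
// Puppeteer we can instead strip the credentials for the launch arg...
function buildProxyLaunchArg(proxyUrl) {
    return `--proxy-server=${new URL(proxyUrl).origin}`;
}

// ...and feed them to page.authenticate() on each new page.
function extractProxyCredentials(proxyUrl) {
    const { username, password } = new URL(proxyUrl);
    return {
        username: decodeURIComponent(username),
        password: decodeURIComponent(password),
    };
}
```

With real Puppeteer this would be roughly `puppeteer.launch({ args: [buildProxyLaunchArg(url)] })` followed by `await page.authenticate(extractProxyCredentials(url))` on every new page.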
> I wonder if there are cases where multiple pages with a single context are required 🤔 (doubt it)

Yeah, not in crawlers I guess.
> We can use `page.authenticate`.

Yeah we can try, but if you do, please do some perf tests.
> Yeah we can try, but if you do, please do some perf tests.

Will do. I'm pretty sure it'll be faster, since there's no need to create the server and there will be less TCP overhead.
Also there is a slight issue with Puppeteer: for non-incognito contexts, it must be launched with `--proxy-server`; there's no other way currently.
Then we'll have to live with it. I'm fine if the Puppeteer implementation lags behind Playwright a bit. Playwright is better and people should use that in most cases.
Merged https://github.com/apify/browser-pool/pull/53, however passing `contextOptions` isn't compatible with TypeScript yet (need to cast as `any`), as it expects `newPage` to be exactly like the original Puppeteer `newPage` function.
TypeScript support in https://github.com/apify/browser-pool/pull/54
It's now possible to do proxy per page via hooks (`browser-pool@3.0.3`): https://github.com/apify/browser-pool/blob/c345766a7fabacd98bfaad1a5c13f4a3b0f8af12/test/browser-pool.test.ts#L412-L480
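A distilled version of what the linked test does might look like this. The hook signature `(pageId, browserController, pageOptions)` follows browser-pool's `prePageCreateHooks`; the rest is a plain-object mock so the rotation logic runs on its own:

```js
// Rotate through a pool of proxies: each new page (i.e. each new incognito
// context) gets the next proxy injected into its creation options.
const proxyPool = [
    'http://proxy-1.example:8000',
    'http://proxy-2.example:8000',
];
let pageCounter = 0;

// Assumed hook shape: mutate pageOptions before the page/context is created.
const rotateProxyHook = (pageId, browserController, pageOptions) => {
    if (pageOptions) {
        pageOptions.proxy = { server: proxyPool[pageCounter++ % proxyPool.length] };
    }
};
```

In real code the hook goes into `browserPoolOptions.prePageCreateHooks` and the injected `proxy` option is then consumed by Playwright when the incognito context is created.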
Playwright Chrome on Windows requires a global proxy server to be present even if it won't be used 🤷🏼‍♂️ So we have to use `proxy-chain` for this one. It's just one server per entire app.
> implement this on the SDK level. The crawlers should use `prePageCreate` hooks to set new proxy to the context.

How should this look in particular? Can we just expose `prePageCreateHooks`?
I would like the following code to return 6 different IPs. Now it works the same with `useIncognitoPages: true|false`: it always uses one browser and one IP.

In SDK v3 (~January 2022), we would like to switch `useIncognitoPages: true` on as the default. That will help with a lot of issues and will make it work exactly the same as `CheerioCrawler`, which also uses a different proxy for each request (unless a session is provided).
```js
const Apify = require('apify');

Apify.main(async () => {
    const requestList = await Apify.openRequestList(null, [
        'https://api.apify.com/v2/browser-info?a=1',
        'https://api.apify.com/v2/browser-info?a=2',
        'https://api.apify.com/v2/browser-info?a=3',
        'https://api.apify.com/v2/browser-info?a=4',
        'https://api.apify.com/v2/browser-info?a=5',
        'https://api.apify.com/v2/browser-info?a=6',
    ]);
    const proxyConfiguration = await Apify.createProxyConfiguration();
    const ips = [];
    const crawler = new Apify.PlaywrightCrawler({
        requestList,
        proxyConfiguration,
        launchContext: {
            useIncognitoPages: true,
        },
        handlePageFunction: async ({ page }) => {
            const el = await page.$('pre');
            const json = await el.textContent();
            const { clientIp } = JSON.parse(json);
            ips.push(clientIp);
        },
    });
    await crawler.run();
    console.log('Used IPs:');
    console.dir(ips);
});
```
Implementation-wise, the `BrowserCrawler` now ignores the `session` received from `BasicCrawler` and instead manages its own sessions, because we could not start a new browser for each `Request`. But now we only need to start a new context, which should enable us to unify the implementation across all crawlers.
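The unification could be sketched like this; the per-session context cache below is hypothetical (none of these names come from the SDK), and only illustrates the idea that rotating the session rotates the proxy without launching a new browser:

```js
// One incognito context per session: a new session means a new context
// (and thus a new proxy), not a new browser.
const contextsBySession = new Map();

async function getContextForSession(browser, session) {
    if (!contextsBySession.has(session.id)) {
        contextsBySession.set(session.id, await browser.newContext({
            proxy: { server: session.proxyUrl },
        }));
    }
    return contextsBySession.get(session.id);
}
```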
I'm happy to have a call with you any time to explain the details if you want.
Btw, `prePageCreateHooks` are already exposed, you can use them like this:
```js
const crawler = new Apify.PlaywrightCrawler({
    // ...
    browserPoolOptions: {
        prePageCreateHooks: [
            async () => { /* hook */ },
        ],
    },
});
```
Having to launch a new browser to switch a proxy is a huge performance overhead and might bring other issues. Setting a proxy per page/context is now available in both Playwright and Puppeteer.
While looking into this feature I collected the following info:

- We use `pageOptions` in code and examples, but in reality those `pageOptions` are `contextOptions`. Neither Puppeteer nor Playwright accept any options in `context.newPage()`.
- Puppeteer allows creation of an incognito context with `browser.createIncognitoBrowserContext(options)`. The default is persisted.
- Playwright accepts context options in `browser.newContext(contextOptions)`, or `browser.newPage(contextOptions)` as a shortcut. We can opt in to persisted contexts when launching the browser.
- We have a `useIncognitoPages` top-level option, which sets the type of context.

Now, if I'm not wrong, we have some kind of options for both libraries when creating contexts, and in neither of the libs are there any kind of page options. Therefore I suggest renaming `pageOptions` to `contextOptions` wherever we use them and using those to set the proxy per page.

We also already have `prePageCreate` hooks which expose the ~~`pageOptions`~~ `contextOptions`, so we should be able to use those to set proxies per context.

In the end it looks like we only need to clarify the docs here and implement this on the SDK level. The crawlers should use `prePageCreate` hooks to set a new proxy on the context.

Does it all make sense? cc @szmarczak @B4nan
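The proposed flow can be sketched end to end: build the `contextOptions`, run the `prePageCreate` hooks over them (where the SDK would set the session's proxy), and only then create the context. The `runPrePageCreateHooks` helper and the example hook are hypothetical; only the `contextOptions`/`prePageCreate` names come from the post:

```js
// Run every prePageCreate hook over the contextOptions before the
// context is created, so hooks can set a per-context proxy.
async function runPrePageCreateHooks(hooks, contextOptions) {
    for (const hook of hooks) {
        await hook(contextOptions);
    }
    return contextOptions;
}

// Example hook the SDK could register for a given session.
const setSessionProxyHook = async (contextOptions) => {
    contextOptions.proxy = { server: 'http://session-proxy.example:8000' };
};
```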