apify / browser-pool

A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.

Proxy cache for fingerprints #64

Closed mnmkng closed 2 years ago

mnmkng commented 2 years ago

I'm not sure if it's the best solution. Right now, when I use the default generic Apify proxy URL, it will always give me the same fingerprint, but the IPs will auto-rotate. It also gives the user no control over the use of the fingerprints. I like the got-scraping sessionToken solution better and I think we should do it this way before we make fingerprints the default in the SDK.

B4nan commented 2 years ago

I guess this one is still valid and should be prioritized, given we want to enable fingerprints by default? Or has something changed? cc @petrpatek

petrpatek commented 2 years ago

Yes, it is. Maybe you guys can do it the same as in got-scraping?

szmarczak commented 2 years ago

Here's the cache: https://github.com/apify/browser-pool/blob/308c4ef0a7615b5ffdb4a019bd195774cb78ee59/src/fingerprinting/hooks.ts#L25

Since the fingerprint should be per-page, I think we need to merge createFingerprintPreLaunchHook into createPrePageCreateHook first.

Then, we should create a WeakMap here. The logic needs to be replaced with something like this:

const defaultToken = {}; // WeakMap doesn't accept Symbols yet
...

const token = pageOptions.sessionToken ?? defaultToken;
if (!weakCache.has(token)) {
    weakCache.set(token, fingerprintGenerator.getFingerprint...);
}

const fingerprint = weakCache.get(token);

I don't think there's a better way to pass the sessionToken than via pageOptions, but I'm open to other ideas; two heads are better than one :) We would need to update the pageOptions typings accordingly.

A Session needs to be passed to sessionToken.
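
For illustration, the token-keyed cache described above could be sketched roughly as follows. This is a minimal sketch, not the browser-pool implementation: Fingerprint, PageOptions, generateFingerprint and fingerprintFor are hypothetical stand-ins, and the real generator call (elided in the snippet above) is replaced by a counter so that the caching behaviour is visible.

```typescript
// Hypothetical stand-ins; the real browser-pool and fingerprint generator types differ.
type Fingerprint = { id: number };
interface PageOptions { sessionToken?: object }

let generated = 0;
// Placeholder for the real generator call, which is elided in the thread.
const generateFingerprint = (): Fingerprint => ({ id: ++generated });

// Pages that pass no sessionToken all share this one key.
const defaultToken = {};
// Entries become collectable once their token object is unreachable.
const weakCache = new WeakMap<object, Fingerprint>();

function fingerprintFor(pageOptions: PageOptions): Fingerprint {
    const token = pageOptions.sessionToken ?? defaultToken;
    let fingerprint = weakCache.get(token);
    if (!fingerprint) {
        fingerprint = generateFingerprint();
        weakCache.set(token, fingerprint);
    }
    return fingerprint;
}

// Any object can act as a token; here two plain objects play the role of two Sessions.
const sessionA = {};
const sessionB = {};
const a1 = fingerprintFor({ sessionToken: sessionA });
const a2 = fingerprintFor({ sessionToken: sessionA });
const b1 = fingerprintFor({ sessionToken: sessionB });
const d1 = fingerprintFor({});
const d2 = fingerprintFor({});

console.log(a1 === a2); // true: same token, same fingerprint
console.log(a1 === b1); // false: different tokens, different fingerprints
console.log(d1 === d2); // true: both fall back to defaultToken
```

Because the cache is a WeakMap keyed by the token object, a fingerprint lives only as long as its Session does, so no explicit eviction is needed.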

barjin commented 2 years ago

This sounds like a rather simple fix to me (I'm probably missing something). createFingerprintPreLaunchHook works with launchContext, which - when enhanced - contains the current Session -> let's just use session.id as the fingerprintCache key?

This fixes @mnmkng's dynamic proxy server problem and lets the user manage the fingerprint usage via sessionPoolOptions.

...what is it that I am missing? :)
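
A rough sketch of the session.id idea, for comparison. The shapes of Session and LaunchContext here are simplified assumptions, and generateFingerprint and fingerprintForLaunch are hypothetical names; only the idea of keying fingerprintCache by session.id comes from the comment above.

```typescript
// Simplified, assumed shapes; the real launchContext carries much more.
type Fingerprint = { id: number };
interface Session { id: string }
interface LaunchContext { session?: Session; proxyUrl?: string }

let generated = 0;
// Placeholder for the real fingerprint generator.
const generateFingerprint = (): Fingerprint => ({ id: ++generated });

// Keyed by session.id instead of the proxy URL.
const fingerprintCache = new Map<string, Fingerprint>();

function fingerprintForLaunch(launchContext: LaunchContext): Fingerprint {
    const sessionId = launchContext.session?.id;
    // Assumption: without a session there is nothing stable to key on,
    // so a fresh, uncached fingerprint is returned.
    if (sessionId === undefined) return generateFingerprint();

    let fingerprint = fingerprintCache.get(sessionId);
    if (!fingerprint) {
        fingerprint = generateFingerprint();
        fingerprintCache.set(sessionId, fingerprint);
    }
    return fingerprint;
}

// Two sessions behind the same generic proxy URL (placeholder value).
const proxyUrl = 'http://proxy.example.com:8000';
const s1: Session = { id: 'session_1' };
const s2: Session = { id: 'session_2' };

const f1 = fingerprintForLaunch({ session: s1, proxyUrl });
const f1Again = fingerprintForLaunch({ session: s1, proxyUrl });
const f2 = fingerprintForLaunch({ session: s2, proxyUrl });

console.log(f1 === f1Again); // true: same session keeps its fingerprint
console.log(f1 === f2);      // false: same proxy URL, different sessions
```

One trade-off against the WeakMap variant: a Map keyed by strings keeps its entries until they are deleted, so retired sessions would need explicit cleanup.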