barjin closed this 1 year ago
Side note - as per discussion with @petrpatek - if fingerprints are enabled on a Mac, it should generate a corresponding macOS fingerprint. When I was running it locally, though, I got the following:
DEBUG FingerprintInjector: Using fingerprint {"fingerprint":{"screen":{"availHeight":800,"availWidth":600,"pixelDepth":24,"height":800,"width":600},"webGl":{"vendor":"Google Inc.","renderer":"Google SwiftShader"},"audioCodecs":{"ogg":"probably","mp3":"probably","wav":"probably","m4a":"","aac":""},"videoCodecs":{"ogg":"probably","h264":"","webm":"probably"},"pluginsData":{},"navigator":{"cookieEnabled":true,"doNotTrack":"1","language":"en-US","languages":["en-US"],"platform":"Linux x86_64","deviceMemory":8,"hardwareConcurrency":16,"productSub":"20030107","vendor":"Google Inc.","maxTouchPoints":0},"batteryData":{"level":0.25,"chargingTime":322,"dischargingTime":null},"userAgent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36"}}
note the
"height":800,"width":600
"platform":"Linux x86_64"
"vendor":"Google Inc."
Since we already dealt with @barjin's observations and they turned out to be fine, I checked what @AndreyBykov saw. I am not sure about the "should generate corresponding macOS fingerprint" part, since I only make the generators themselves, not the way they are used, but the fingerprint you observed is completely fine data-wise. Out of a sample of the 1000 most recent fingerprints collected on our website, 22 had exactly these parameters, so it is not a mistake of the generators, but possibly an artifact of the collected data. Maybe this is a bot that has its fingerprint values automatically generated like ours? @petrpatek do you think that's the case, or might this be a realistic combination of values?
As for the screen size, we can certainly add a minimum screen resolution or an exact screen resolution (or even both) as one of the inputs. Should we do that? Which option would be more comfortable for you to use?
Note: A new version of the fingerprint packages has been published with new ML models. Let's see whether this solves any aforementioned problems.
I think that the problem with Linux fingerprints generated on macOS might be the new M1+ laptops, because browser-pool tries to fill in the default configuration for the platform it runs on, and I think the new M1 laptops report something other than "macos" in the process info.
browser-pool tries to fill the default configuration
this is true, it happens here, but from what I have seen, Node on M1s still has process.platform === 'darwin'
The FingerprintGenerator first uses HeaderGenerator to create browser headers including a User-Agent string - which it then uses to generate the rest of the fingerprint.
All the parameters passed to the FingerprintGenerator are consumed by the HeaderGenerator - and @AndreyBykov's User-Agent is macOS-like. This seems like a "problem" in the collected fingerprint data where this User-Agent / platform combination actually existed (see @Equidem's response).
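To make the User-Agent / platform mismatch easier to spot, here is a minimal sketch of a consistency check. It is not part of fingerprint-generator or header-generator; the helper names (osFromUserAgent, osFromPlatform, isPlatformConsistent) are hypothetical, and the substring rules are a rough assumption about how User-Agent and navigator.platform strings usually look. The sample object copies only the two relevant fields from the debug log above.

```javascript
// Hypothetical helpers for illustration only; none of these exist in the fingerprint packages.

// Map a User-Agent string to a coarse OS family (order matters: mobile checks come first,
// because Android UAs also contain "Linux" and iOS UAs contain "like Mac OS X").
function osFromUserAgent(userAgent) {
    if (/Android/.test(userAgent)) return 'android';
    if (/iPhone|iPad|iPod/.test(userAgent)) return 'ios';
    if (/Macintosh|Mac OS X/.test(userAgent)) return 'macos';
    if (/Windows/.test(userAgent)) return 'windows';
    if (/Linux|X11/.test(userAgent)) return 'linux';
    return 'unknown';
}

// Map a navigator.platform value to the same coarse OS families.
function osFromPlatform(platform) {
    if (/^Mac/.test(platform)) return 'macos';
    if (/^Win/.test(platform)) return 'windows';
    if (/^(iPhone|iPad|iPod)/.test(platform)) return 'ios';
    if (/^Linux/.test(platform)) return 'linux'; // Android devices also report "Linux ..."
    return 'unknown';
}

// Returns false only when both values are recognized and contradict each other.
function isPlatformConsistent(fingerprint) {
    const uaOs = osFromUserAgent(fingerprint.userAgent);
    const platformOs = osFromPlatform(fingerprint.navigator.platform);
    if (uaOs === 'unknown' || platformOs === 'unknown') return true; // cannot judge
    if (uaOs === 'android') return platformOs === 'linux';
    return uaOs === platformOs;
}

// The two relevant fields from the fingerprint in the debug log.
const observed = {
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
    navigator: { platform: 'Linux x86_64' },
};

console.log(isPlatformConsistent(observed)); // false
```

Whether such a check should reject the fingerprint is exactly the open question here, since this combination does occur in the collected data.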
Now, a philosophical question - should generating such a fingerprint be considered a problem, given that such combinations actually exist? Discuss. :)
Window resolution - as some (#10) have noticed, responsive websites can change layout based on the generated screen/window size. This can easily confuse the scraper (as it's not ready for a mobile version of the site).
My proposal is to simply preprocess the data before training the ML models on it - perhaps filter out all the weird, small (but still horizontal) screen resolutions? Allowing the user to set a minimum resolution could work as well, but (imho) would need bigger changes to the generative-bayesian-network package.
cc #17
Yes, we could create some rules to filter the data and make sure we don't have bots and scrapers or any other weird stuff. I think we could do some basic filtering based on deviceMemory and hardwareConcurrency.
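A rough sketch of what such a basic filter could look like, assuming each collected record has the same navigator shape as the fingerprint in the debug log. The function name looksLikeRealDevice, the list of accepted deviceMemory values, and the core-count upper bound are all assumptions for illustration, not rules taken from the actual preprocessing code.

```javascript
// Hypothetical filter for illustration; not the project's actual preprocessing.

// Assumption: browsers that expose navigator.deviceMemory report one of these rounded values.
const PLAUSIBLE_DEVICE_MEMORY = new Set([0.25, 0.5, 1, 2, 4, 8]);

// Assumption: an arbitrary upper bound on logical cores for consumer devices.
const MAX_PLAUSIBLE_CORES = 128;

function looksLikeRealDevice(record) {
    const { deviceMemory, hardwareConcurrency } = record.navigator;

    // deviceMemory is not exposed by every browser, so a missing value is accepted;
    // only values outside the expected set are rejected.
    const memoryOk = deviceMemory === undefined
        || deviceMemory === null
        || PLAUSIBLE_DEVICE_MEMORY.has(deviceMemory);

    // hardwareConcurrency must be a positive integer within the assumed bound.
    const coresOk = Number.isInteger(hardwareConcurrency)
        && hardwareConcurrency >= 1
        && hardwareConcurrency <= MAX_PLAUSIBLE_CORES;

    return memoryOk && coresOk;
}

const records = [
    { navigator: { deviceMemory: 8, hardwareConcurrency: 16 } }, // values from the debug log
    { navigator: { deviceMemory: 3, hardwareConcurrency: 4 } },  // 3 is not an expected value
    { navigator: { deviceMemory: 4, hardwareConcurrency: 0 } },  // zero cores is not plausible
];

console.log(records.filter(looksLikeRealDevice).length); // 1
```

Note that the fingerprint from the debug log passes this filter, so it would not catch the User-Agent / platform mismatch by itself.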
Forgot to mention - in the master branch, the training data is now getting preprocessed - this should help at least against the weird 800x600 display sizes. If there are no problems with this (so far, I didn't notice anything), we can get this into the stable branch.
@barjin Why don't you do the filtering in the prepareRecords function? It already does some filtering of the data, so it seems like a natural place to concentrate all the filters.
Smart! That's why it's not in the stable version just yet :) Will fix that, thanks for noticing.
I guess the "vertical" resolution should have been solved by now. Well - I just rewrote the first ABC scraper to Crawlee, and it still happens. I don't have any debug logs or specific fingerprints; I see it from the number of results in the output, where some pages clearly load this "vertical" layout. I guess I will switch off the fingerprint generator for now :/
Maybe we should just ditch the injection of resolution (at least by default)? It seems like a weird idea to me. Given that websites will have different layouts based on it, you just add more randomness to the results.
Oh, my bad @AndreyBykov, I was so focused on the 800x600 resolution bug that I didn't account for this. Will fix over the weekend, tops.
@B4nan true, screen dimensions are not the strongest fingerprint marker either. The current plan (is|was) to generate only >1280x720 landscape resolutions, which shouldn't affect the websites' layouts (1280x720 is Playwright's default resolution IIRC, that might get blocked more often). The fix is simple and should be definitive, but removing it altogether is no big deal either.
@AndreyBykov the latest stable generator version (2.0.5) should generate only landscape >1280x720 screen sizes (if devices: mobile is not selected). Feel free to try it out and leave feedback on how well it works. Thank you and sorry for the wait!
@barjin So - just ran a test, and it seems like the minimum 1280x720 resolution does the trick. I had only 2 runs, but usually, it was enough, as at least a few pages would be loaded with the portrait viewport.
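For reference, the rule described above (landscape only, no smaller than 1280x720) can be sketched as a simple predicate. This is an illustration, not the code shipped in 2.0.5: the name isAcceptableDesktopScreen is hypothetical, and treating 1280x720 itself as allowed is an assumption, since the thread says both ">1280x720" and "minimum 1280x720".

```javascript
// Hypothetical predicate for illustration; not the generator's actual implementation.
const MIN_WIDTH = 1280;
const MIN_HEIGHT = 720;

// Accept only landscape screens that are at least MIN_WIDTH x MIN_HEIGHT.
// Assumption: the boundary value 1280x720 itself is accepted.
function isAcceptableDesktopScreen(screen) {
    const isLandscape = screen.width > screen.height;
    return isLandscape && screen.width >= MIN_WIDTH && screen.height >= MIN_HEIGHT;
}

const screens = [
    { width: 600, height: 800 },   // portrait screen from the debug log: rejected
    { width: 800, height: 600 },   // landscape, but too small: rejected
    { width: 1920, height: 1080 }, // accepted
];

const kept = screens.filter(isAcceptableDesktopScreen);
console.log(kept.length); // 1
```

Applying the predicate to training records before fitting the model, or to generated fingerprints afterwards, are both options; the thread suggests the former.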
Recently, I have come across complaints (ty @B4nan, @AndreyBykov) about suspicious screen dimensions in the fingerprint-injected browsers. Those complaints were mainly about generating vertical screen dimensions for desktop devices. While there are real-life situations where a desktop computer can have a tall screen (using a vertical display, for example), it really shouldn't be frequent. The following experiments might hint at more serious problems with the way the generative networks work.
The following charts show how the distribution differs between training data and generated fingerprints.
Platform distribution (not a problem anymore, just check out how well it works)
The culprit was the devices: ['desktop', 'mobile'] setting; the generator generates only desktop fingerprints by default.
Note how the generated distribution is skewed towards desktop platforms. This leads to a problem when trying to generate mobile platforms with "mobile" OSs - fg.getFingerprint({operatingSystems: ['android','ios'] }) ends with Error: No headers based on this input can be generated.
Vertical screen distribution (no real problem here anymore either, just some cool charts :) )
A vertical screen was detected in 11.77 % of collected samples.
A vertical screen was detected in 11.98 % of generated data.
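The percentages above can be reproduced with a small helper like the one below. This is a sketch under assumptions: the function name verticalScreenShare is hypothetical, "vertical" is taken to mean height greater than width, and the sample data is made up purely to show the calculation.

```javascript
// Hypothetical helper for illustration; the sample data below is invented.

// Returns the fraction (0..1) of fingerprints whose screen is taller than it is wide.
function verticalScreenShare(fingerprints) {
    if (fingerprints.length === 0) return 0;
    const vertical = fingerprints.filter(({ screen }) => screen.height > screen.width).length;
    return vertical / fingerprints.length;
}

const sample = [
    { screen: { width: 1920, height: 1080 } },
    { screen: { width: 1366, height: 768 } },
    { screen: { width: 600, height: 800 } }, // vertical
    { screen: { width: 2560, height: 1440 } },
];

console.log(`${(verticalScreenShare(sample) * 100).toFixed(2)} %`); // 25.00 %
```

Running the same function over the collected samples and over a batch of generated fingerprints gives the two numbers to compare.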
While the results of this experiment don't show any specific problem, problems still show up here and there (see the example in the comments). Without having looked into the Bayesian network internals much, might there be a problem with the fingerprint preprocessing, perhaps?
CC @petrpatek @Equidem do you guys have any idea why this might be happening?
Edit: I didn't know how the generator works; the examples show no real problems now :)