apify / fingerprint-suite

Browser fingerprinting tools for anonymizing your scrapers. Developed by Apify.
Apache License 2.0

Platform distribution in generative networks #3

Closed (barjin closed 1 year ago)

barjin commented 2 years ago

Recently, I have come across complaints (ty @B4nan, @AndreyBykov) about suspicious screen dimensions in the fingerprint-injected browsers. Those complaints were mainly about generating vertical screen dimensions for desktop devices. While there are real-life situations in which a desktop computer has a tall screen (using a vertical display, for example), this really shouldn't happen often. The following experiments might hint at more serious problems with the way the generative networks work.

The following charts show how the distribution differs between training data and generated fingerprints.

Platform distribution (not a problem anymore, just check out how well it works)

The culprit was the devices: ['desktop', 'mobile'] setting: the generator generates only desktop fingerprints by default, so mobile devices have to be enabled explicitly.

pie
  title Platform distribution in collected data
  "Win32" : 0.4627
  "MacIntel" : 0.2874
  "Linux x86_64" : 0.1413
  "iPhone": 0.0428
  "Linux armv8l": 0.0327
  "Linux aarch64": 0.0247
  "Linux armv7l": 0.006
  "iPad": 0.0014
  "Linux armv81": 0.0005
  "Linux": 0.0002
  "OpenBSD i386": 0.0001
  "PlayStation 4": 0.0001
  "Windows": 0.0001
pie
  title Platform distribution in generated data
  "Win32": 4609
  "MacIntel": 2918
  "Linux x86_64": 1339
  "iPhone": 451
  "Linux armv8l": 337
  "Linux aarch64": 254
  "Linux armv7l": 72
  "Linux armv81": 4
  "iPad": 13
  "Linux": 2
  "Windows": 1

Note how the generated distribution is skewed towards desktop platforms. This leads to a problem when generating fingerprints for mobile operating systems: fg.getFingerprint({ operatingSystems: ['android', 'ios'] }) fails with Error: No headers based on this input can be generated.
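
To put a number on how closely the two platform charts above agree, one option is the total variation distance between the collected shares and the generated counts. The following is a minimal sketch under that assumption; the names Distribution, normalize and totalVariation are illustrative and are not part of fingerprint-suite. The values are copied from the two charts.

```typescript
// Platform name -> share or count, as shown in the charts.
type Distribution = { [platform: string]: number };

const collectedShares: Distribution = {
  "Win32": 0.4627, "MacIntel": 0.2874, "Linux x86_64": 0.1413, "iPhone": 0.0428,
  "Linux armv8l": 0.0327, "Linux aarch64": 0.0247, "Linux armv7l": 0.006,
  "iPad": 0.0014, "Linux armv81": 0.0005, "Linux": 0.0002,
  "OpenBSD i386": 0.0001, "PlayStation 4": 0.0001, "Windows": 0.0001,
};

const generatedCounts: Distribution = {
  "Win32": 4609, "MacIntel": 2918, "Linux x86_64": 1339, "iPhone": 451,
  "Linux armv8l": 337, "Linux aarch64": 254, "Linux armv7l": 72,
  "Linux armv81": 4, "iPad": 13, "Linux": 2, "Windows": 1,
};

// Scale the values so that they sum to 1 (works for counts and for shares).
function normalize(dist: Distribution): Distribution {
  let total = 0;
  for (const k of Object.keys(dist)) total += dist[k] || 0;
  const out: Distribution = {};
  for (const k of Object.keys(dist)) out[k] = (dist[k] || 0) / total;
  return out;
}

// Total variation distance: half the sum of absolute differences in share.
// 0 means identical distributions, 1 means no overlap at all.
function totalVariation(a: Distribution, b: Distribution): number {
  const pa = normalize(a);
  const pb = normalize(b);
  const keys = Object.keys({ ...pa, ...pb }); // union of platform names
  let sum = 0;
  for (const k of keys) sum += Math.abs((pa[k] || 0) - (pb[k] || 0));
  return sum / 2;
}

const platformDistance = totalVariation(collectedShares, generatedCounts);
```

For the numbers in the charts, platformDistance comes out at roughly 0.01, i.e. the generated platform shares differ from the collected ones by about one percentage point in total.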

Vertical screen distribution (no real problem here anymore either, just some cool charts :) )

A vertical screen was detected in 11.77 % of collected samples.

pie
  title Platform distribution - vertical screen (in collected data)
  "iPhone": 0.36363636363636365
  "Linux armv8l": 0.27442650807136787
  "Linux aarch64": 0.20815632965165676
  "MacIntel": 0.055225148683092605
  "Linux armv7l": 0.05097706032285471
  "Linux x86_64": 0.025488530161427356
  "iPad": 0.0118946474086661
  "Win32": 0.005097706032285472
  "Linux armv81": 0.004248088360237893
  "Linux": 0.0008496176720475786

A vertical screen was detected in 11.98 % of generated data.

pie
  title Platform distribution - vertical screen (in generated data)
  "iPhone": 451
  "Linux armv8l": 331
  "Linux aarch64": 251
  "Linux armv7l": 70
  "MacIntel": 81
  "Linux armv81": 4
  "Win32": 6
  "Linux x86_64": 27
  "iPad": 13
  "Linux": 2

While the results of this experiment don't show any specific problem, occasional problems do still appear (see the example in the comments). Without having looked into the Bayesian network internals much, I suspect there might be a problem with the fingerprint preprocessing, perhaps?

CC @petrpatek @Equidem do you guys have any idea why this might be happening?

Edit: I didn't know how the generator works, the examples show no real problems now :)

AndreyBykov commented 2 years ago

Side note - as per discussion with @petrpatek - if fingerprints are enabled on a Mac, the generator should produce a corresponding macOS fingerprint. When I was running it locally, however, I got the following:

DEBUG FingerprintInjector: Using fingerprint {"fingerprint":{"screen":{"availHeight":800,"availWidth":600,"pixelDepth":24,"height":800,"width":600},"webGl":{"vendor":"Google Inc.","renderer":"Google SwiftShader"},"audioCodecs":{"ogg":"probably","mp3":"probably","wav":"probably","m4a":"","aac":""},"videoCodecs":{"ogg":"probably","h264":"","webm":"probably"},"pluginsData":{},"navigator":{"cookieEnabled":true,"doNotTrack":"1","language":"en-US","languages":["en-US"],"platform":"Linux x86_64","deviceMemory":8,"hardwareConcurrency":16,"productSub":"20030107","vendor":"Google Inc.","maxTouchPoints":0},"batteryData":{"level":0.25,"chargingTime":322,"dischargingTime":null},"userAgent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36"}}

Note the "height":800,"width":600, "platform":"Linux x86_64" and "vendor":"Google Inc." values.
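
The fingerprint above pairs a macOS User-Agent with a Linux navigator.platform. A check for this kind of mismatch could look roughly like the sketch below. The OsFamily labels, the function names and the regular expressions are illustrative assumptions, not part of fingerprint-suite, and the mapping of ARM Linux platforms to Android is only a heuristic.

```typescript
// Coarse OS families; the labels are our own, not taken from the library.
type OsFamily = "windows" | "macos" | "linux" | "android" | "ios" | "unknown";

// Guess the OS family from a User-Agent string.
// Order matters: iOS UAs contain "like Mac OS X", Android UAs contain "Linux".
function osFromUserAgent(ua: string): OsFamily {
  if (/iPhone|iPad/.test(ua)) return "ios";
  if (/Android/.test(ua)) return "android";
  if (/Macintosh|Mac OS X/.test(ua)) return "macos";
  if (/Windows/.test(ua)) return "windows";
  if (/Linux|X11/.test(ua)) return "linux";
  return "unknown";
}

// Guess the OS family from a navigator.platform value.
function osFromPlatform(platform: string): OsFamily {
  if (platform === "iPhone" || platform === "iPad") return "ios";
  if (/^Mac/.test(platform)) return "macos";
  if (/^Win/.test(platform)) return "windows";
  // Heuristic: ARM Linux platforms are treated as Android here,
  // although they can also be ARM-based Linux desktops or boards.
  if (/^Linux (arm|aarch64)/.test(platform)) return "android";
  if (/^Linux/.test(platform)) return "linux";
  return "unknown";
}

// True only when both values map to the same known OS family.
function isConsistent(ua: string, platform: string): boolean {
  const fromUa = osFromUserAgent(ua);
  return fromUa !== "unknown" && fromUa === osFromPlatform(platform);
}

// Values copied from the debug log above.
const reportedUa =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36";
const reportedMismatch = !isConsistent(reportedUa, "Linux x86_64");
```

With the logged values, reportedMismatch is true, while the same User-Agent combined with a MacIntel platform would pass the check.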

Equidem commented 2 years ago

Since we already dealt with @barjin 's observations and it turned out to be fine, I checked what @AndreyBykov saw. I am not sure about the "should generate corresponding macOS fingerprint" since I only make the generators themselves, not the way they are used, but the fingerprint you observed is completely fine data-wise. Out of a sample of 1000 most recent fingerprints collected on our website, 22 had exactly these parameters, so it is not a mistake of the generators, but possibly the collected data. Maybe this is a bot that has its fingerprint values automatically generated like ours? @petrpatek do you think that's the case, or might this be a realistic combination of values?

As for the screen size, we can certainly add minimal screen resolution or precise screen resolution (or even both) as one of the inputs. Should we do that? Which option would be more comfortable for you to use?

barjin commented 2 years ago

Note: A new version of the fingerprint packages has been published with new ML models. Let's see whether this solves any aforementioned problems.

petrpatek commented 2 years ago

I think that the problem with Linux fingerprints generated on macOS might be the new M1+ laptops: browser-pool tries to fill the default configuration for the platform it runs on, and I think that the new M1 laptops report something other than "macos" in the process info.

barjin commented 2 years ago

browser-pool tries to fill the default configuration

this is true, it happens here, but from what I have seen, Node on M1s still has process.platform === 'darwin'

The FingerprintGenerator first uses HeaderGenerator to create browser headers including a User-Agent string - which it then uses to generate the rest of the fingerprint.

All the parameters passed to the FingerprintGenerator are consumed by the HeaderGenerator - and @AndreyBykov 's User-Agent is macOS-like. This seems like a "problem" in the collected fingerprint data, where this User-Agent / platform combination actually existed (see @Equidem 's response).
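
To illustrate why such a combination can come out of the generator, here is a heavily simplified sketch: it estimates the distribution of navigator.platform conditioned on the OS of the User-Agent by plain counting. This is not the actual Bayesian network, the CollectedRecord shape and function names are hypothetical, and the toy counts are made up purely for illustration.

```typescript
// Hypothetical, reduced view of one collected fingerprint.
interface CollectedRecord {
  uaOs: string;      // OS family implied by the User-Agent
  platform: string;  // navigator.platform value
}

// Estimate P(platform | uaOs) by counting matching records.
function platformGivenUaOs(
  records: CollectedRecord[],
  uaOs: string,
): { [platform: string]: number } {
  const matching = records.filter((r) => r.uaOs === uaOs);
  const counts: { [platform: string]: number } = {};
  for (const r of matching) {
    counts[r.platform] = (counts[r.platform] || 0) + 1;
  }
  const probs: { [platform: string]: number } = {};
  for (const p of Object.keys(counts)) {
    probs[p] = (counts[p] || 0) / matching.length;
  }
  return probs;
}

// Helper that builds `times` copies of the same record.
function repeatRecord(rec: CollectedRecord, times: number): CollectedRecord[] {
  const out: CollectedRecord[] = [];
  for (let i = 0; i < times; i++) out.push(rec);
  return out;
}

// Toy data (made up): one in ten macOS User-Agents reports a Linux platform.
const toyRecords: CollectedRecord[] = [
  ...repeatRecord({ uaOs: "macos", platform: "MacIntel" }, 9),
  ...repeatRecord({ uaOs: "macos", platform: "Linux x86_64" }, 1),
  ...repeatRecord({ uaOs: "windows", platform: "Win32" }, 10),
];

const macPlatforms = platformGivenUaOs(toyRecords, "macos");
```

In this toy data set, macPlatforms assigns a non-zero probability to "Linux x86_64", so a model that faithfully reproduces the data will occasionally emit that combination for a macOS User-Agent.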

Now, a philosophical question - should generating such a fingerprint be considered a problem, given that such combinations actually exist? Discuss. :)

barjin commented 2 years ago

Window resolution - as some (#10) have noticed, responsive websites can change layout based on the generated screen/window size. This can easily confuse the scraper (as it's not ready for a mobile version of the site).

My proposal is to simply preprocess the data before training the ML models on them - perhaps filter out all the weird, small (but still horizontal) screen resolutions? Allowing the user to set minimum resolution could work as well, but (imho) would need bigger changes to the generative-bayesian-network package.

cc #17

petrpatek commented 2 years ago

Yes, we could create some rules to filter the data and make sure we don't have bots and scrapers or any other weird stuff. I think we could do some basic filtering based on deviceMemory and hardwareConcurrency.

barjin commented 2 years ago

Forgot to mention - in the master branch, the training data is now getting preprocessed - this should help at least against weird display sizes such as 800x600. If there are no problems with this (so far, I haven't noticed anything), we can get this into the stable branch.

Equidem commented 2 years ago

@barjin Why don't you do the filtering in the prepareRecords function? It already does some filtering of the data, so it seems like a natural place to concentrate all the filters.

barjin commented 2 years ago

Smart! That's why it's not in the stable version just yet :) will fix that, thanks for noticing

AndreyBykov commented 2 years ago

I guess the "vertical" resolution should have been solved now. Well - I just rewrote the first ABC scraper to Crawlee, and it still happens. I don't have any debug logs or specific fingerprints, but I can see it from the number of results in the output: some pages clearly load this "vertical" layout. I guess I will switch off the fingerprint generator for now :/

B4nan commented 2 years ago

Maybe we should just ditch the injection of resolution (at least by default)? It seems like a weird idea to me, given websites will have different layouts based on it, you just add more randomness to the results.

barjin commented 2 years ago

oh, my bad @AndreyBykov, I was so focused on the 800x600 resolution bug that I didn't account for this. will fix over the weekend tops.

@B4nan true, screen dimensions are not the strongest fingerprint marker either. The current plan (is|was) to generate only >1280x720 landscape resolutions, which shouldn't affect the websites' layouts (1280x720 is Playwright's default resolution IIRC, that might get blocked more often). The fix is simple and should be definitive, but removing it altogether is no big deal either.

barjin commented 2 years ago

@AndreyBykov the latest stable generator version (2.0.5) should generate only landscape >1280x720 screen sizes (if devices: mobile is not selected). Feel free to try it out and leave feedback on how well it works. Thank you and sorry for the wait!
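
The constraint described above can be sketched as a simple predicate over screen sizes. This is only an illustration of the rule, not the generator's actual code: the ScreenSize type, the constant names and the function name are hypothetical, and treating the 1280x720 bound as inclusive is an assumption (the thread writes ">1280x720").

```typescript
// Hypothetical minimal screen description.
interface ScreenSize {
  width: number;
  height: number;
}

// Lower bounds taken from the thread; inclusive by assumption.
const MIN_DESKTOP_WIDTH = 1280;
const MIN_DESKTOP_HEIGHT = 720;

// Accept only landscape screens that are at least 1280x720.
function isAcceptableDesktopScreen(s: ScreenSize): boolean {
  const isLandscape = s.width > s.height;
  return (
    isLandscape &&
    s.width >= MIN_DESKTOP_WIDTH &&
    s.height >= MIN_DESKTOP_HEIGHT
  );
}

// The 600x800 portrait screen from the debug log earlier in the thread.
const loggedScreenAccepted = isAcceptableDesktopScreen({ width: 600, height: 800 });
// A common full-HD landscape screen.
const fullHdAccepted = isAcceptableDesktopScreen({ width: 1920, height: 1080 });
```

Under this rule the portrait 600x800 screen from the earlier log is rejected, while a 1920x1080 screen is accepted. The same predicate could be applied either when filtering training records or when rejecting generated candidates.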

AndreyBykov commented 2 years ago

@barjin So - just ran a test, and it seems like the minimum 1280x720 resolution does the trick. I had only 2 runs, but usually, it was enough, as at least a few pages would be loaded with the portrait viewport.

barjin commented 1 year ago

Closing as solved (since v2.1.0). Sorry for the ping, this issue kinda slipped my attention until now.